Individuals and Groups (was Re: public health)

[From Rick Marken (2007.07.16.1255)]

I changed the subject line because I think this is now about something
other than public health issues.

Bill Powers (2007.07.16.0800 MDT)--

> It occurs to me that it might be worthwhile mining the public health data
> bank for other relationships.

That would be fine. I may try. I presented the data I did only because
Martin had mentioned doing such an analysis; so, given the easy access
to data on the internet and my love of spreadsheets, I thought I would
check to see what the relationship was between some of the variables
Martin mentioned.

> I think a truer story about any statistical relationship is obtained by
> calculating the chances that a decision made about an individual on the
> basis of population statistics is incorrect.

That's true only when you are using the group data to make decisions
about individuals, which I think is _always_ wrong no matter what the
calculated odds for the individual. I think group level statistics are
useful only at the group level. If taking a drug improves the health
of 20% of heart patients and hurts the health of 1% then policy makers
(representing the group) have to decide whether they should recommend
(or even require) the drug. But the individuals taking the drug have
no idea whether it will help or hurt them. It's like seat belts, which
probably save the lives of 98% of people wearing them in an accident
and kill 2% _because_ they were wearing them. The policy is made at
the group level because it makes things better for the group and
doesn't seem to create a major inconvenience for most individuals.

> The present discussion is about a set of individual countries and
> their health-care systems.

That's true. It doesn't allow us to say whether a policy will make it
better for any particular country. The relationships observed pertain
to "countries" in general.

> Statistics seems to be used in medicine and psychology mainly
> as a way of salvaging some slight positive effect out of treatments
> that, for any individual, are more often ineffective than effective.

I think this research is perfectly OK as long as the researchers know
that they are dealing with groups. Of course, researchers often don't
know this. Group level data is often used as a way to _study_
individuals. That's when we have a problem. But if a drug (like the
cholesterol drug I'm taking) has some slight positive effect at the
group level and no negative effects at the individual level, then I'm
willing to take it. It would be nice if individuals were not misled
into thinking that doing so will necessarily get good results for
them. It's just like wearing seat belts; it might work and it isn't a
big cost (because I have health insurance) if it doesn't.

> Remember the "coefficient of uselessness"

Yes. I think it should be called the "coefficient of uselessness for
individual decisions". And there already is something called the
standard error of estimate that tells you basically what the
"coefficient of uselessness" tells you.

As far as Richard Kennaway's nice mention of Phil Runkel, I would
point out that Phil's wonderful book on psychological research
(Casting Nets and Testing Specimens) was about using group statistics
properly; it was not about group statistics always being
inappropriate. When you are casting nets (doing what I call policy
research), where the group is the subject of study, then group
statistics are perfectly appropriate. When you are testing specimens,
where you are trying to understand the processes that underlie the
behavior of individuals (which is presumably what psychology is
supposed to be doing) then group statistics are inappropriate at best
and misleading at worst.

Best

Rick


--
Richard S. Marken PhD
Lecturer in Psychology
UCLA
rsmarken@gmail.com

[From Bill Powers (2007.07.16.1730 MDT)]

Rick Marken (2007.07.16.1255)--

> > I think a truer story about any statistical relationship is obtained
> > by calculating the chances that a decision made about an individual
> > on the basis of population statistics is incorrect.
>
> That's true only when you are using the group data to make decisions
> about individuals, which I think is always wrong no matter what the
> calculated odds for the individual. I think group level statistics are
> useful only at the group level. If taking a drug improves the health
> of 20% of heart patients and hurts the health of 1% then policy makers
> (representing the group) have to decide whether they should recommend
> (or even require) the drug. But the individuals taking the drug have
> no idea whether it will help or hurt them.

That's not true, if they have access to the same data. In the particular
case defined, the individuals know that the chances are 5 to 1 against
its helping, and 1 to 100 in favor of its harming them. So it comes down
to how much they can afford to pay for a treatment that will most likely
do nothing for them.

> > The present discussion is about a set of individual countries and
> > their health-care systems.
>
> That's true. It doesn't allow us to say whether a policy will make it
> better for any particular country. The relationships observed pertain
> to "countries" in general.

I thought the point of your discussion was whether or not the United
States should switch to a single-payer health insurance system. You have
apparently been trying to support such a switch on the basis of a
statistical analysis of a group of 121 countries. For example, infant
mortality is worse in the US than in other countries with socialized
medicine. You seem to be predicting that if the US adopted socialized
medicine (which I am all in favor of), infant mortality figures would
improve. However, the correlation on which this prediction of behavior of
an individual country seems to have been based is 0.66, which means that
the correlation with a vector that is independent of (orthogonal to)
infant mortality would be 0.75, yielding a negative predictive value of
-0.14. It is more likely that there is no relationship than it is that
there is a relationship.
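Bill's arithmetic here can be reproduced in a few lines. He never writes his
formula out, but a reading consistent both with this figure and with the PPV
of about .4 that Rick later derives from r = .83 is r^2 - (1 - r^2), the
variance explained minus the variance unexplained; treat that as a hedged
guess rather than Bill's definition:

```python
import math

r = 0.66                      # correlation Bill cites for infant mortality
anti = math.sqrt(1 - r**2)    # correlation with an orthogonal vector
pv = r**2 - (1 - r**2)        # guessed "predictive value": explained minus unexplained variance

print(round(anti, 2))   # 0.75, matching the text
print(round(pv, 2))     # -0.13, close to the -0.14 in the text (rounding)
```

The same formula applied to r = .83 gives 2(.83)^2 - 1 = .38, i.e. Rick's
"about .4", which is what makes this reading plausible.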

I’ll take on Martin Taylor shortly – of course I’m not sure that my
conclusion or my way of reaching it is valid, but if it works out, this
is a rather important point.

> > Statistics seems to be used in medicine and psychology mainly as a
> > way of salvaging some slight positive effect out of treatments that,
> > for any individual, are more often ineffective than effective.
>
> I think this research is perfectly OK as long as the researchers know
> that they are dealing with groups. Of course, researchers often don't
> know this. Group level data is often used as a way to _study_
> individuals. That's when we have a problem. But if a drug (like the
> cholesterol drug I'm taking) has some slight positive effect at the
> group level and no negative effects at the individual level, then I'm
> willing to take it. It would be nice if individuals were not misled
> into thinking that doing so will necessarily get good results for
> them. It's just like wearing seat belts; it might work and it isn't a
> big cost (because I have health insurance) if it doesn't.

Researchers may deal with groups, but patients and doctors deal with
individuals. The basis on which an individual accepts or rejects
treatment is different from the basis on which an insurance company or a
doctor concerned with his track record (rather than the patient before
him) makes decisions. When a person without insurance has to choose
between buying medicine and sending children to college, the
effectiveness of a costly treatment becomes very germane to the decision.

> > Remember the "coefficient of uselessness"
>
> Yes. I think it should be called "coefficient of uselessness for
> individual decisions".

That’s understood. The thrust of my argument concerns using group
statistics to predict the behavior (or response to treatment) of
individuals.

> And there already is something called the standard error of estimate
> that tells you basically what the "coefficient of uselessness" tells
> you.

I haven’t seen that correspondence yet, but you may be right. How is the
standard error of estimate related to sqrt[1 -
sqr(correlation)]?

> As far as Richard Kennaway's nice mention of Phil Runkel, I would
> point out that Phil's wonderful book on psychological research
> (Casting Nets and Testing Specimens) was about using group statistics
> properly; it was not about group statistics always being
> inappropriate. When you are casting nets (doing what I call policy
> research), where the group is the subject of study, then group
> statistics are perfectly appropriate. When you are testing specimens,
> where you are trying to understand the processes that underlie the
> behavior of individuals (which is presumably what psychology is
> supposed to be doing) then group statistics are inappropriate at best
> and misleading at worst.

Yes. So we need some reason other than group statistics about 121 nations
to decide the best course for a single one of those nations – either
that, or we have to find correlations that are much greater than 0.7071
to use as a basis.

Also, we have to remember that lurking in the background there is that
wild card on which is written “All else being equal …”.

Best,

Bill P.

[From Rick Marken (2007.07.16.2150)]

Bill Powers (2007.07.16.1730 MDT)--

Rick Marken (2007.07.16.1255)--

> > If taking a drug improves the health of 20% of heart patients and
> > hurts the health of 1% then policy makers (representing the group)
> > have to decide whether they should recommend (or even require) the
> > drug. But the individuals taking the drug have no idea whether it
> > will help or hurt them.
>
> That's not true, if they have access to the same data.

Yes. Sorry. What I meant is that they can't know whether the drug will
_really_ help or hurt them. All they can know is some estimate of the
probability that it will. And I have no idea what to make of knowing
that a drug has a certain probability of helping me. I take the
cholesterol drug because it causes no harm, it's cheap and it _may_
be better than nothing. But who knows.

> In the particular case defined, the individuals know that the
> chances are 5 to 1 against its helping, and 1 to 100 in favor
> of its harming them. So it comes down to how much they can
> afford to pay for a treatment that will most likely do nothing
> for them.

Sure, this is a game that can be played at the individual level based
on the group data. But it's just a gambling game and it either pays
off or it doesn't. But the group level data show you what you know you
_will_ get (within determinable confidence limits) at the group level.
Survival rates in accidents when people were wearing a seat belt are
known to be much higher (say 80%) than when not wearing them (say
10%). So the data suggest that if everyone wears seat belts the group
level survival rate will be about 80%. So the policy of requiring
seat belts is implemented to get a known group level result, not to
save anyone in particular.

Individuals may feel safer when they wear their seat belts but, in
fact, what happens if they get in an accident is what happens; the
seat belt either saves them, kills them or is irrelevant. We don't
know what will actually happen in the individual case, but the group
data does tell us what we will get (to within a determinable confidence
interval of percentages) at the group level if we require everyone to
wear seat belts; the survival rate of accidents for the group will be
about 80%. That's what statistics is for; determining group
(population) characteristics based on sample data.
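The contrast Rick is drawing between a predictable group rate and an
unpredictable individual outcome is easy to see in a simulation (the 80%
survival figure is his hypothetical, not real crash data):

```python
import random

random.seed(42)
P_SURVIVE = 0.80           # hypothetical survival rate for belted occupants
GROUP = 10_000             # number of accidents at the group level

outcomes = [random.random() < P_SURVIVE for _ in range(GROUP)]
rate = sum(outcomes) / GROUP

# The group-level rate lands near 0.80, within narrow limits...
print(round(rate, 2))
# ...but any single outcome is still just survived-or-not:
print(outcomes[0])
```

Re-running with a different seed moves the group rate by only a fraction of
a percent, while any particular entry of `outcomes` remains a coin with a
weighted but unknowable-in-advance result.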

> I thought the point of your discussion was whether or not the
> United States should switch to a single-payer health insurance
> system.

That was another discussion. The analysis of the relationship between
per capita income, Gini and infant mortality was just a look at the
relationship between these variables. I was actually hoping that
perhaps Gini would turn out to be the best predictor of infant
mortality rate but it turns out to be per capita income.

> You have apparently been trying to support such a switch on the basis
> of a statistical analysis of a group of 121 countries.

This is not the statistical analysis I would use. I would look for a
relationship between type of insurance (pure single payer, hybrid,
completely private, etc) and some measure of performance like infant
mortality/cost per patient. I would only use data that showed a
correlation of close to 1.0 between type of insurance and outcome/cost
ratio.

> For example, infant mortality is worse in the US than in other
> countries with socialized medicine. You seem to be predicting that if
> the US adopted socialized medicine (which I am all in favor of),
> infant mortality figures would improve.

Not from this data I'm not.

> However, the correlation on which this prediction of behavior of an
> individual country seems to have been based is 0.66

Actually, the correlation between log per capita income and infant
mortality is .83. So the PPV is about .4.

> Yes. So we need some reason other than group statistics about 121
> nations to decide the best course for a single one of those nations --
> either that, or we have to find correlations that are much greater
> than 0.7071 to use as a basis.

I agree. The non-statistical reason is simply that it fits my values;
basic health care should be a tax supported public investment, like
roads, police, fire departments, education and parks. But I think it
would be easy to find a correlation close to 1.0 if you look at some
measure of outcome/unit cost in relation to type of health insurance.
I think you will find that in every case the single payer type system
will come out better (in terms of outcome/cost) than a private or
hybrid system. That would be a correlation of 1.0.

Best

Rick


[From Bill Powers (2007.07.17.0740 MDT)]

Rick Marken (2007.07.16.2150)--

> Yes. Sorry. What I meant is that they can't know whether the drug will
> _really_ help or hurt them. All they can know is some estimate of the
> probability that it will.

> Sure, this is a game that can be played at the individual level based
> on the group data. But it's just a gambling game and it either pays
> off or it doesn't.

> But the group level data show you what you know you _will_ get
> (within determinable confidence limits) at the group level.

> Survival rates in accidents when people were wearing a seat belt are
> known to be much higher (say 80%) than when not wearing them (say
> 10%). So the data suggest that if everyone wears seat belts the group
> level survival rate will be about 80%. So the policy of requiring
> seat belts is implemented to get a known group level result, not to
> save anyone in particular.

> This is not the statistical analysis I would use. I would look for a
> relationship between type of insurance (pure single payer, hybrid,
> completely private, etc) and some measure of performance like infant
> mortality/cost per patient. I would only use data that showed a
> correlation of close to 1.0 between type of insurance and outcome/cost
> ratio.

> > However, the correlation on which this prediction of behavior of an
> > individual country seems to have been based is 0.66
>
> Actually, the correlation between log per capita income and infant
> mortality is .83. So the PPV is about .4.

> > Yes. So we need some reason other than group statistics about 121
> > nations to decide the best course for a single one of those nations
> > -- either that, or we have to find correlations that are much
> > greater than 0.7071 to use as a basis.
>
> I agree. The non-statistical reason is simply that it fits my values;
> basic health care should be a tax supported public investment, like
> roads, police, fire departments, education and parks.

> But I think it would be easy to find a correlation close to 1.0 if you
> look at some measure of outcome/unit cost in relation to type of
> health insurance. I think you will find that in every case the single
> payer type system will come out better (in terms of outcome/cost) than
> a private or hybrid system. That would be a correlation of 1.0.

That’s the number I’m trying to get. How do you estimate the probability
that taking the drug will help an individual (and how much), given the
population statistics? It’s extraordinarily hard to find out what that
is. I should think this would be the MAIN thing people would want to know
about a statistical analysis: what is the probability that the regression
equation will predict the outcome for an individual within a specified
amount of error?

Isn’t that one of the gambler’s distorted-probability things that
psychologists study? The way you put it, the odds are 50-50 (it either
pays off or it doesn’t), when in fact the example is a case where they
are 20-80 (for every time it pays off, there are four times when it
doesn’t). Someone wins almost every state lottery. But you have about one
chance in a few million. That’s why the lottery ads show the winners, not
the losers.

At the group level we have only a prediction that N people will get paid
off, not which ones they are. If it doesn’t matter which ones they are,
you can predict with a certain small relative error (the bigger the
group, the smaller). If you’re an individual, it matters which ones get
paid off – you want to know the probability that a particular person
will get paid off, YOU.

That is because the interested parties are concerned only about how many
people die, not which ones. It is worth it, for the Surgeon General, to
recommend a treatment of everyone that saves an additional 5% of the
population. But the chances are 19 out of 20 that it will do an
individual no good. That makes income, the cost of the treatment, and its
side-effects important.

What’s “close to 1?” 0.99? or 0.51? 0.51 is closer to 1 than 0.
I’m trying to find some way of judging what “close to 1”
means.

Outcome/cost ratio is how I originally conceived my “predictive
value” number. The ratio of correlation to anticorrelation is
something like the number of times a prediction will hold true divided by
the number of times it will be false, in a given run of trials. At a
correlation of 0.7071, these numbers should be equal, shouldn’t they? So
as I saw it, at that correlation a prediction of an event will be false
half of the time. I still think that is true, even if the prediction
accounts for half of the variance. It’s impressive to hear that some
variable accounts for half of the variance of a measure. But it’s not so
impressive to hear that predictions based on that variable will be wrong
half of the time (if that’s the truth, which is what I’m trying to find
out).

OK, here’s another way of putting my question. If variable A accounts for
X percent of the variance in B, what is the probability that given A, a
prediction of the occurrence or non-occurrence of an event B will be
correct?

For continuous variables, what is the probability that a prediction of B,
given A, will be within x% of the correct value? That is, if B is
predicted to be 100, what is the probability that the observed value will
be between 100 - x and 100 + x?
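Both of Bill's questions can be answered by simulation under the usual
linear-with-normal-error model. The sketch below is a hypothetical
illustration, not an analysis of any data in this thread: it draws (A, B)
pairs with a chosen correlation, predicts B from A with the regression line,
and counts how often the prediction falls within a chosen tolerance:

```python
import math, random

random.seed(0)
r = 0.7071                 # A "accounts for half the variance" of B
n = 100_000                # simulated cases
tol = 0.5                  # "within x": here, within 0.5 of the true value

hits = 0
for _ in range(n):
    a = random.gauss(0, 1)
    # construct b so that corr(a, b) = r, with unit variances
    b = r * a + math.sqrt(1 - r**2) * random.gauss(0, 1)
    b_pred = r * a         # regression prediction of b given a
    if abs(b - b_pred) <= tol:
        hits += 1

print(hits / n)   # fraction of predictions within the tolerance
```

The fraction depends entirely on the chosen tolerance relative to the
residual spread sqrt(1 - r^2), which is the quantitative version of the
question Bill is posing.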

I was going to suggest that – it looked as if a negative exponential
would fit the data a lot better than a straight line.

Even if that would mean reducing the quality of those services? If we’re
going to bother using numbers and data, shouldn’t we base our perceptions
on what they tell us, instead of “values”? If you already have
selected the only answer you will accept, what’s this spreadsheet
exercise for? Proving that your values are right? What will you do if
they don’t do that?

Well, maybe, but you have to prove that, don’t you? Why not just put your
personal preferences aside and find out what the truth is?

Best.

Bill P.

[From Rick Marken (2007.07.17.0920)]

Bill Powers (2007.07.17.0740 MDT)--

> I should think this would be the MAIN thing people would want to know
> about a statistical analysis: what is the probability that the
> regression equation will predict the outcome for an individual within
> a specified amount of error?

I think you are talking about the standard error of prediction
(s[y.x]). This is the average error in predicting Y values using
predicted Y (Y') values given by the regression equation (Y' = aX+b).
Assuming that the error (Y - Y') of prediction is normally distributed
around the true score, you can use s[y.x] to form a confidence
interval around the Y' score predicted by a particular X score. That's
how you get the probability that the regression equation will predict
the outcome for an individual within a specified amount. So let's say
we find that the equation for predicting infant mortality (Y) from per
capita income (X) is Y' = -.001X+1.01. Given that an individual's
income is $1000 we would predict their probability of having an infant
die at birth is .01 (Y' = .01). If s[y.x] is .001 then the probability
is .95 (given the assumption of a normal distribution of error) that
the actual infant mortality for this person will be between .008 and
.012.
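Rick's worked numbers can be checked directly (all the coefficients are his
illustrative values; 1.96 x .001 puts the 95% interval at about .008 to
.012):

```python
a, b = 1.01, -0.001          # Rick's illustrative regression Y' = -.001X + 1.01
s_yx = 0.001                 # assumed standard error of prediction

x = 1000                     # per capita income
y_pred = b * x + a
lo, hi = y_pred - 1.96 * s_yx, y_pred + 1.96 * s_yx

print(round(y_pred, 3))              # 0.01
print(round(lo, 3), round(hi, 3))    # 0.008 0.012
```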

> Isn't that one of the gambler's distorted-probability things that
> psychologists study?

I don't study it. I'm terrible with probability. I recently
encountered the Monty Hall problem (look it up on Wikipedia) and
flunked completely. I had to write a little program to prove to myself
that the answer to the problem really is correct; and it is!
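Rick doesn't show his program, but a Monty Hall check can be as short as this
sketch: since the host always opens a goat door, switching wins exactly when
the first pick was wrong.

```python
import random

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)    # door hiding the car
        pick = random.randrange(3)   # contestant's first choice
        # host opens a goat door that is neither the pick nor the car,
        # so switching wins exactly when the first pick was wrong
        if switch:
            wins += (pick != car)
        else:
            wins += (pick == car)
    return wins / trials

random.seed(7)
print(play(switch=False))   # about 1/3
print(play(switch=True))    # about 2/3
```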

> If you're an individual, it matters which ones get paid off -- you
> want to know the probability that a particular person will get paid
> off, YOU.

I'm not a gambler. I'm a controller. Not a very good controller, true.
But an even worse gambler. Sure, I do certain things based on my
estimates of chances. But these are imaginations. Sometimes they help;
sometimes not. When things work out it seems to me that it's usually a
result of controlling perceptions, not imaginations.

> That is because the interested parties are concerned only about how
> many people die, not which ones. It is worth it, for the Surgeon
> General, to recommend a treatment of everyone that saves an additional
> 5% of the population. But the chances are 19 out of 20 that it will do
> an individual no good. That makes income, the cost of the treatment,
> and its side-effects important.

Yes. Those are part of the policy selection process.

> > I agree. The non-statistical reason is simply that it fits my values;
> > basic health care should be a tax supported public investment, like
> > roads, police, fire departments, education and parks.
>
> Even if that would mean reducing the quality of those services?

Of course not. If a private company could provide higher quality
roads, police , fire, education, healthcare and parks to the entire
population, then I'd go for it.

> If we're going to bother using numbers and data, shouldn't we base our
> perceptions on what they tell us, instead of "values"?

Not necessarily. I think the data can be used by anyone to support
their values. My exercise in economic analysis and my stint at RAND
convince me that data plays very little role in policy decisions or
beliefs. Data is fun and it's fun to argue, but to imagine that data,
properly analyzed and used, will be decisive is just dreaming.
Ultimately, it's human values that determine policy. The data, for
example, on gun control seems to me to be overwhelming. But people
whose values say that everyone should get to have a gun will find ways
to question the data or find data that seems to support their point.
And the game will go on. The same is true of healthcare. The data in
favor of a single payer system is overwhelming; countries with single
payer systems have better outcomes for less. This is the big result.
But people who don't value healthcare as a "right" can find all kinds
of ways to question the data. There is really no winning this game --
just as there is no way of winning the PCT game with conventional
psychologists. The evidence for PCT is overwhelming but that's not
going to change people who value doing psychology in the conventional
way. That doesn't mean that the data is useless; it just means that
it's a fool's errand to think it will be decisive.

> If you already have selected the only answer you will accept, what's
> this spreadsheet exercise for? Proving that your values are right?
> What will you do if they don't do that?

The spreadsheet exercise was just for fun. It really wasn't even
pertinent to the question of single payer (notice that the variables
were per capita income and Gini). Other exercises (like my tour
through the economic data) were pertinent to specific economic
policies; if they had contradicted my preference for certain policies
I suppose I would have changed my ideas. For example, I think highly
progressive taxation is a good policy. The economic data suggest that
economic growth has been greatest in periods when taxes were high and
progressive. So that was a nice finding. If it had turned out that
growth was actually lower in high tax periods I might have reassessed
my feelings about the importance of growth. But I wouldn't be able to
argue any more for the merits of taxation as a growth promoter;-)

Best regards

Rick


[From Bryan Thalhammer (2007.07.17.1340 CDT)]

P.S.

NEWS: Political Hack and Secretary of the VA Nicholson (former real estate
salesman) has just resigned.

I rest my case, Kenny.

--Bry

[From Bill Powers (2007.07.18.0640 MDT)]

Rick Marken (2007.07.17.0920) --

> I think you are talking about the standard error of prediction
> (s[y.x]). This is the average error in predicting Y values using
> predicted Y (Y') values given by the regression equation (Y' = aX+b).

How do you calculate that? My mathematics manual gives a method but it
applies only to predicting a mean value of Y from a single mean value of
X. Chi-squared looks more suited to testing predictions from a
relationship, but it’s horribly complicated to compute its distribution
and convert that to probabilities. I can’t figure out what chi-squared
means, either, since it involves dividing the square of X - E(X) by the
first power of E(X) (expected value of X). That means that if you scale
up the whole relationship, both X and E(X), chi-squared increases, too,
which makes no sense to me.
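The scaling behavior Bill describes is real but intentional: chi-squared is
defined for counts, and multiplying all the counts by ten means ten times as
much data, so the same proportional deviation from expectation really is
stronger evidence against the expected model. A minimal illustration (the
numbers are invented):

```python
def chi_squared(observed, expected):
    # Pearson's statistic: sum over cells of (O - E)^2 / E
    return sum((o - e)**2 / e for o, e in zip(observed, expected))

obs, exp = [55, 45], [50, 50]
print(chi_squared(obs, exp))                                  # 1.0
print(chi_squared([o * 10 for o in obs], [e * 10 for e in exp]))  # 10.0
```

A 55/45 split in 100 trials is unremarkable; a 550/450 split in 1000 trials
is not, even though the proportions are identical, and the tenfold growth of
the statistic is what records that.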

I once asked David Goldstein how psychologists predict behavior from
regression equations, and his reply was that they don’t use regression
equations that way. So maybe nobody has ever asked the question I’m
asking (and that would seem weird, too).

My mind is tired from all this. I guess there’s no simple
answer.

Best,

Bill P.

···

[From Rick Marken (2007.07.18.1010)]

Bill Powers (2007.07.18.0640 MDT)--

Rick Marken (2007.07.17.0920) --

> > I think you are talking about the standard error of prediction
> > (s[y.x]). This is the average error in predicting Y values using
> > predicted Y (Y') values given by the regression equation (Y' = aX+b).
>
> How do you calculate that?

It's pretty easy. Once you have the regression equation (from the
regression analysis) then you either just read the printout from the
analysis, which often reports the standard error, or you compute Y'
for each X in your data set and then compute
s[y.x] = sqrt[Sum(Y-Y')^2 / N], where N is the number of X,Y data
pairs.
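As a small self-contained program (the data are invented for illustration;
the square root is taken so s[y.x] comes out in the units of Y, and N rather
than N-2 is used in the denominator):

```python
import math

def regress(xs, ys):
    """Least-squares line and standard error of prediction, by hand."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx)**2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx                     # slope
    a = my - b * mx                   # intercept
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    s_yx = math.sqrt(sum(e * e for e in resid) / n)
    return a, b, s_yx

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b, s = regress(xs, ys)
print(round(b, 2), round(a, 2), round(s, 3))   # slope, intercept, s[y.x]
```

Given s[y.x], the 95% interval around any predicted Y' is then
Y' +or- 1.96 * s[y.x], as described below.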

> Chi-squared looks more suited to testing predictions from a relationship,

It's not. It has nothing to do with it.

> I once asked David Goldstein how psychologists predict behavior from
> regression equations, and his reply was that they don't use regression
> equations that way.

I'm sure David didn't say that (or didn't mean it). Psychologists use
regression equations all the time to predict behavior. For example,
regression equations are used regularly to predict college performance
of applicants for college. This is the "selection equation" you've
heard tell of -- and that is so controversial. But admissions offices
still use the output of a regression equation like:

College Performance = b1 * SAT + b2 * Letter of Rec + b3 * Essay ...+
bn * whatever

as at least a partial basis for decisions about college admission. And
the idea is that the equation is a _prediction_ of how the a student
with these scores (SAT, Letter of Rec, Essay, whatever) is likely to
perform in the future (if they were admitted to the college).
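As a sketch, with every coefficient invented purely for illustration (real
selection equations are fit to historical student data, not hand-set like
this):

```python
# Invented weights for a toy "selection equation" predicting college GPA.
WEIGHTS = {"sat": 0.0004, "letters": 0.15, "essay": 0.10}
INTERCEPT = 1.0

def predicted_gpa(applicant):
    # weighted sum of the applicant's scores: Y' = b0 + b1*X1 + b2*X2 + ...
    return INTERCEPT + sum(WEIGHTS[k] * applicant[k] for k in WEIGHTS)

applicant = {"sat": 1400, "letters": 4.0, "essay": 5.0}
print(round(predicted_gpa(applicant), 2))   # 1 + .56 + .60 + .50 = 2.66
```

An admissions office would rank applicants by such a Y' and, per the
discussion above, attach a +or- 1.96 * s[y.x] band to each prediction.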

> So maybe nobody has ever asked the question I'm asking
> (and that would seem weird, too).

> My mind is tired from all this. I guess there's no simple answer.

I think the answer may be simpler than you imagine. You may not like
it but there is an answer, nonetheless.

Best

Rick


It’s pretty easy. Once you have
the regression equation (from the

regression analysis) then you either just read the print out from
the

analysis, which often reports the standard error, or you compute Y’

for each X in your data set and then compute s[y.x] =
[Sum(Y-Y’)^2]/N

where N is the number of X,Y data pairs.
[From Bill Powers (2007.7.18.1255 MDT)]

Rick Marken (2007.07.18.1010)

What printout from what analysis? I’m doing this by hand, pal, I mean I
have to write the program. Talk about “let 'em eat
cake.”

Anyway, that equation works only when I have a bunch of estimates of a
single value of X and a bunch of estimates of a single value of Y, and am
fitting a straight line to the scatter plot. If I have a theoretical
curve, Y = f(X), the sigma of X is zero, I can’t even calculate a
correlation with an experimental data set, because calculating a
correlation involves dividing by the sigma of X.

I am getting the impression that I’m the only one here who wants to
calculate the probability that a prediction of data from a theory will be
wrong by more then some percentage of the prediction. If it’s so easy,
why isn’t someone better at it than I am doing it? Or isn’t this
important to know?

Best,

Bill P.

[Martin Taylor 2007.07.18.16.29]

> [From Bill Powers (2007.7.18.1255 MDT)]
>
> Rick Marken (2007.07.18.1010)
>
> > It's pretty easy. Once you have the regression equation (from the
> > regression analysis) then you either just read the printout from the
> > analysis, which often reports the standard error, or you compute Y'
> > for each X in your data set and then compute
> > s[y.x] = sqrt[Sum(Y-Y')^2 / N], where N is the number of X,Y data
> > pairs.
>
> What printout from what analysis? I'm doing this by hand, pal, I mean
> I have to write the program. Talk about "let 'em eat cake."
>
> Anyway, that equation works only when I have a bunch of estimates of a
> single value of X and a bunch of estimates of a single value of Y, and
> am fitting a straight line to the scatter plot. If I have a theoretical
> curve, Y = f(X), the sigma of X is zero, I can't even calculate a
> correlation with an experimental data set, because calculating a
> correlation involves dividing by the sigma of X.

How could the sigma of X be zero if you are fitting a function? That's a
contradiction in terms, isn't it?

> I am getting the impression that I'm the only one here who wants to
> calculate the probability that a prediction of data from a theory will
> be wrong by more than some percentage of the prediction. If it's so
> easy, why isn't someone better at it than I am doing it? Or isn't this
> important to know?

Didn't you read my exposition [Martin Taylor 2007.07.17.11.08]? I tried to show what you would do both in the simple straight-line regression case and in the more general situation. If you did read it, what parts of the explanation need fixing up?

Why do you have to write the program? Is it as a learning exercise so that you know just what the program is doing? I find that's often a very good way to learn, but if that's why you are doing the program rather than trying to find one in the various public domain libraries, then maybe working from my tutorial might be a way to start.

Rick's answer with equations was perfectly fine for the linear Gaussian case. I tried to show why, so that you could generalize to whatever case interests you.

Martin

[From Rick Marken (2007.07.18.1435)]

Bill Powers (2007.07.18.1255 MDT)--

Rick Marken (2007.07.18.1010)

What printout from what analysis? I'm doing this by hand, pal, I mean I
have to write the program. Talk about "let 'em eat cake."

If you have a spreadsheet you can do the calculations I suggested. The
toughest calculations are those that give you the coefficients of the
best fitting (in the least squared deviations sense) regression line.
But the calculations are actually pretty easy for simple regression
(one predictor).

I was assuming, however, that you have the regression equation and
just wanted to measure the likely accuracy with which that equation
predicts individual scores. That is, given a person's X score, what is
the accuracy of the predicted Y' score for that person using the
regression equation? That accuracy is usually expressed in terms of a
confidence interval around the predicted score, Y'. The assumption is
that the actual Y score for individuals who get a particular X score
is normally distributed around the predicted Y' score. That is, it is
assumed that the true Y score for each individual who gets a
particular X score is normally distributed around the predicted score
and the standard deviation of this normal distribution of Y scores is
s[y.x]. Since 95% of normally distributed scores fall between 1.96 and
-1.96 standard deviations of a normal distribution, a 95%
confidence interval around the predicted Y' for a particular person --
an interval that has a 95% chance of enclosing the individual's true Y
score -- is Y'+or-1.96*s[y.x]. That's how statisticians state the
accuracy of prediction of an individual person's true Y score given
their X score. This approach is based on the _assumption_ that true Y
scores are normally distributed around a predicted Y' for a particular
X score and that the standard deviation of the normal distribution
(s[y.x]) is the same for a Y' predicted from any X.
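The interval Rick describes, Y' ± 1.96·s[y.x], can be sketched directly. The intercept, slope, and s[y.x] below are invented placeholders, not values from the actual data:

```python
# Illustrative regression coefficients and standard error of estimate.
a, b = 0.05, 1.99
s_yx = 0.15

def interval95(x):
    """95% confidence interval for an individual's true Y, given their X."""
    yp = a + b * x
    return (yp - 1.96 * s_yx, yp + 1.96 * s_yx)

lo, hi = interval95(3.0)
```

Note this rests on the two assumptions stated above: normally distributed Y around Y', and the same s[y.x] at every X (homoscedasticity).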

Anyway, that equation works only when I have a bunch of estimates of a
single value of X and a bunch of estimates of a single value of Y,

No, the Y's are for different values of X -- the actual X values
observed -- and the Y values are the Y values that were observed to
have been associated with the appropriate X. I'm attaching a
spreadsheet that includes the calculation of s[y.x] for the log income
(X) vs infant mortality (Y) data.

I am getting the impression that I'm the only one here who wants to
calculate the probability that a prediction of data from a theory will be
wrong by more than some percentage of the prediction. If it's so easy, why
isn't someone better at it than I am doing it? Or isn't this important to
know?

I'm sorry, I'm trying but I guess I don't understand your problem. Maybe
you can give me a more tangible example of what you are looking for.

In the spreadsheet I sent, I show a calculation of s[y.x] as well as a
calculation of s[y] (the standard deviation of Y scores). s[y] is
32.3 and s[y.x] is 18.16. What this means is that if for every X score
(each country's log per capita income) you guess the average Y score
of the set of countries, on average you would miss the country's actual
infant mortality by 32 infant mortality units. If you use the Y' score
instead, based on the regression equation, then on average you would
miss the country's actual infant mortality by only 18 units. So the
regression equation has increased the accuracy of your predictions of
an individual country's behavior. Isn't this close to what you're
after?
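For a least-squares fit, the two statistics Rick reports are tied together by s[y.x] = s[y]·sqrt(1 − r²) (the population form, assuming both use the same N in the denominator). So the pair of values above implies the correlation between log income and infant mortality:

```python
import math

# The two statistics reported for the infant-mortality data.
s_y, s_yx = 32.3, 18.16

# Implied magnitude of the correlation, from s[y.x] = s[y]*sqrt(1 - r^2).
r = math.sqrt(1 - (s_yx / s_y) ** 2)
```

which comes out around 0.83 in magnitude, assuming that identity applies exactly to the spreadsheet's values.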

Best

Rick

SXY.xls (31.5 KB)


---
Richard S. Marken PhD
Lecturer in Psychology
UCLA
rsmarken@gmail.com

[From Mike Acree (2007.07.18.1457 PDT)]

Rick Marken (2007.07.18.1435)--

I was assuming, however, that you have the regression equation and
just wanted to measure the likely accuracy with which that equation
predicts individual scores. That is, given a person's X score, what is
the accuracy of the predicted Y' score for that person using the
regression equation? That accuracy is usually expressed in terms of a
confidence interval around the predicted score, Y'. The assumption is
that the actual Y score for individuals who get a particular X score
is normally distributed around the predicted Y' score. That is, it is
assumed that the true Y score for each individual who gets a
particular X score is normally distributed around the predicted score
and the standard deviation of this normal distribution of Y scores is
s[y.x]. Since 95% of normally distributed scores fall between 1.96 and
-1.96 standard deviations of a normal distribution, a 95%
confidence interval around the predicted Y' for a particular person --
an interval that has a 95% chance of enclosing the individual's true Y
score -- is Y'+or-1.96*s[y.x]. That's how statisticians state the
accuracy of prediction of an individual person's true Y score given
their X score.

Possibly more relevant in this context is the confidence interval for
the true score for an individual, rather than for the predicted score,
assuming a linear relationship between x and y. That interval is based
on s[y.x]*S[j], where

S[j] = sqrt(1 + (1/N) + (((x[j] - M)**2)/(N*s**2))),

where x[j] is the x score of the individual, M is the mean of the x
scores, and s**2 is the variance of the x scores. This interval is
typically much wider than for predicted y.
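Mike's multiplier S[j] is easy to compute once the x statistics are in hand. A sketch with invented x scores (using the population variance for s**2, matching the N*s**2 = Sum(x - M)^2 form of his equation):

```python
import math

# Invented x scores for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
N = len(xs)
M = sum(xs) / N
var = sum((x - M) ** 2 for x in xs) / N   # s**2, variance of the x scores

def S(x_j):
    """Mike's multiplier: sqrt(1 + 1/N + (x_j - M)^2 / (N*s^2))."""
    return math.sqrt(1 + 1 / N + (x_j - M) ** 2 / (N * var))

# The interval is narrowest at the mean of x and widens toward the extremes.
s_mid, s_edge = S(M), S(5.0)
```

Since S[j] > 1 everywhere, the individual interval s[y.x]*S[j] is always wider than the plain Y' ± 1.96·s[y.x] interval, as Mike says.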

But I don't claim that this answers Bill's question, either.

Mike

[From Bill Powers (2007.07.18.1850 MDT)]

Rick Marken (2007.07.18.1435)

Yes, but that interval should change with the size of Y. Suppose you have
a regression line Y = aX + b. If X has some noise in it, or a does, then
the larger Y is the larger the spread in the scatter plot should be (sort
of the reverse of what your graphics showed in the linear spreadsheet).
Like this:

[attached image: 306757a7.jpg -- regression line with flanking lines that diverge as Y grows]

The center line is the regression line, and the two flanking lines show
the same position in the spread relative to the mean at that level, like
the 5% probability point.

But doesn’t that mean missing by the same number of units for countries
with low mortalities and high mortalities? For example, Norway’s actual
infant mortality is 3.64, while the value of Y’ for that entry is -9.7:
an error of -13.34 (within the 18 units you mention) – but a
proportional error of 366%. Not to mention a nonsense number – Norway’s
children do not become more numerous as they get older.

I think we need a column for 100*(Y - Y’)/Y’ – the percentage deviation
of the actual mortality from the predicted mortality (predicted by the
regression equation). By sorting on this column you can easily show the
range of accuracies of prediction. That is the sort of thing I’m looking
for. Can you add that?
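The column Bill asks for, 100*(Y - Y')/Y' sorted to expose the range of prediction accuracy, might look like this in code form. The rows are invented (country, Y, Y') triples, not the real spreadsheet data:

```python
# Invented rows: (country, actual Y, predicted Y').
rows = [("A", 5.0, 4.0), ("B", 50.0, 40.0), ("C", 10.0, 20.0)]

# Percentage deviation of actual from predicted, sorted for easy scanning.
pct_dev = sorted(
    ((name, 100 * (y - yp) / yp) for name, y, yp in rows),
    key=lambda t: t[1],
)
```

In a spreadsheet the same thing is one formula column plus a sort on that column.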

Best,

Bill P.

[From Rick Marken (2007.07.18.1240)]

Bill Powers (2007.07.18.1850 MDT)

Yes, but that interval should change with the size of Y.

I agree.

I think we need a column for 100*(Y - Y’)/Y’ – the percentage deviation
of the actual mortality from the predicted mortality (predicted by the
regression equation). By sorting on this column you can easily show the
range of accuracies of prediction. That is the sort of thing I’m looking
for. Can you add that?

Yes, I computed the percentage deviation (as above) and correlated it with infant mortality. The correlation is -.13 meaning that the larger the infant mortality, the smaller the % deviation of Y’ from actual infant mortality, Y (as you had predicted and as I would expect as well). But the correlation is surprisingly small.

Best

Rick

SXY2.xls (34.5 KB)


[From Bill Powers (2007.07.19.1415 MDT)]

Rick Marken (2007.07.18.1240) –

I think we need a column for 100*(Y - Y’)/Y’ – the percentage
deviation of the actual mortality from the predicted mortality (predicted
by the regression equation). By sorting on this column you can easily
show the range of accuracies of prediction. That is the sort of thing I’m
looking for. Can you add that?

Yes, I computed the percentage deviation (as above) and correlated it
with infant mortality. The correlation is -.13 meaning that the larger
the infant mortality, the smaller the % deviation of Y’ from actual
infant mortality, Y (as you had predicted and as I would expect as well).
But the correlation is surprisingly small.

I think I said that backward. It should be the deviation of the predicted
from the actual, divided by the actual – divide by B2 instead of
D2. In other words, how far off is the prediction as a percentage
of the actual value?
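The difference between the two denominators is easy to see with the Norway numbers Bill quoted earlier (actual Y = 3.64, predicted Y' = -9.7):

```python
# Norway's values from the earlier post.
Y, Yp = 3.64, -9.7

pct_of_predicted = 100 * (Y - Yp) / Yp   # the version first computed
pct_of_actual = 100 * (Yp - Y) / Y       # Bill's corrected version
```

Dividing by the actual value recovers the roughly 366% proportional error mentioned before; dividing by the (negative, nonsensical) predicted value gives a quite different number, which is why the choice of denominator matters.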

A correlation of -0.13 is “no relationship” in my world.

Best,

Bill P.
