Statistics again

[From Rick Marken (2006.11.05.1800)]

···

On Sunday, November 5, 2006, at 12:10 PM, Bill Powers wrote:

Hi, Richard -- (copies to Goldstein and CSGnet)

I feel a great reluctance to use the kinds of tests that psychologists have devised over the years, tests that David Goldstein feels might be useful in evaluating outcomes of therapy. I don't want to be dogmatic about this, so I'm looking (still) for some consensual way to evaluate this approach....

Hi Bill,

Who is Richard and what was the question?

Thanks

Rick
----

Richard S. Marken Consulting
marken@mindreadings.com
Home 310 474-0313
Cell 310 729-1400

Hi, Rick --

[From Rick Marken (2006.11.05.1800)]

Hi, Richard -- (copies to Goldstein and CSGnet)

I feel a great reluctance to use the kinds of tests that psychologists have devised over the years, tests that David Goldstein feels might be useful in evaluating outcomes of therapy. I don't want to be dogmatic about this, so I'm looking (still) for some consensual way to evaluate this approach....

Hi Bill,

Who is Richard and what was the question?

I hope more came through than that one paragraph. Richard is Kennaway, and the post was asking about calculating the chances that one test can predict the characteristics revealed by a second test, given the correlation between the two tests.

Bill

···


Dear Bill and listmates,
I think I found a website that contains articles and programs related to
the questions that you are asking.
Take a look at:
http://www.abdn.ac.uk/~psy086/dept/SingleCaseMethodology.htm
The person who wrote these programs and articles references Payne &
Jones (1957), an article I have that addresses the same issues. The
materials are updates to the Payne and Jones article, and they include
computer programs.
David
David M. Goldstein, Ph.D.

···


[From Bill Powers (2006.11.07.1332 MST)]

Richard Kennaway (20061107.1803 GMT) --

Thanks for including that pdf file of your paper -- I now have it
where I can find it.

Please remind us of the history and fate of this paper. Was it, in
fact, ever published in the "normal" literature? If not, was there
any good reason why not?

In discussions with David Goldstein on these matters, the
conventional uses of statistics have been brought up. Just
considering those concepts, it seems as if one can use tests with
correlations on the order of 0.6 and get at least some ability to
predict. But that conclusion clashes, it seems to me, with the
conclusions of your paper, and not just a little. The difference is
drastic. What's going on here?

I hate to saddle you with doing all the work here, but the fact is
that I am far from capable of coming to reliable conclusions about
these matters. I understand a small part of your paper, and
intuitively grasp most of it, but the rigor is beyond me, and it is
rigor we need here. First we need for your paper to be accepted in
the mathematical community, to remove any arguments about its basic
correctness. Then we need to figure out why those who use
low-correlation statistics still feel that the result of doing so is
worthwhile -- either that, or demonstrate beyond the possibility of
falsification that it never has been worthwhile.

I think you have a book here, one of the more important books one
could imagine being written. It will be, of course, violently
controversial, because if your analysis holds up (I'm confident it
will, but we have to test this) the repercussions will be enormous.
Anything you can do to soften the blow will be to everyone's
advantage, but if there is nothing that can be done, the net gain
will be even larger, though resisted that much more intensely.

I attach a paper David sent me, one that seems to be recognized as a
seminal paper on the uses of statistics in psychology. I'd be
surprised if it's new to you, but the point is that it contains the
manipulations that give me pause and keep me from being sure of my
position. Is the analysis in Payne and Jones wrong, incomplete, or
what? Have they failed to ask the right questions? Or has your
analysis gone (as no doubt some will claim) off on a tangent and
missed simple truths?

I guess the point of all this is that I'm trying not to be a
curmudgeon and reject the statistical approach just because I don't
like it. If I'm wrong I want to know it.

On the other hand, I also want your contributions for the new book,
so if this is all too much of a load, we can put it off for now. But
we really should come back to it when there's time.

Best,

Bill P.

Payne & Jones 1957.pdf (542 KB)

[From Richard Kennaway (20061108.1244 GMT)]

[From Bill Powers (2006.11.07.1332 MST)]
Please remind us of the history and fate of this paper. Was it, in fact, ever published in the "normal" literature? If not, was there any good reason why not?

I never had it published. I tried "Science", but they said (with justification, in hindsight) that it wasn't the sort of paper for them, and I never got round to sending it anywhere else. Statistical methodology isn't my area -- where would be the best place to try? Is there anyone here with a respectable professional knowledge of the subject who might comment on the paper for me?

In discussions with David Goldstein on these matters, the conventional uses of statistics have been brought up. Just considering those concepts, it seems as if one can use tests with correlations on the order of 0.6 and get at least some ability to predict. But that conclusion clashes, it seems to me, with the conclusions of your paper, and not just a little. The difference is drastic. What's going on here?

You get some ability, just not a lot if you need to be right in an individual case. If you're interested primarily in success rate over a large number of individuals, then correlations of 0.6 will give you something. But you won't know which individuals you made the right prediction for.
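
To make that concrete, here is a minimal sketch in Python (mine, not part of the original post), assuming two standardized test scores correlated at 0.6. At the group level the first test shrinks the prediction error only modestly, and for individuals it calls the right side of the median only about 70% of the time:

    import numpy as np

    rng = np.random.default_rng(0)
    r, n = 0.6, 100_000

    x = rng.standard_normal(n)                               # score on test 1
    y = r * x + np.sqrt(1 - r**2) * rng.standard_normal(n)   # test 2, correlated at r

    # Group level: the best linear prediction of y is r*x; the residual
    # spread is sqrt(1 - r^2), about 0.8 -- only a 20% reduction in uncertainty.
    print("residual spread:", np.std(y - r * x))

    # Individual level: how often does test 1 call the correct side of the
    # median for test 2?  About 70%, versus 50% by coin flip.
    print("correct calls:", np.mean((x > 0) == (y > 0)))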

I hate to saddle you with doing all the work here, but the fact is that I am far from capable of coming to reliable conclusions about these matters. I understand a small part of your paper, and intuitively grasp most of it, but the rigor is beyond me, and it is rigor we need here. First we need for your paper to be accepted in the mathematical community, to remove any arguments about its basic correctness. Then we need to figure out why those who use low-correlation statistics still feel that the result of doing so is worthwhile -- either that, or demonstrate beyond the possibility of falsification that it never has been worthwhile.

I think you have a book here, one of the more important books one could imagine being written. It will be, of course, violently controversial, because if your analysis holds up (I'm confident it will, but we have to test this) the repercussions will be enormous. Anything you can do to soften the blow will be to everyone's advantage, but if there is nothing that can be done, the net gain will be even larger, though resisted that much more intensely.

There's Phil Runkel's "Casting Nets", of course. It's a while since I read it, so I don't recall what details it goes into.

I attach a paper David sent me, one that seems to be recognized as a seminal paper on the uses of statistics in psychology. I'd be surprised if it's new to you, but the point is that it contains the manipulations that give me pause and keep me from being sure of my position. Is the analysis in Payne and Jones wrong, incomplete, or what? Have they failed to ask the right questions? Or has your analysis gone (as no doubt some will claim) off on a tangent and missed simple truths?

I haven't seen the paper before. I shall think about it.

-- Richard

···

--
Richard Kennaway, jrk@cmp.uea.ac.uk
School of Computing Sciences,
University of East Anglia, Norwich NR4 7TJ, U.K.

[From Rick Marken (2006.11.09.1020)]

Bill Powers (2006.11.09.0935 MST) --

>Rick Marken (2006.11.08.1020)

>Research aimed at studying groups is what I call policy research. There
>is nothing wrong with this kind of research.

I vote for you here, Rick.

Thanks!

I think that the use of statistical facts in psychotherapy is a prime example of the misuse of population statistics for evaluating individuals.

And I vote for you here!!

I completely agree. Therapy is about individuals. Using statistical facts as a basis for psychotherapy is stereotyping at its worst. If the individual approach of PCT is appropriate anywhere it is certainly in the area of psychotherapy (and personal improvement).

Best

Rick

···

---
Richard S. Marken Consulting
marken@mindreadings.com
Home 310 474-0313
Cell 310 729-1400

[Martin Taylor 2006.11.10.10.38]

[From Rick Marken (2006.11.08.2100)]

Bjorn Simonsen (2006.11.08,24:00 EUST)-

What is a group?

A collection of individuals.

What is a control system?

A collection of input-output links.

Martin

P.S. Add to both descriptions the word "interconnected" or "mutually influencing" or something like that. And then think what such an amendment implies.

[From Rick Marken (2006.11.10.1050)]

Martin Taylor (2006.11.10.10.38)

[From Rick Marken (2006.11.08.2100)]

Bjorn Simonsen (2006.11.08,24:00 EUST)-

What is a group?

A collection of individuals.

What is a control system?

A collection of input-output links.

Martin

P.S. Add to both descriptions the word "interconnected" or "mutually influencing" or something like that. And then think what such an amendment implies.

It's the negative feedback closed loop nature of the interconnectedness of variables in a control system that gives it its special characteristics. I suppose the interconnectedness that exists between individuals (who are themselves control systems) in a group can give such a group special characteristics. But for most practical purposes -- policy purposes -- all we need to know are the simplest things about groups: descriptive characteristics (mean, median, standard deviation of some relevant measure) and measures of the relationship between variables. All the insurance company needs to know, for example, to determine rates for different age groups is how accident rates vary as a function of age. As Bill mentioned, this results in some unfairness at the individual level (so a highly skilled octogenarian driver has to pay rates as high as his incompetent cohorts ;-)) but that's the price one pays for the communal benefit of the insurance.

Best

Rick

···

---

Richard S. Marken Consulting
marken@mindreadings.com
Home 310 474-0313
Cell 310 729-1400

[From Richard Kennaway (2006.11.11.1758 GMT)]

[From Rick Marken (2006.11.08.1020)]
Research aimed at studying groups is what I call policy research. There is nothing wrong with this kind of research. If you do a nice, controlled experiment and find, for example, that kids who view aggressive models on TV are more likely to be aggressive than those who don't, then you know a fact about a statistical relationship at the group level. If you want to reduce aggression in kids, then a good policy would be to reduce kids' ability to see aggressive models on TV. And if the research was done appropriately, then this policy will result in reduced aggression _at the group level_.

That depends very much on what the experiment was. If it merely measured a correlation between exposure to aggressive TV and aggressive behaviour, it would yield no conclusions at all about the effect of reducing exposure to aggressive TV. To demonstrate a causal connection between X and Y, you have to act on the value of X and see if Y also changes. Measuring a large number of (X,Y) pairs "found in nature" yields no information about causality.

For example, I can think up an alternative hypothesis for why there might be a correlation between aggressive behaviour and watching aggressive TV that would predict the intervention to have the opposite effect. Suppose that each person is controlling for a certain amount of aggression (or "excitement", to use a differently loaded word) in their life, getting it from whatever sources are available. Then someone wanting a high amount will be likely to both watch larger amounts of aggressive TV and do larger amounts of aggressive things, creating the observed correlation. Reducing their access to aggressive TV is a disturbance which they will counteract by increasing their consumption of other forms of aggression, such as aggressive behaviour.

Now, I just invented that hypothesis out of thin air, but it is consistent with the hypothetical correlation, yet makes a prediction opposite to the hypothesis that the correlation is a causal link.
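
A toy simulation in Python (my own sketch, assuming a simple proportional controller; all names and units are made up for illustration) shows how that hypothesis behaves: when TV access is cut, the simulated person makes up the difference with aggressive behaviour:

    # Each person controls total "aggression intake" (TV plus behaviour)
    # against a reference level; restricting TV acts as a disturbance.
    def simulate(tv_available, reference=10.0, gain=0.5, steps=200):
        behaviour = 0.0
        for _ in range(steps):
            tv = min(tv_available, reference)   # TV watched, limited by access
            perceived = tv + behaviour          # perceived total aggression intake
            error = reference - perceived       # control error
            behaviour = max(behaviour + gain * error, 0.0)  # act to cancel the error
        return tv, behaviour

    print(simulate(tv_available=8.0))  # ample TV: (8.0, ~2.0) -- little acting out
    print(simulate(tv_available=2.0))  # TV restricted: (2.0, ~8.0) -- behaviour rises

Under this model the intervention raises aggressive behaviour, exactly opposite to the causal reading of the correlation.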

From an earlier paragraph:

Research like this is perfectly good for finding out about groups; it is completely inappropriate for learning about the organization of the behavior of individuals, something Bill Powers showed very nicely in his paper in the 1990 Control Theory issue of the _American Behavioral Scientist_ that I edited.

If you want to cause a group of people to change their behaviour in some way, you have to deal with the ways the individuals are organised. You need not be concerned with the organisation of each specific individual, but you do need to be concerned with how they generally tend to be organised, in order to make predictions about the effects of interventions.

Or as someone once put it, just watching what they are doing does not tell you what they are doing.

-- Richard

···

--
Richard Kennaway, jrk@cmp.uea.ac.uk
School of Computing Sciences,
University of East Anglia, Norwich NR4 7TJ, U.K.

[Martin Taylor 2006.11.11.13.42]

[From Richard Kennaway (2006.11.11.1758 GMT)]

[From Rick Marken (2006.11.08.1020)]
Research aimed at studying groups is what I call policy research. There is nothing wrong with this kind of research. If you do a nice, controlled experiment and find, for example, that kids who view aggressive models on TV are more likely to be aggressive than those who don't, then you know a fact about a statistical relationship at the group level. If you want to reduce aggression in kids, then a good policy would be to reduce kids' ability to see aggressive models on TV. And if the research was done appropriately, then this policy will result in reduced aggression _at the group level_.

That depends very much on what the experiment was. If it merely measured a correlation between exposure to aggressive TV and aggressive behaviour, it would yield no conclusions at all about the effect of reducing exposure to aggressive TV. To demonstrate a causal connection between X and Y, you have to act on the value of X and see if Y also changes. Measuring a large number of (X,Y) pairs "found in nature" yields no information about causality.

Right. I would guess that Rick was taught that in Psych 101. Nevertheless, it is often true in everyday life that correlation exists when there really is a causal connection. For all cases of causal connection there will be correlation, but correlation can also exist when there is no direct causal connection. Quite often the latter signifies that the two variables each have a common influence underlying their variation (example: the density of leaves on trees and the thermometer readings in temperate climates).

That correlation signifies causality is sufficiently often true that a good survival technique is to assume it's true in the absence of evidence to the contrary. If you hear a loud crack when standing under a tree, it's a good idea to jump away, even though the crack might have nothing to do with the tree.

The issue is the same as that of inferring individual characteristics from the characteristics of group data -- the balance of probabilities favours a case that is often not actually true. Bill P. once showed a graphical demonstration of why you really can't rely on the relation between individual and group correlations (quite apart from the statistical problem of inferences from low correlations). It goes like this:

                     x   .
                       .x
                     .     x
               x   .
                 .x
               .     x
         x   .
           .x
         .     x
_________________________________

The lines of "x"s represent the behaviours of different individuals with changes in some variable. If you look at the group data (the line of dots), there is a positive (+1.0) correlation between the two variables, but if you look at the data from any individual, the correlation is negative (-1.0). Such a condition can really happen, but it's rare. On balance, unless there is evidence to the contrary, it's a reasonable bet that any individual is more likely to behave as the group does than that there's no relation between group and individual behaviour.
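
The same picture can be generated numerically. A minimal sketch in Python (my construction, not from the original post), with five hypothetical individuals whose within-person slope is -1 while their operating points climb a positive trend:

    import numpy as np

    all_x, all_y = [], []
    for c in range(5):                          # five hypothetical individuals
        x = c + np.linspace(-0.3, 0.3, 20)      # each spans a narrow range of x
        y = c - (x - c)                         # within-person slope is -1
        all_x.append(x)
        all_y.append(y)
        print(f"individual {c}: r = {np.corrcoef(x, y)[0, 1]:+.2f}")  # -1.00

    x, y = np.concatenate(all_x), np.concatenate(all_y)
    print(f"pooled group:  r = {np.corrcoef(x, y)[0, 1]:+.2f}")       # about +0.97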

Bayesian analysis doesn't really work with "proof". It goes with the balance of probabilities, and one is ALWAYS well advised to deal with the more likely situation, while recognizing the possibility that it isn't the real situation. What you do about it depends on the perceived costs of being right or wrong, in relation to the relative likelihoods of the different hypotheses.

So, even if correlations are low, and account for only a small part of the variance of something, nevertheless it is more likely than not that the two variables are related, and more likely than not that any particular individual will show the same relation between the variables. That "more likely than not" may represent only a 50.5 - 49.5 ratio, but, all else being equal, it's still the better bet. (We base decisions between radically different national policies on lesser ratios, which is probably a very bad idea). Just don't bet your house on the more likely alternative if the other side is offering a doughnut.

When it comes to policies for actions that affect groups composed of individuals, the value of the bets depends entirely on who is perceiving the situation, the individual or the -- for lack of a better example -- insurance company.

Martin

Everybody can recognize a black ball when he sees it.

  This is Phil Runkel replying to Powers, Marken, Simonsen, Taylor, Kennaway, Goldstein, et alii.

  In the 1950s and 60s, the psychological and statistical literatures in the US and several other countries were replete with articles purporting to repair faults or lacunae in the statistical treatment of research data. I refer to journals such as the Psychological Bulletin, the Journal of the American Statistical Association, the Journal of the American Sociological Association, and Educational and Psychological Measurement. I contributed one such article myself (with T.M. Newcomb and J.E.K. Smith in 1957 in the Psychological Bulletin): Estimating interaction effects among overlapping pairs. (Isn't that a sexy title?)

  I gave up scanning those literatures 30 or 40 years ago, but now, thanks to David Goldstein, I find that the faults and lacunae are still being patched up. David kindly put within reach of my fingertips the articles by Crawford and others and by Payne and Jones.

  The overall tenor of what I will say here is what I said in Casting Nets, namely: No matter how carefully you sharpen the teeth of a saw, it will remain a poor instrument for pounding nails.

  The model underlying classical statistical theory is that of white and black balls in an urn. An observation (taking a datum) corresponds to pulling out a ball from the urn. The assumption (rarely mentioned) is that everyone can tell whether the ball is white or black. In this simplest binomial distribution, the ball cannot be green or gray. In a wider distribution of scores, everything gets more complicated, but the assumption is similar: Everyone can tell whether the score is 21, 62, or some other.
  
  In classical statistical theory, no one asks for what the white ball stands. Also within test theory (such as academic tests of knowledge), no one asks for what the score stands; theory focuses on "true" scores, reliability, attenuation in correlations, and similar matters. But of course we go into all those statistics manipulations of the numbers because we care about what they stand for.

  Kennaway's contribution of 7 November gives a nicely concise overview of test theory, which rests on the postulate that an obtained test score is the sum of a real (or "true") score and some error. In test theory (and in most treatments of statistics), the idea is to "account" for "variance." That is, you list all the X-sub-i that vary with Y. And the last term in the equation is always the variations from unknown factors, or "error." In PCT, in contrast, we look for what is constant despite variations in the environmental part of the loop.

  With the attitude of test theory, if you go on to assume that error scores are distributed (over people or over administrations of the test) in certain ways (such as the normal curve) then you can say a lot of things about test scores (not about behavior or control of perception or feelings of depression, etc., but about scores) as long as the assumptions about distributions hold up. It is often very difficult to ascertain whether the assumptions about distributions hold up, so they are usually left to be assumptions.

  In traditional psychology, the matter of "standing for" is called validity. Does the score stand for a certain amount (variance) of what the cover page of the test claims it stands for? Is there even such a thing for the score to stand for? This turns out to be a very complicated question. You will find, for example, about twelve kinds of validity described by D. Brinberg and J.E. McGrath (Validity and the Research Process. Sage, 1988).

  (Much of the complication comes from the assumptions made. Just as Lobachevsky and Riemann made new and startling geometries by giving up a postulate of Euclid's, so Clyde Coombs made a new and refreshing theory of data by giving up some of the assumptions of test theory. In my 1972 book with McGrath, I gave a list in the appendix of some 15 assumptions underlying psychological data. Coombs's book (1964) was "A Theory of Data.")

  But all that thinking and all those thoughts about dealing with data have little to do with what the data might stand for. The best test theory can do with that question is to say that if a score is NOT reliable, then it cannot stand for very much.

  In his contribution of 5 Nov at 12:10, which started this topic, Powers expressed his reluctance to use psychologists' tests for evaluating outcomes of therapy.

  Powers asked whether, if a person is administered a test labeled "depression," it would tell whether the person actually "has" that affliction. That is, what is the validity of tests of depression? Powers also asked whether depression is something that actually exists. If it does not exist, we would like every test to tell us that the person does not suffer from it.

  Neither Powers nor I is saying that it is never helpful to say that someone "is depressed" -- to point with that word to a sort of familiar behavior. Here, we are questioning a putative condition that might be ascertained as readily as with blood pressure (low or high) or a ball from an urn (black or white), a condition that could be ascertained tomorrow or with another person as readily as today with this person. Furthermore, it would not be merely a "sign" of something going on, such as a blush when a person is embarrassed, but it would be a "real thing" that could have causal effects, such as high blood pressure.

  The likelihood of the "existence" of a condition such as depression or schizophrenia or this or that personality disorder (that is, the validity of the test) is in general very low to vanishing. The validity of a test can be no better than its reliability. That is, if you assume that the person "has" the condition every day of the week, but you discover that you cannot count on a test to give the same score from one time to another no matter who is administering or interpreting the test, then you cannot have confidence that the person "has" the claimed condition. In People As Living Things, pp. 357-363, under "Diagnosis," I wrote:

  Despite the ambiguity of the kappa statistic, the principal author of the Diagnostic and Statistical Manual-III, along with two co-authors, wrote ... that a kappa of 0.7 or above indicated good agreement among diagnosticians on whether the patient has a disorder.

  Concerning the large class of categories labeled "Major Mental Disorders," [Kirk and Kutchins] said that "not a single major diagnostic category achieved the .70 standard."

  In those pages, I give several further evidences of the unreliability of psychiatric measures. Also, in pages 298-302, I give some indications of the reliability and validity of personality tests, many of which are similar to some tests used in psychiatric diagnoses. In the studies mentioned there, validity coefficients ranged from .17 to .83, with most falling between .20 and .50. Those statistics are not the end of the complications; I mentioned earlier, for example, McGrath's 12 sorts of validities. I will sum up here with the quotation from Theodore Millon of the Harvard Medical School that I gave on page 367: "Certainly the disorders of personality should not be construed as palpable 'diseases.'"

  But suppose, as Powers asks us to do, that the assumed condition does actually exist. Can a test reveal it? Maybe, partially, but the matter is chancy. On pp. 282-284, 293-295, 301-302, 368, and 457-462 of People As, I describe the wild chanciness of testing generally. One can interpret the question to be asking how we can ascertain the percentage of people who will be correctly diagnosed, or to be asking whether we can correctly diagnose this one person in the therapist's office. I do not think the answers are the same.

  An example of the subtleties in answering the question about percentages appears in calculating the accuracy of a test. You must know the rate of appearance of the condition in the population. Suppose a company puts forward a test and claims an accuracy for it of 99 percent. But suppose that the condition appears in only one of every 1000 persons in the population. If the test makes only one mistake (calling "positive") with every 100 people, then among 1000 people it will call about ten healthy persons "positive" for every one true case, so it will be right only about one time in ten when it calls "positive." The test's rate of success in detecting positive cases will be only ten percent, not 99 percent.
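
  A quick check of that arithmetic in Python (my sketch; the perfect sensitivity is an assumption for simplicity, not from the original post):

    prevalence = 1 / 1000            # one affected person per 1000
    false_positive_rate = 1 / 100    # one mistaken "positive" per 100 healthy people
    sensitivity = 1.0                # assumed: every true case is flagged

    p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
    ppv = sensitivity * prevalence / p_positive
    print(f"positive calls that are correct: {ppv:.1%}")  # about 9% -- one in ten or so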

  In the case of diagnosing a single individual, Powers wants to know whether he or she "actually has it." Well, in the case of calculating rates, we are back to white balls and black balls. But in the case of diagnosing a single individual, we are back to wondering what a black ball stands for. When a physician uses a test that is right in only 50, or 70, or 90 cases in 100, the physician typically goes on looking for more evidence as the treatment continues. One test is not the end of the story. And the physician hopes eventually to see evidence that is as clear as a black ball: bleeding or not, certain germs or not, an invasive growth or not. Investigations of most (all?) psychological conditions, however, cannot arrive at such palpability. The DSM-IV surely represents the psychiatric profession's best attempt at listing evidence for mental conditions, and the palpability there is laughable.

  We are back to validity. The kind of validity most psychologists prefer is "predictive validity." The reasoning goes this way: "We cannot actually see or touch X or Y, but if X is the kind of thing we think it is, it ought to predict Y." So we choose or invent what we think are defensible tests ("operational definitions") of X and Y, collect data, and calculate the correlation. If the correlation is high, we are tempted to conclude that the condition X actually exists.

  I think we are justified in concluding only that the TESTS actually exist. A great deal of psychological research consists of calculating correlations between presumed characteristics of two things, one or both of which remain imaginary. I have done it myself.

  If my memory is right, L.L. Thurstone was the chief inventor of factor analysis, and I think he said, about 1930, "If something exists, it exists in some degree, and can be measured." A lot of psychologists concluded that if they have measured something, it must exist. Tsk tsk.
  
  I cannot think of any way that any single individual can justifiably be claimed to "have" any psychological condition whatsoever. You can certainly say a useful thing such as "She is grieving" or "He hates spinach," but you are not going to do better than that by inventing a test of grieving. By the time you get the test made and assessed for reliability and all that, she will have got over her sorrow. And to say "He is the kind of person who hates spinach" simply leads you into a wild goose chase and all the pitfalls of stereotypy. (A personality test provides you with "kinds of persons.")

  The presumed psychological condition is internal, forever impalpable to outside observers. You can observe that the person has said mean things to his father, or has often refused to speak to his wife, or has chosen certain answers to questions on a test. And you can calculate correlations among those actions. But when you predict that the person will then perform certain further actions, such as beating his wife or going back to jail or watching TV all day long, you will be correct about some persons more often than you would be by flipping a coin, but with other persons you will not. Actual correlations obtained over individuals by researchers, as I mentioned above, are low. And you will certainly have no evidence that some one continuing, reliable, internal condition is causing the actions. Particular actions, as I have said ad nauseam, are shaped both by internal reference signals and by outside opportunities; both those sources of action change.

  Any presumed assessment of these internal something-or-others will sometimes show correlations with other something-or-others, simply because humans do not behave randomly, and we are fairly good at seeing (or inventing) patterns in things. And we often hunt for patterns when they are not necessary, simply because we like to explain things, and we feel that a pattern (such as a correlation) suffices.

  I think we ought to give up inventing internal devils. But I do not think we ought to give up being helpful to one another. And I think the Method of Levels provides a wonderful way of being helpful without having to do diagnosis. That is, MOL does not assume that the therapist must know what is wrong with the client. MOL enables the therapist to throw away the DSM.

  So the best answer to Powers's first question, whether a test can reveal a psychological condition of an individual, whether "real" or imaginary, is: No. Not at present, anyway.

  Powers also asked about assessing change. Since I have said that you cannot be confident that a test measures any psychological condition, I must also say that you cannot be confident of calculating a change in it. Even in calculating changes in test scores (ignoring the question of whether the scores stand for any real thing), the matter is very slippery. After describing some of the statistical investigations, I said on page 71 of "Casting Nets and Testing Specimens": "... if you want an estimate of change that is not biased toward too much or too little, then you must give up any hope of precision and flood your data with at least 50 percent error!"

  That warning applies to averages using tests (measures) of low reliability. But calculating a change of a single score that is assumed to come from a distribution of scores runs into similar difficulties.

  Looking beyond the score to what it might stand for brings further troubles. Many diagnostic tests are scored by comparing the client's score with the scores of a sample of run-of-the-mill or even "normal" persons. Crawford and others (in a paper David Goldstein made available on the internet) described a "statistically significant deficit." This consisted of calculating the number of times in 100 testings of individuals that a score this much lower than the mean (or some other arbitrary point) of a distribution would occur by chance. That tells how to score the test, but it says nothing about what the test measures, if anything.

  And note that when a score denotes a position among a collection of other people, the value of the score depends not just on the individual, but also on all those other people. A Moslem in Saudi Arabia is an ordinary person, but a Moslem in most parts of the USA exhibits "an enduring pattern of inner experience and behavior that deviates markedly from the expectations of the individual's [surrounding] culture," which the DSM-IV says on page 633 is one of the "general diagnostic criteria for a Personality Disorder."

  David Goldstein put two papers onto the CSGnet: (1) Payne and Jones and (2) Crawford and others. I have no quarrel with those papers, nor with others that those others cite. The articles are about the "behavior" of test scores, not about the behavior of people.

  In "People As Living Things," I wrote about personality tests in Chapter 26, about diagnosis in Chapter 31, and mental testing in schools in Chapter 38. I will not repeat all that here, but only this small bit from page 367:

  The question recurs in the literature year after year whether some widely used conceptions of undesirable behavior "exist" outside the heads of the diagnosticians. Theodore Millon (1990) of the Harvard Medical School wrote: "Certainly the disorders of personality should not be construed as palpable 'diseases' .... Unfortunately, most are ... receding all too slowly into the dustbins of history."

  It seems to me, actually, that the concept of validity is not useful in assessing the behavior or condition of an individual. The more relevant concept is simply accuracy or correctness.

  Assessing the presence of cancer in an individual by the use of the CAT-scan, for example, is a matter of examining the resulting pictures to see what they show. The physician, at that point, has no interest in the percentages found by physicians in previous examinations of this or other patients, and no interest in how often the physicians were right or wrong. All the physician needs to know is that (1) there really can be growths in people that can be detected by the CAT-scan when the right chemical has been injected into the blood and (2) the machine was working properly when these pictures were made. In some cases, the growth is clear, large, and obvious, and its continuing existence can be corroborated by another scan before or afterward. In other cases, the pattern in the picture is to some degree ambiguous. But the physician has no need to put a score on the ambiguity. She can choose to start treatment or to wait for a later scan. Reliability and validity in the usual psychological senses do not come into the matter. I know there are subtleties that my example here passes by, but ....

  The physician or the psychotherapist can use the experience of other therapists with other clients as possibilities to think about with the next client, but when the next client appears, the conditions and experiences of those other clients do not tell the therapist anything about THIS client.

  In analogy to the use of the CAT-scan, the psychotherapist wanting to be helpful to a client need have no interest in the percentages of symptoms found by other psychotherapists with other clients, and no interest in whether those other therapists were right or wrong. As Powers said in his contribution of 9 November,

  The fact that other people have had problems that look superficially similar to your client's problem does not improve your knowledge of your client.... And it doesn't help the client, either, to know what the average characteristics of people who resemble him or her are.

  But now I bethink myself that the therapist is sometimes also the researcher (as is the case with David G.). In scrutinizing the CAT-scan or the answers to test questions, the researcher's purpose may not be the therapist's. The therapist may want to judge whether to start treatment, if at all. The researcher may want to tell others whether the diagnostic instrument can deliver information that helps the therapist to take successful action. (Other purposes creep in, too.)

  Well, the CAT-scan does not give merely yes-no information. Nor does it give a score -- an indication of how much of some variable is present. The scan is interpretable because of the therapist's knowledge about how some real things function: the effects on x-rays of the chemical the operator injects into the blood of the person, the functioning of lungs (for example), and the functioning of cells. The scan gives information interpretable in several ways. The score of a test, in contrast, is unidimensional. Most therapists, I suspect, prefer multidimensional diagnoses, and most researchers prefer unidimensional. You cannot calculate a correlation between chunks of multidimensional information.

  And notice that I have been writing as if I accepted the medical model (analogy) for psychological therapy. I do not. I wrote that way so that I could answer questions about diagnostic methods. But the MOL does not require diagnosis. On what variable, then, should one evaluate the effectiveness of the MOL? Well, the variable(s) the person is going to deal with will be unique to the individual. Furthermore, the variables the client cares about after some sessions of MOL are not the variables he or she cared about at the outset. But for the purposes of a researcher, you might pick out something vague like "happiness" or "feeling better" or "I don't worry anymore." If you do, critics are going to cry, "But did you eradicate the psychoschemia?" and "Making people feel better should be merely a side-effect" and "Any client who pays out all that money is naturally going to think he feels better." And so on.

  One of the troubles with the medical model is that it leads people to expect that the therapist should produce results that are beyond the comprehension of the client. ("You had a severe inflammation of the estuary, and I treated it with Sprecklenberger's reagent inserted between the upper and lower fibulatory nerves.") With the MOL, in contrast, the client often (correct me if I am wrong) comes up with an explanation of her states before and afterward that makes good sense to both client and therapist, but not necessarily to other researchers who like to see the percentage of subjects who chose answer number 3 to item 24.

  When you buy a chair and pronounce it comfortable, you are happy, and the salesperson is happy. You do not ask what variable was below the proper level and is now above it. It is socially acceptable for you to have your own reasons for liking the chair; no shame need be attached.

  You might say, "Well, you can't expect the client's explanation to be the right one!" And I answer, "Well, there is no way to know whether the therapist's explanation is the right one, either." Because there is no way for anyone to ascertain the "right" one. There is no CAT machine that reveals what the psychoschemia is doing.

  About change on the part of an individual, I would say: Ignore the changes of other people. Take a series of scores. See whether the pattern of changes makes sense in what else you know about the person's life during that period. But note that I am still talking about SCORES, not behavior. I think clients might be happier not with a test but with their own feelings of change in outlook.

  But that will not satisfy most traditional psychological researchers. Sorry.

  And here are some little comments on some remarks of Powers in his contribution of 13 Nov:

  What you are calling "uncertainty" is what testers call "reliability."

  You said that disorder or unreliability "will be smaller for the panel of experts doing clinical evaluations than it will be for a paper-and-pencil test." That may be true in physical medicine, but it is certainly not true for psychological diagnoses. Raters, no matter how extensive their education, are notoriously unreliable. See also pp. 357-360 of "People As."

  You said one of the main considerations was "the cost of an error in diagnosis in ... degree of error, severity of consequences, and fraction of the population affected." I think the answers will be different in regard to the therapist (dealing with individuals) on the one hand, and to legislators (dealing with the population) on the other. The therapist can afford to deal with the individual, because he or she can continue to work with the client over an extended period of time. The legislator must use the method of casting nets both in diagnosis and in specification of legal action, because the accused goes into the courtroom and out of it, and that's that. But I suppose the matter is more complicated than that. See the horror story about "schizoaffective disorder" on page 359 of People As.

  Finally, I have to say that I find all this difficult to think about.

  And I doubt this will be much help to David G. I think, David, you are in the same fix as Powers, Marken, and others who would like to publish in researchers� journals. The traditional assumptions and standards are simply foreign to thinking about a single individual. You have my sympathy.

  Sorry to be so prolix.

···

Subject: Statistics Again

[From Bjorn Simonsen (2006.11.23,11:20 EUST)]

From Richard Kennaway received (2006.11.23, 10:48 EUST)

(Full text available without subscription.)

http://www.nature.com/nature/journal/v444/n7118/full/444418a.html

Are you sure? I arrive at an access page.

bjorn

[From Fred Nickols (2006.11.23.0834 EST)] --
      

Subject: Statistics Again

  Everybody can recognize a black ball when he sees it.

<snip rest of post>

  Sorry to be so prolix.

No apology necessary for me, Phil. I found it fascinating and personally helpful. Thanks.

Regards,

Fred Nickols
nickols@att.net

···

-------------- Original message ----------------------
From: Philip Runkel <runk@UOREGON.EDU>

[From Rick Marken (2006.11.23.0910)]

Bill Powers (2006.11.23.0830 MST)

If we get hung up on the symptom we will forget to ask what has gone wrong that is leading to these feelings. Clearly, depression is not caused by a lack of electrical shocks given to the brain, or a lack of words from a counsellor or priest, or a lack of some pep-me-up drug (I'm indebted to Tim Carey for those ideas). It is not caused by a gene that says "Now be depressed."

Great way to put it. That Tim Carey is one smart fellow!

Thanks. I'll use this in my Personal Control Seminar in what we jokingly call the Winter quarter out here in LA.

Best

Rick

···

---
Richard S. Marken Consulting
marken@mindreadings.com
Home 310 474-0313
Cell 310 729-1400

Re: Statistics again
[Martin Taylor 2006.11.13.17.58]

[From Bjorn Simonsen (2006.11.23,11:20 EUST)]

From Richard Kennaway received (2006.11.23, 10:48 EUST)

(Full text available without subscription.)

http://www.nature.com/nature/journal/v444/n7118/full/444418a.html

Are you sure? I arrive at an access page.

bjorn

PDF attached. It's not the report itself, just a news item about the report, but it's what Richard pointed to.

Martin

444418a_Happiness.pdf (362 KB)

[From Bill Powers (2006.11.24.0855 MST)]

Bjorn Simonsen (2006.11.24,8:45 EUST) --

I know people will say I am naïve. Well, but I think your comments in "[From Bill Powers (2006.11.23.0830 MST)]" point away from statistics on groups and towards MOL on the individual.

No, only that you haven't considered all the cases where group statistics is the only appropriate measure.

[Tip on English usage: we say "comment ... points" and "comments ... point," just the opposite of what may seem the logical way to match the number of the verb to plurals.]

Yes, therapy is an individual matter. It is a mistake to use group statistics to evaluate individual characteristics. But consider these questions:

···

=================================================================

What monthly premium should an insurance company charge for a $100,000
life insurance contract?

How many newspapers should the proprietor of a street-corner news stand
order each day?

How big should the pumps be in a town’s new water treatment
plant?

How many doses of a bird flu vaccine should be ordered next year by the
U. S. Government?

How long should a traffic light stay red in each direction?

What is the noise level in a current of 1 picoampere? (about 1% of the
mean current).

How many people should be hired next year for the department that repairs
goods returned under warranty?

At what level of the CA125 enzyme found in a blood test should a person
be advised to undergo exploratory surgery for cancer?

==================================================================

Once you get started on this, it becomes clear that group or mass
statistics is used all the time and for very good reasons, and that it is
impossible to get the required information by examining an individual
(person or electron)…

When individuals are being evaluated, the important question is the one
the individual would ask: how will the outcome of this measurement affect
me if I decide to trust it? If there is little harm threatened by an
unneeded treatment or by omitting a treatment, the individual can easily
go along with the judgment. But if a mistake either way can mean poverty,
mutilation, or death, the individual will demand very small uncertainties
in any test. The penalties of false positives and false negatives must be
taken into account.

Of course this is also true of policy decisions based on population
statistics, and in many cases I suppose the cost-benefit analysis is
actually done.

====================================================================

Last point. When the standard deviation of data becomes a small fraction
of the size of the mean value, we begin to think of the data as a set of
measurements with some measurement error. The PCT model can be used to
predict the movements of a person’s hand in a tracking experiment within
well under 15 percent from the first trial. That means that the peak
predicted value is about 6.7 times the standard deviation. This takes us
completely out of the realm of statistics, because the odds that such a
measurement could occur by chance are less than 1 in 200
million.

Statistics is used when the mean value is 2 to 3 times the standard
deviation (p < 0.045 to p < 0.002). It makes sense to speak of the
probabilities in such a case. But when the mean is 4 times the standard
deviation (25% measurement error), the probability of getting that
measure by chance is 0.00006 (6.34E-5) and it gets smaller very rapidly
(data from Handbook of Chemistry and Physics, 43rd edition, page 210).
The following table shows the numbers for deviations equal to or greater
than the given number of standard deviations:

Ratio Deviation       RMS measurement     Probability
to Standard Dev.      error

       1                  100%              0.500
       2                   50%              0.0455
       3                   33%              2.7E-3
       4                   25%              6.34E-5
       5                   20%              5.7E-7
       6                   17%              2.0E-9
       7                   14%              2.6E-12
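
These tail probabilities can be checked directly; a short Python sketch (mine, not part of the original post), using the two-sided normal tail P(|z| >= k) = erfc(k/sqrt(2)):

    import math

    # Probability that a normal deviate falls k or more standard
    # deviations from the mean (two-sided), alongside the RMS error
    # column (100/k percent).
    for k in range(1, 8):
        p = math.erfc(k / math.sqrt(2))
        print(f"{k} sigma   RMS error {100 / k:3.0f}%   P = {p:.3g}")

Rows 2 through 7 agree with the Handbook values quoted above; for a ratio of 1 this formula gives about 0.317 rather than 0.500, so that first entry appears to follow a different convention. By the same formula, the 6.7-standard-deviation figure mentioned above gives about 2.1E-11, comfortably below the quoted 1 in 200 million.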

You can see that an effort to reduce the uncertainty in experimental
measurements will pay off very handsomely. Going from 50%
measurement error down to 25% makes a statistical possibility (p = 0.046,
or 1 part in 22) into just about a certainty (p = 0.000063, or 1 part in
90,000).

Somewhere between an error ratio of 3 and 4 standard deviations, there is
a transition from the world of statistics into the world of measurement.
Once the probability that a given measure is due to chance has dropped
below one part in 10,000, we can stop worrying that the measurement isn’t
real, and start using it the way measurements are used in the hard
sciences. Now only the middle column, the measurement error, is of any
real interest. Going from a measurement error of 50% down to 20% is like
going from one universe into a different one. It’s like making the
transition from superstition to science.

Best,

Bill P.

[From Rick Marken (2006.11.24.1010)]

Bill Powers (2006.11.24.0855 MST)

Yes, therapy is an individual matter. It is a mistake to use group statistics to evaluate individual characteristics. But consider these questions:

What monthly premium should an insurance company charge for a $100,000 life insurance contract?

...
At what level of the CA125 enzyme found in a blood test should a person be advised to undergo exploratory surgery for cancer?

When individuals are being evaluated, the important question is the one the individual would ask: how will the outcome of this measurement affect me if I decide to trust it?...

Last point. When the standard deviation of data becomes a small fraction of the size of the mean value, we begin to think of the data as a set of measurements with some measurement error...

Going from a measurement error of 50% down to 20% is like going from one universe into a different one. It's like making the transition from superstition to science.

Thanks Bill. This was a real "turkey" of a post, in the best Thanksgiving leftovers sense of the word!!

Best

Rick

···

---
Richard S. Marken Consulting
marken@mindreadings.com
Home 310 474-0313
Cell 310 729-1400


···

-----Original Message-----
From: Control Systems Group Network (CSGnet) [mailto:CSGNET@LISTSERV.UIUC.EDU] On Behalf Of Bill Powers
Sent: Friday, November 24, 2006 12:43 PM
To: CSGNET@LISTSERV.UIUC.EDU
Subject: Re: Statistics again

[From Bill Powers (2006.11.24.0855 MST)]

Bill,

Can you clarify the different columns in this table?

What does 'Ratio Deviation to Standard Dev.' mean? Are you talking about the ratio of standard deviation/mean?

What does RMS measurement error refer to? If we had a psychological test with a test-retest correlation of .9, would the 'measurement error' be 1 - r squared, the proportion of variance not accounted for? Therefore, the RMS measurement error would be 1 - .81 = .19?

The probability column, is this the probability of getting a score that is so many standard deviations above the mean?

Thanks,

David

David M. Goldstein, Ph.D.


[From Bill Powers (2006.11.26.1140 MST)]

David Goldstein (2006.11.26) --

What does 'Ratio Deviation to Standard Dev.' mean? Are you talking about the ratio of standard deviation/mean?

I think they mean the size of a deviation from the mean in comparison to the standard deviation. In other words, a value of 2.0 means a deviation of 2.0 standard deviations away from the mean.

What does RMS measurement error refer to?

RMS means Root-Mean-Square: the square root of the sum of the squares of deviations divided by N, or

  RMS = sqrt(sum((x[i] - Xavg)^2) / N)

I think that's just sigma. RMS is what's used in physics and electronics.
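
A one-line check in Python (my sketch, with made-up data) that this RMS deviation is the same number as the population standard deviation:

    import numpy as np

    x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical scores
    rms = np.sqrt(np.sum((x - x.mean()) ** 2) / len(x))     # the formula above
    print(rms, np.std(x))                                   # both print 2.0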

The probability column, is this the probability of getting a score that is so many standard deviations above the mean?

Yes, but "greater than or equal to" that number of standard deviations away from the mean.

Best,

Bill P.

[from Tracy Harms (2006;11,26.20:00 Pacific)]

You are very welcome, Bjørn.

It is worth noting that, after I wrote what you
repeated here, Bill posted a reply that made clear
that it is not the case that Bill is "well aware that
PCT does not contain such a theory within it." His
appraisal seems, instead, to be roughly the opposite
to what I'd said.

I've posted this in order to make it explicit that my
claim no longer stands: Bill does think that PCT
contains an assertion of equivalence between
subjective experience and brain state.

Tracy

···

--- Bjørn Simonsen <bjornsi@BROADPARK.NO> wrote:

[From Bjorn Simonsen (2006.11.24,8:45 EUST)]
...

This mail considers the use of statistics on human
beings, individuals or groups. The mail is written
from a certain point of view, one I partly lean
upon: _Critique of Impure Reason: An Essay on
Neurons, Somatic Markers, and Consciousness_ by
Peter Munz (1999, ISBN:0275963845)

It was Tracy Harms who recommended the book saying:
"Bill Powers is quite inclined to assert an
equivalence between subjective experience and brain
state, but he seems well aware that PCT does not
contain such a theory within it. For an examination
of the differences between these two things, I
highly recommend (the book)"

Thanks to you Tracy Harms.

...
