Everybody can recognize a black ball when he sees it.
This is Phil Runkel replying to Powers, Marken, Simonson, Taylor, Kennaway, Goldstein, et alii.
In the 1950s and 60s, the psychological and statistical literatures in the US and several other countries were replete with articles purporting to repair faults or lacunae in the statistical treatment of research data. I refer to journals such as the Psychological Bulletin, the Journal of the American Statistical Association, the Journal of the American Sociological Association, and Educational and Psychological Measurement. I contributed one such article myself (with T.M. Newcomb and J.E.K. Smith in 1957 in the Psychological Bulletin): Estimating interaction effects among overlapping pairs. (Isn't that a sexy title?)
I gave up scanning those literatures 30 or 40 years ago, but now, thanks to David Goldstein, I find that the faults and lacunae are still being patched up. David kindly put within reach of my fingertips the articles by Crawford and others and by Payne and Jones.
The overall tenor of what I will say here is what I said in Casting Nets, namely: No matter how carefully you sharpen the teeth of a saw, it will remain a poor instrument for pounding nails.
The model underlying classical statistical theory is that of white and black balls in an urn. An observation (taking a datum) corresponds to pulling out a ball from the urn. The assumption (rarely mentioned) is that everyone can tell whether the ball is white or black. In this simplest binomial distribution, the ball cannot be green or gray. In a wider distribution of scores, everything gets more complicated, but the assumption is similar: Everyone can tell whether the score is 21, 62, or some other.
In classical statistical theory, no one asks for what the white ball stands. Also within test theory (such as academic tests of knowledge), no one asks for what the score stands; theory focuses on "true" scores, reliability, attenuation in correlations, and similar matters. But of course we go into all those statistical manipulations of the numbers because we care about what they stand for.
Kennaway's contribution of 7 November gives a nicely concise overview of test theory, which rests on the postulate that an obtained test score is the sum of a real (or "true") score and some error. In test theory (and in most treatments of statistics), the idea is to "account" for "variance." That is, you list all the X-sub-i that vary with Y. And the last term in the equation is always the variation from unknown factors, or "error." In PCT, in contrast, we look for what is constant despite variations in the environmental part of the loop.
With the attitude of test theory, if you go on to assume that error scores are distributed (over people or over administrations of the test) in certain ways (such as the normal curve) then you can say a lot of things about test scores (not about behavior or control of perception or feelings of depression, etc., but about scores) as long as the assumptions about distributions hold up. It is often very difficult to ascertain whether the assumptions about distributions hold up, so they are usually left to be assumptions.
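For any reader who has not seen the test-theory postulate in action, here is a toy sketch of it (the numbers are my own invention, not from any of the papers under discussion): observed score = true score + error, with the variances adding up, and "reliability" being the share of observed variance that is true-score variance.

```python
import random
from statistics import pvariance

random.seed(0)

# Invented numbers: true scores with SD 10, normally distributed error with SD 5.
true_scores = [random.gauss(50, 10) for _ in range(100_000)]
observed = [t + random.gauss(0, 5) for t in true_scores]

var_true = pvariance(true_scores)   # close to 100
var_obs = pvariance(observed)       # close to 125 = 100 (true) + 25 (error)

# In this model, "reliability" is the fraction of observed variance
# that is true-score variance: about 100/125 = 0.8.
reliability = var_true / var_obs
```

Notice that everything the simulation tells you is about the scores and the assumed distributions; it says nothing at all about what, if anything, the scores stand for.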
In traditional psychology, the matter of "standing for" is called validity. Does the score stand for a certain amount (variance) of what the cover page of the test claims it stands for? Is there even such a thing for the score to stand for? This turns out to be a very complicated question. You will find, for example, about twelve kinds of validity described by D. Brinberg and J.E. McGrath (Validity and the Research Process. Sage, 1988).
(Much of the complication comes from the assumptions made. Just as Lobachevsky and Riemann made new and startling geometries by giving up a postulate of Euclid's, so Clyde Coombs made a new and refreshing theory of data by giving up some of the assumptions of test theory. In my 1972 book with McGrath, I gave a list in the appendix of some 15 assumptions underlying psychological data. Coombs's book (1964) was "A Theory of Data.")
But all that thinking and all those thoughts about dealing with data have little to do with what the data might stand for. The best test theory can do with that question is to say that if a score is NOT reliable, then it cannot stand for very much.
In his contribution of 5 Nov at 12:10, which started this topic, Powers expressed his reluctance to use psychologists' tests for evaluating outcomes of therapy.
Powers asked whether, if a person is administered a test labeled "depression," it would tell whether the person actually "has" that affliction. That is, what is the validity of tests of depression? Powers also asked whether depression is something that actually exists. If it does not exist, we would like every test to tell us that the person does not suffer from it.
Neither Powers nor I is saying that it is never helpful to say that someone "is depressed" -- to point with that word to a sort of familiar behavior. Here, we are questioning a putative condition that might be ascertained as readily as blood pressure (low or high) or a ball from an urn (black or white), a condition that could be ascertained tomorrow or with another person as readily as today with this person. Furthermore, it would not be merely a "sign" of something going on, such as a blush when a person is embarrassed, but it would be a "real thing" that could have causal effects, such as high blood pressure.
The likelihood of the "existence" of a condition such as depression or schizophrenia or this or that personality disorder (that is, the validity of the test) is in general very low to vanishing. The validity of a test can be no better than its reliability. That is, if you assume that the person "has" the condition every day of the week, but you discover that you cannot count on a test to give the same score from one time to another no matter who is administering or interpreting the test, then you cannot have confidence that the person "has" the claimed condition. In People As Living Things, pp. 357-363, under "Diagnosis," I wrote:
Despite the ambiguity of the kappa statistic, the principal author of the Diagnostic and Statistical Manual-III, along with two co-authors, wrote ... that a kappa of 0.7 or above indicated good agreement among diagnosticians on whether the patient has a disorder.
Concerning the large class of categories labeled "Major Mental Disorders," [Kirk and Kutchins] said that "not a single major diagnostic category achieved the .70 standard."
In those pages, I give several further evidences of the unreliability of psychiatric measures. Also, on pages 298-302, I give some indications of the reliability and validity of personality tests, many of which are similar to some tests used in psychiatric diagnoses. In the studies mentioned there, validity coefficients ranged from .17 to .83, with most falling between .20 and .50. Those statistics are not the end of the complications; I mentioned earlier, for example, Brinberg and McGrath's twelve sorts of validity. I will sum up here with the quotation from Theodore Millon of the Harvard Medical School that I gave on page 367: "Certainly the disorders of personality should not be construed as palpable 'diseases.'"
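For readers who have not met it, Cohen's kappa is agreement between two raters corrected for the agreement they would reach by chance. A minimal sketch, with an invented pair of clinicians and invented diagnoses, shows how raw agreement of .75 can shrink to a kappa of only .50:

```python
def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: proportion of cases on which the raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal rates, summed over labels.
    expected = sum((rater_a.count(lbl) / n) * (rater_b.count(lbl) / n)
                   for lbl in labels)
    return (observed - expected) / (1 - expected)

# Invented example: two clinicians label the same four patients.
a = ["depressed", "depressed", "not", "not"]
b = ["depressed", "not", "not", "not"]
kappa = cohens_kappa(a, b)  # raw agreement 0.75, kappa only 0.50
```

So even the .70 standard mentioned above is a considerably weaker claim than "the diagnosticians agree 70 percent of the time."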
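The textbook ceiling mentioned above -- validity can be no better than reliability -- is simple enough to compute. In classical test theory, an observed validity correlation cannot exceed the square root of the product of the two reliabilities; the numbers below are made up for illustration:

```python
import math

def max_validity(rel_test, rel_criterion=1.0):
    """Classical-test-theory ceiling on an observed validity correlation:
    the square root of the product of the two reliabilities."""
    return math.sqrt(rel_test * rel_criterion)

# A test that agrees with itself only to the tune of reliability .49
# cannot correlate better than .70 with anything, even a perfectly
# measured criterion.
ceiling = max_validity(0.49)
```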
But suppose, as Powers asks us to do, that the assumed condition does actually exist. Can a test reveal it? Maybe, partially, but the matter is chancy. On pp. 282-284, 293-295, 301-302, 368, and 457-462 of People As, I describe the wild chanciness of testing generally. One can interpret the question to be asking how we can ascertain the percentage of people who will be correctly diagnosed, or to be asking whether we can correctly diagnose this one person in the therapist's office. I do not think the answers are the same.
An example of the subtleties in answering the question about percentages appears in calculating the accuracy of a test.
You must know the rate of appearance of the condition in the population. Suppose a company puts forward a test and claims an accuracy for it of 99 percent. But suppose that the condition appears in only one of every 1000 persons in the population. If the test makes one mistake (falsely calling "positive") with every 100 people, then among every 1000 people tested it will raise about ten false alarms for the one real case. Only about one in ten of its "positive" calls will be correct -- about ten percent, not 99 percent.
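The arithmetic in that paragraph is Bayes' rule. With the stated numbers -- one case per 1000 people, one false alarm per 100 tests -- plus one assumption of my own (that the test never misses a real case), the calculation runs:

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """Probability that a person who tests positive actually has the condition."""
    true_pos = sensitivity * prevalence
    false_pos = false_positive_rate * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Numbers from the text: condition in 1 of 1000 people, one false alarm
# per 100 tests; perfect sensitivity is my simplifying assumption.
ppv = positive_predictive_value(prevalence=0.001, sensitivity=1.0,
                                false_positive_rate=0.01)
# ppv comes to about 0.09: roughly one "positive" call in ten is correct.
```

Lowering the assumed sensitivity only makes the picture worse.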
In the case of diagnosing a single individual, Powers wants to know whether he or she "actually has it." Well, in the case of calculating rates, we are back to white balls and black balls. But in the case of diagnosing a single individual, we are back to wondering what a black ball stands for. When a physician uses a test that is right in only 50, or 70, or 90 cases in 100, the physician typically goes on looking for more evidence as the treatment continues. One test is not the end of the story. And the physician hopes eventually to see evidence that is as clear as a black ball: bleeding or not, certain germs or not, an invasive growth or not. Investigations of most (all?) psychological conditions, however, cannot arrive at such palpability. The DSM-IV surely represents the psychiatric profession's best attempt at listing evidence for mental conditions, and the palpability there is laughable.
We are back to validity. The kind of validity most psychologists prefer is "predictive validity." The reasoning goes this way: "We cannot actually see or touch X or Y, but if X is the kind of thing we think it is, it ought to predict Y." So we choose or invent what we think are defensible tests ("operational definitions") of X and Y, collect data, and calculate the correlation. If the correlation is high, we are tempted to conclude that the condition X actually exists.
I think we are justified in concluding only that the TESTS actually exist. A great deal of psychological research consists of calculating correlations between presumed characteristics of two things, one or both of which remain imaginary. I have done it myself.
If my memory is right, L.L. Thurstone was the chief inventor of factor analysis, and I think he said, about 1930, "If something exists, it exists in some degree, and can be measured." A lot of psychologists concluded that if they have measured something, it must exist. Tsk tsk.
I cannot think of any way that any single individual can justifiably be claimed to "have" any psychological condition whatsoever. You can certainly say a useful thing such as "She is grieving" or "He hates spinach," but you are not going to do better than that by inventing a test of grieving. By the time you get the test made and assessed for reliability and all that, she will have got over her sorrow. And to say "He is the kind of person who hates spinach" simply leads you into a wild goose chase and all the pitfalls of stereotypy. (A personality test provides you with "kinds of persons.")
The presumed psychological condition is internal, forever impalpable to outside observers. You can observe that the person has said mean things to his father, or has often refused to speak to his wife, or has chosen certain answers to questions on a test. And you can calculate correlations among those actions. But when you predict that the person will then perform certain further actions, such as beating his wife or going back to jail or watching TV all day long, you will be correct about some persons more often than you would be by flipping a coin, but with other persons you will not. Actual correlations obtained over individuals by researchers, as I mentioned above, are low. And you will certainly have no evidence that some one continuing, reliable, internal condition is causing the actions. Particular actions, as I have said ad nauseam, are shaped both by internal reference signals and by outside opportunities; both those sources of action change.
Any presumed assessment of these internal something-or-others will sometimes show correlations with other something-or-others, simply because humans do not behave randomly, and we are fairly good at seeing (or inventing) patterns in things. And we often hunt for patterns when they are not necessary, simply because we like to explain things, and we feel that a pattern (such as a correlation) suffices.
I think we ought to give up inventing internal devils. But I do not think we ought to give up being helpful to one another. And I think the Method of Levels provides a wonderful way of being helpful without having to do diagnosis. That is, MOL does not assume that the therapist must know what is wrong with the client. MOL enables the therapist to throw away the DSM.
So the best answer to Powers's first question, whether a test can reveal a psychological condition of an individual, whether "real" or imaginary, is: No. Not at present, anyway.
Powers also asked about assessing change. Since I have said that you cannot be confident that a test measures any psychological condition, I must also say that you cannot be confident of calculating a change in it. Even in calculating changes in test scores (ignoring the question of whether the scores stand for any real thing), the matter is very slippery. After describing some of the statistical investigations, I said on page 71 of "Casting Nets and Testing Specimens": "... if you want an estimate of change that is not biased toward too much or too little, then you must give up any hope of precision and flood your data with at least 50 percent error!"
That warning applies to averages using tests (measures) of low reliability. But calculating a change of a single score that is assumed to come from a distribution of scores runs into similar difficulties.
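One standard way to see the slipperiness is the classical formula for the reliability of a simple difference score. With made-up but plausible numbers, two quite reliable testings yield a change score of very low reliability:

```python
def difference_reliability(rel_pre, rel_post, r_pre_post):
    """Classical reliability of a simple difference score (post - pre),
    assuming equal variances on the two occasions."""
    return ((rel_pre + rel_post) / 2 - r_pre_post) / (1 - r_pre_post)

# Invented values: each testing is .80 reliable, and the two
# occasions correlate .70 with each other.
rel_change = difference_reliability(0.80, 0.80, 0.70)  # only about .33
```

The very stability that makes a test look respectable (the .70 correlation between occasions) is what eats up the reliability of the change score.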
Looking beyond the score to what it might stand for brings further troubles. Many diagnostic tests are scored by comparing the client's score with the scores of a sample of run-of-the-mill or even "normal" persons. Crawford and others (in a paper David Goldstein made available on the internet) described a "statistically significant deficit." This consisted of calculating the number of times in 100 testings of individuals that a score this much lower than the mean (or some other arbitrary point) of a distribution would occur by chance. That tells how to score the test, but it says nothing about what the test measures, if anything.
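The simplest version of that calculation (the Crawford papers refine it for small normative samples) is just a z-score looked up in the normal curve. The client, scale, and numbers below are all invented for illustration:

```python
from statistics import NormalDist

def percent_scoring_lower(score, norm_mean, norm_sd):
    """Percent of the normative distribution falling below the client's score."""
    z = (score - norm_mean) / norm_sd
    return 100 * NormalDist().cdf(z)

# Invented example: client scores 70 on a scale normed at mean 100, SD 15.
rarity = percent_scoring_lower(70, 100, 15)  # about 2.3 percent
```

Note that every quantity in the function describes the normative sample and the score; none describes the client's condition.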
And note that when a score denotes a position among a collection of other people, the value of the score depends not just on the individual, but also on all those other people. A Moslem in Saudi Arabia is an ordinary person, but a Moslem in most parts of the USA exhibits "an enduring pattern of inner experience and behavior that deviates markedly from the expectations of the individual's [surrounding] culture," which the DSM-IV says on page 633 is one of the "general diagnostic criteria for a Personality Disorder."
David Goldstein put two papers onto the CSGnet: (1) Payne and Jones and (2) Crawford and others. I have no quarrel with those papers, nor with the others that those papers cite. The articles are about the "behavior" of test scores, not about the behavior of people.
In "People As Living Things," I wrote about personality tests in Chapter 26, about diagnosis in Chapter 31, and about mental testing in schools in Chapter 38. I will not repeat all that here, but only this small bit from page 367:
The question recurs in the literature year after year whether some widely used conceptions of undesirable behavior "exist" outside the heads of the diagnosticians. Theodore Millon (1990) of the Harvard Medical School wrote: "Certainly the disorders of personality should not be construed as palpable 'diseases' .... Unfortunately, most are ... receding all too slowly into the dustbins of history."
It seems to me, actually, that the concept of validity is not useful in assessing the behavior or condition of an individual. The more relevant concept is simply accuracy or correctness.
Assessing the presence of cancer in an individual by the use of the CAT-scan, for example, is a matter of examining the resulting pictures to see what they show. The physician, at that point, has no interest in the percentages found by physicians in previous examinations of this or other patients, and no interest in how often the physicians were right or wrong. All the physician needs to know is that (1) there really can be growths in people that can be detected by the CAT-scan when the right chemical has been injected into the blood and (2) the machine was working properly when these pictures were made. In some cases, the growth is clear, large, and obvious, and its continuing existence can be corroborated by another scan before or afterward. In other cases, the pattern in the picture is to some degree ambiguous. But the physician has no need to put a score on the ambiguity. She can choose to start treatment or to wait for a later scan. Reliability and validity in the usual psychological senses do not come into the matter. I know there are subtleties that my example here passes by, but ....
The physician or the psychotherapist can use the experience of other therapists with other clients as possibilities to think about with the next client, but when the next client appears, the conditions and experiences of those other clients do not tell the therapist anything about THIS client.
In analogy to the use of the CAT-scan, the psychotherapist wanting to be helpful to a client need have no interest in the percentages of symptoms found by other psychotherapists with other clients, and no interest in whether those other therapists were right or wrong. As Powers said in his contribution of 9 November,
The fact that other people have had problems that look superficially similar to your client's problem does not improve your knowledge of your client.... And it doesn't help the client, either, to know what the average characteristics of people who resemble him or her are.
But now I bethink myself that the therapist is sometimes also the researcher (as is the case with David G.). In scrutinizing the CAT-scan or the answers to test questions, the researcherï¿½s purpose may not be the therapistï¿½s. The therapist may want to judge whether to start treatment, if at all. The researcher may want to tell others whether the diagnostic instrument can deliver information that helps the therapist to take successful action. (Other purposes creep in, too.)
Well, the CAT-scan does not give merely yes-no information. Nor does it give a score -- an indication of how much of some variable is present. The scan is interpretable because of the therapist's knowledge about how some real things function: the effects on x-rays of the chemical the operator injects into the blood of the person, the functioning of lungs (for example), and the functioning of cells. The scan gives information interpretable in several ways. The score of a test, in contrast, is unidimensional. Most therapists, I suspect, prefer multidimensional diagnoses, and most researchers prefer unidimensional. You cannot calculate a correlation between chunks of multidimensional information.
And notice that I have been writing as if I accepted the medical model (analogy) for psychological therapy. I do not. I wrote that way so that I could answer questions about diagnostic methods. But the MOL does not require diagnosis. On what variable, then, should one evaluate the effectiveness of the MOL? Well, the variable(s) the person is going to deal with will be unique to the individual. Furthermore, the variables the client cares about after some sessions of MOL are not the variables he or she cared about at the outset. But for the purposes of a researcher, you might pick out something vague like "happiness" or "feeling better" or "I don't worry anymore." If you do, critics are going to cry, "But did you eradicate the psychoschemia?" and "Making people feel better should be merely a side-effect" and "Any client who pays out all that money is naturally going to think he feels better." And so on.
One of the troubles with the medical model is that it leads people to expect that the therapist should produce results that are beyond the comprehension of the client. ("You had a severe inflammation of the estuary, and I treated it with Sprecklenberger's reagent inserted between the upper and lower fibulatory nerves.") With the MOL, in contrast, the client often (correct me if I am wrong) comes up with an explanation of her states before and afterward that makes good sense to both client and therapist, but not necessarily to other researchers who like to see the percentage of subjects who chose answer number 3 to item 24.
When you buy a chair and pronounce it comfortable, you are happy, and the salesperson is happy. You do not ask what variable was below the proper level and is now above it. It is socially acceptable for you to have your own reasons for liking the chair; no shame need be attached.
You might say, "Well, you can't expect the client's explanation to be the right one!" And I answer, "Well, there is no way to know whether the therapist's explanation is the right one, either." Because there is no way for anyone to ascertain the "right" one. There is no CAT machine that reveals what the psychoschemia is doing.
About change on the part of an individual, I would say: Ignore the changes of other people. Take a series of scores. See whether the pattern of changes makes sense in light of what else you know about the person's life during that period. But note that I am still talking about SCORES, not behavior. I think clients might be happier not with a test but with their own feelings of change in outlook.
But that will not satisfy most traditional psychological researchers. Sorry.
And here are some little comments on some remarks of Powers in his contribution of 13 Nov:
What you are calling "uncertainty" is what testers call "reliability."
You said that disorder or unreliability "will be smaller for the panel of experts doing clinical evaluations than it will be for a paper-and-pencil test." That may be true in physical medicine, but it is certainly not true for psychological diagnoses. Raters, no matter how extensive their education, are notoriously unreliable. See also pp. 357-360 of "People As."
You said one of the main considerations was "the cost of an error in diagnosis in ... degree of error, severity of consequences, and fraction of the population affected." I think the answers will be different in regard to the therapist (dealing with individuals) on the one hand, and to legislators (dealing with the population) on the other. The therapist can afford to deal with the individual, because he or she can continue to work with the client over an extended period of time. The legislator must use the method of casting nets both in diagnosis and in specification of legal action, because the accused goes into the courtroom and out of it, and that's that. But I suppose the matter is more complicated than that. See the horror story about "schizoaffective disorder" on page 359 of People As.
Finally, I have to say that I find all this difficult to think about.
And I doubt this will be much help to David G. I think, David, you are in the same fix as Powers, Marken, and others who would like to publish in researchers' journals. The traditional assumptions and standards are simply foreign to thinking about a single individual. You have my sympathy.
Sorry to be so prolix.
Subject: Statistics Again