Statistics: what is it about?

[From Bill Powers (2009.01.03.0858 MST)]

Martin Taylor(various) --

I've been looking over the two references to Bayesian logic that you sent with the "Dangers" paper. I actually checked out the little conundrum that Mike Acree posted: what is the probability that John is the Pope given that John is Catholic, and what is the probability that John is Catholic given that John is the Pope? I said that one Catholic in 1000 is the pope, and 1 person in 10 is Catholic, and whaddaya know, the Bayesian theorem works:

p(P)p(C|P) = p(C)p(P|C)
(0.0001)(1) = (0.1)(0.001)

C: John is Catholic; P: John is Pope; p: probability
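
As a quick check of that arithmetic, here is a minimal Python sketch using only the illustrative numbers above (they are the post's made-up figures, not real population statistics):

# Bayes' identity p(P)p(C|P) = p(C)p(P|C), with the illustrative numbers above.
p_C = 0.1             # p(John is Catholic)
p_P_given_C = 0.001   # one Catholic in 1000 is the Pope (illustrative)
p_C_given_P = 1.0     # the Pope is certainly Catholic
p_P = p_C * p_P_given_C                          # = 0.0001

print(p_P * p_C_given_P == p_C * p_P_given_C)    # True: both sides are 0.0001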

But there's something funny about this, or my internal translation of it. It has something to do with writing p(P). The calculation implies that after we choose any person from the total population, there is a certain probability that this person is the Pope. But that probability actually applies only within the subset of Catholics; outside that subset, the probability that a randomly chosen person will be the Pope is zero. p(P) is defined only within the set of Catholics -- there isn't really a uniform probability within the total population. We have a value for p(P|C), but the value of p(P|not-C) is always zero. That doesn't seem to be explicit in the notation, nor would it seem to be necessarily true in all cases.

I guess this is just an echo of the "disorder" discussion. I think of probabilities in a more Bayesian way, perhaps, because I always think in terms of "the number or proportion of people who..." rather than "the chance that any given person will ...". We are always talking about subpopulations, it seems to me: some might do it, some don't and never would. No non-Catholic ever will be Pope, even though the population statistics for N people say there is one chance in N (or 2 chances, if we include Eastern Orthodox) for every person. It's not that everyone has an equal tendency or risk as implied by the probability; it's that only a certain subset has any probability at all of showing the effect. The probability of being Pope is measured relative to the size of the subset, not as the "strength" of a "tendency" in the whole population. The choosing of a person from the whole population is random, but the condition itself is not.

Maybe that's what you've been trying to say.

In a somewhat different context:

You can have a perfectly regular relationship between two variables, y = f(x), yet when you observe y as a function of x, it appears to have a random component. The random component is not a property of the functional relationship f, but of the method of observation. It's possible that the observation itself disturbs the observed variable in some unknown and therefore random-looking way, as per Heisenberg, but it's at least as likely that the observation is generated in a way that is only partly dependent on the actual states of the variables -- that we are observing the wrong variable. Like trying to see the temporal relationship between the way the violins play and the gestures of the conductor's right elbow.

When I was learning practical electronics, one of the first things we were taught was that measuring a voltage with an ordinary 20000-ohm-per-volt voltmeter drew enough current from a circuit to alter the voltage being measured. The thing we were taught next was how to compensate for that error, by figuring out the impedance of the measured circuit and using theory to calculate what the correct reading was. I always wondered why we couldn't do that for position and momentum measurements, but my physics professors would look offended and say that the uncertainty was in the position and momentum, not in the measurements. I never believed them.

What's on my mind now is the idea that perhaps there are two completely different contexts in which to apply probability ideas. One is the context of qualitative events: things that either happen or don't happen (which can be seen as logical propositions that are either true or not true). The other is the context of continuous quantitative relationships where we measure things rather than counting them.

When we measure things, probability does not apply to the act of measuring, but to the degree of uncertainty in the measurement. A voltmeter gives a reading of 10.0 volts, plus or minus 0.1. The probability enters with the 0.1, not the 10.0. But when we predict events, such as rain tomorrow, the 0.1 is all there is: either there is detectable rain tomorrow, or there is not. The laws of probability make sense in the latter qualitative case, but are relatively unimportant in the former quantitative case. There are, of course, gray areas where both are somewhat important: Yes it did rain, but you said there would be a downpour and all we got was a tenth of an inch.

This applies to perception. In some cases, qualitative probability enters: was that a dog I caught a glimpse of, or a coyote? But in ordinary everyday life, most perceptions are seen clearly and for long enough to leave only a small bit of uncertainty about the amount, and practically none about the identity or logical state. I don't think I have ever mistaken my glasses for a pencil and picked up the wrong thing. I drive a car without wandering from edge to edge of the road, and stop when the light is red but never when it's green. To measure uncertainty in perception, it seems to me, you have to set up pretty special conditions, with very low signal levels and masking noise or other distractions to make it difficult to see, hear, or what have you. Most of the time in real life, the world of perception is stable and repeatable, with uncertainty in the second to third decimal place, not in the first to second.

I think this pretty much wraps up my reluctance to think of information theory and uncertainty as being relevant to the way most of our control processes work. It's not that I think they don't apply, but that I don't think they can add much to the kinds of measurements we make in, for example, the tracking demo.

I can't say for sure, but I don't think that perceptual signals in the brain are any noisier than our experiences of them. When I say we may not be measuring them right, I mean that there may be parallel signals involved in what we experience as "a" signal, or that there are smoothing processes involved in the control loop that we don't see when we look at single-fiber activities, or that the signal levels are higher in most ordinary experiences than those used in neurological experiments, or that while the signal frequencies may be unpredictable, they change in smooth ways that show some regular analog process must be involved. I still say that in the records of spike trains that I have seen in many places, I see very little randomness in the way the frequencies seem to change. I really don't see the uncertainties that people write about. You just don't see those pulse train frequencies jumping randomly from one frequency to another: they speed up or slow down. About the only place where random-looking signals come up is in cases such as electromyograms, where we're really seeing the effects of many parallel signals which are not synchronized, so on very fast time scales we see what looks like noise. But the individual signals in single axons show smoothly changing frequencies, as far as I have ever seen.

The main uncertainty here, as far as I'm concerned, is what people are talking about when they say neural signals have a lot of uncertainty in them. We really need some data.

Best,

Bill P.


[Martin Taylor 2009.01.03.22.54]

[From Bill Powers (2009.01.03.0858 MST)]

Martin Taylor(various) --

I've been looking over the two references to Bayesian logic that you sent with the "Dangers" paper. I actually checked out the little conundrum that Mike Acree posted: what is the probability that John is the Pope given that John is Catholic, and what is the probability that John is Catholic given that John is the Pope? I said that one Catholic in 1000 is the pope, and 1 person in 10 is Catholic, and whaddaya know, the Bayesian theorem works:

p(P)p(C|P) = p(C)p(P|C)
(0.0001)(1) = (0.1)(0.001)

C: John is Catholic; P: John is Pope; p: probability

But there's something funny about this, or my internal translation of it. It has something to do with writing p(P). The calculation implies that after we choose any person from the total population, there is a certain probability that this person is the Pope. But that probability actually applies only within the subset of Catholics; outside that subset, the probability that a randomly chosen person will be the Pope is zero. p(P) is defined only within the set of Catholics -- there isn't really a uniform probability within the total population. We have a value for p(P|C), but the value of p(P|not-C) is always zero. That doesn't seem to be explicit in the notation, nor would it seem to be necessarily true in all cases.

I'm not going to answer this all here, because you encourage me to take up again the message I said I was working on, but which I had rather put aside for the moment. However, two key points are reasonable to make here.

1) All probabilities are subjective in the sense that they depend on the background knowledge of the observer. The background knowledge may include models, records of previous occurrences of the "probabilistic" matter, pure intuition, faith, and anything else you can think of. "Frequentist" probability, on which significance statistics are based, takes into account only records of previous occurrences (but see below for a problem with even this).

2) All probabilities are conditional. That's part of the background knowledge that you bring to bear. In the "frequentist" view of probability, you look at N situations, and say that on M of them the event E happened, so the probability of event E happening the next time is M/N. But that's not legitimate. You should at the very least say "conditional on the N situations all having the same characteristics insofar as event E is concerned, and on the next occurrence of the situation it will also have those characteristics." In physical fact, no two of the N situations had all the same characteristics. At the very least, the universe had a different age when each occurred.

What you are really saying is that the characteristics you, the observer and the generator of this probability measure, take into account are the same so far as you are aware in their influence on E, where the different events E also share all those characteristics that are of concern to you. You, the observer, are creating the sets of situations and events, selecting which characteristics of each are relevant to the proportion M/N, and you, the frequentist creator of the probability estimate, are saying that the next event that you will designate as being of the category E will occur with probability p = M/N when that same situation arises. You put a confidence bound on that, because you know that different base probability distributions could have all given rise to that observed proportion M/N.

That's a long-winded way of saying that because all probabilities are conditional on some background knowledge, the proportions of past events on which a frequentist bases a probability estimate are dependent on a perception that the condition belongs to a category and the event belongs to a category. These are perceptions.

Now think of p(P), which is expanded as p(Pope = John | John = person). You say "there isn't really a uniform probability within the total population". That's a true statement, but it isn't relevant if you don't know whether John is or is not a Catholic. When you write the expression p(P)p(C|P) = p(C)p(P|C), what is your prior state of knowledge? Do you or do you not know John is a Catholic? Setting aside the possibility of basing your probability on a model ("Non-Catholics are barred from becoming Pope") that is part of your background knowledge, what do you know from observing the situation (past Papal elections) about p(C|P), the probability that the next Pope will be a Catholic? Is it really zero, because the last 300 or so Popes have been Catholic, although St. Peter (Pope 1) presumably was not, since he was Jewish? It becomes a "White Swan" problem.

What you can say about the "White Swan" problem is that if the conditions remain the same, the likelihood that the next swan you see will be white increases the more swans you have seen without seeing a non-white one. Let's consider some hypotheses about that. Say we have been observing swans and they have all been white. We know little enough about the background of these swans that we are prepared to say the conditions will be considered to be "the same" the next time we see a swan, any swan. The hypotheses we will consider are

H1: No swans are white.
H2: Half of all swans are white
H3: 75% of all swans are white
H4: All swans are white.

Now, before you saw your first swan, you may or may not have had an opinion about the proportion of swans that are white. Maybe you figured any colour was possible, and on that basis you might say p(H4) was near zero if "white" means "exactly equal and high reflectance of all frequencies in the visible spectrum", and that H1 was almost certainly true. Or you might think of "white" as generically different from "red", "green", "yellow", "blue", and "black", in which case you might expect 1/6 of all swans you see to be white. However, for our purposes, let's think of someone assuming the only possibilities are white and black (though it really doesn't matter for the analysis), and that the person really has no opinion about what proportion have what colour. It's convenient but not necessary (as I show below) to say that the person gives equal credence (prior probability) to any hypothesis about the "real" proportion of swans that are white. If we are dealing with only four hypotheses, as above, then: p(H1) = p(H2) = p(H3) = p(H4) = 0.25

Now we observe our first swan, and it is white (note the difference between saying that and "now we observe a white swan"). What can we now say about the hypothesis probabilities (posterior probability)?

p(H|D) = p(H) * p(D|H) / p(D)

Of course, we don't know p(D), but since it is the same for all the hypotheses, it doesn't affect their relative probability of being the correct hypothesis (among those we are considering). It is a constant multiplier (or divisor, if you like). For clarity in writing the formulae, I will take c = 1/p(D). (p(D) can matter: if, for example, the swan observer used a filter that blocked out all non-white swans, then the observations could not discriminate among the hypotheses, since p(D|H) would always be 1.0. A less efficient filter might let in some black swans if any ever came, but the filter would nevertheless alter p(D|H). That's what happens when we decide a priori what kind of evidence we are going to ignore in our investigations.)

All the probabilities, of course, are conditional on some background knowledge, but again, that is the same for all of them so we don't have to notate it. I'll use the notation L(H) for the likelihood of H. Likelihoods are relative numbers, whereas probabilities must add to unity if all possibilities have been accounted for. D is the observation (in this case of a white swan).

L(H1|D) = 0.25 * 0 * c = 0   (We can discard this one right away.)
L(H2|D) = 0.25 * 0.5 * c = 0.125 * c
L(H3|D) = 0.25 * 0.75 * c = 0.1875 * c
L(H4|D) = 0.25 * 1 * c = 0.25 * c

Although these have been notated as likelihoods, L(H|D), they aren't yet real probabilities: the four hypotheses are considered to be the only possibilities, which means the sum of their probabilities must be unity, and these numbers do not sum to unity. Remember we are talking about subjective probabilities assigned by a person for whom these are the only possible hypotheses, so the idea that the proportion of white swans might be 90% simply doesn't arise. Conditional on these being the only possible hypotheses, the probabilities are:

p(H1|D) = 0
p(H2|D) = 0.125 * c / (0.5625 * c) = 0.222...
p(H3|D) = 0.1875 * c/ (0.5625 * c) = 0.333...
p(H4|D) = 0.25 * c / (0.5625 * c) = 0.444...

The multiplier constant c = 1/p(D) drops out. After seeing one swan and it was white, we now think it twice as likely that all swans are white than that half of them are, and we are certain that it's wrong to say none of them are white.
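
Replayed as a minimal Python sketch (using only the numbers already given above, nothing else assumed):

# One Bayesian update step: prior * likelihood, then normalize.
priors = [0.25, 0.25, 0.25, 0.25]      # p(H1)..p(H4)
p_white = [0.0, 0.5, 0.75, 1.0]        # p(white swan | H)

unnormalized = [p * q for p, q in zip(priors, p_white)]
total = sum(unnormalized)              # plays the role of p(D)
posteriors = [u / total for u in unnormalized]
print(posteriors)                      # [0.0, 0.222..., 0.333..., 0.444...]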

Now we observe another swan and it also is white. I'll drop the "c" from the equations, and now we can ignore H1, since no matter how many non-white swans we see, we know it is inconsistent with the first swan, which was white. D in this case is the observation of the second white swan. The observation of the first is subsumed in the revision of the prior probabilities from 0.25 all round to 0, .222, .333, and .444. It is part of the background knowledge at this point.

L(H2|D) = 0.222 ...* 0.5 = 0.111...
L(H3|D) = 0.333... * 0.75 = 0.25
L(H4|D) = 0.444... * 1 = 0.444...

The sum is .80555..., which gives us the probabilities for the hypotheses after two observations:

p(H2|D) ~= 0.138
p(H3|D) ~= 0.310
p(H4|D) ~= 0.552

After seeing 10 swans which were all white, we can compute the likelihoods from the initial formula rather than recomputing new prior probabilities, but we can simplify by ignoring p(D) because it always falls out from the comparisons. D in this case is the observation of 10 swans, not just of the tenth swan.

L(H1|D) = 0.25 * 0 = 0
L(H2|D) = 0.25 * (0.5)^10 = .0002
L(H3|D) = 0.25 * (0.75)^10 = .014
L(H4|D) = 0.25 * 1 = 0.25

Normalizing so that the probabilities sum to 1.0, the posterior probabilities are

p(H1|D) = 0
p(H2|D) = 0.0009
p(H3|D) = 0.053
p(H4|D) = 0.946

By this point, the initial prior distribution has become almost irrelevant.

Let's see what the probabilities would be after 10 swans, all of them white, if the prior probabilities had been

p(H1) = 0.1
p(H2) = 0.7
p(H3) = 0.1
p(H4) = 0.1

From these, we would get

L(H1|D) = 0
L(H2|D) = 0.7 * (0.5)^10 = .00068
L(H3|D) = 0.1 * (0.75)^10 = .0056
L(H4|D) = 0.1 * 1 = 0.1

and

p(H1|D) = 0
p(H2|D) = 0.0064
p(H3|D) = 0.053
p(H4|D) = 0.941

Which is not very different from the probabilities we got using equal prior probabilities. This is almost always the result unless one's prior probability for some hypothesis is either zero or unity. One's initial prejudices get swamped by the data unless they are extremely strong (faith) or the data are more or less equally consistent with more than one of the original hypotheses (as would be the case if the hypotheses were that the proportion of white swans was either 0.4 or 0.6 and the observations were 50-50 white).
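
Both runs can be reproduced with a short Python sketch that just multiplies prior by likelihood and normalizes; nothing is assumed beyond the hypotheses and priors already stated, and the figures match the ones above up to rounding:

def posterior_after_white_swans(priors, n):
    # Posterior over the four swan hypotheses after observing n swans, all white.
    p_white = [0.0, 0.5, 0.75, 1.0]
    unnorm = [p * q ** n for p, q in zip(priors, p_white)]
    total = sum(unnorm)
    return [round(u / total, 4) for u in unnorm]

print(posterior_after_white_swans([0.25, 0.25, 0.25, 0.25], 10))
# [0.0, 0.0009, 0.0533, 0.9458]
print(posterior_after_white_swans([0.1, 0.7, 0.1, 0.1], 10))
# [0.0, 0.0064, 0.053, 0.9406]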

I guess that's a kind of introduction to page 1 of the Bayesian notes, and to the message I had been working on but had set aside as moot.

I may comment separately on the rest of your message, but I may leave it for the long message that I will take up again later. I had intended this reply to be only a few lines long, about all probabilities being conditional and subjective, but somehow it morphed into this tutorial. Sorry about that.

Martin

[From Bill Powers (2009.01.04.0812 MST)]

Martin Taylor 2009.01.03.22.54 –

  1. All probabilities are
    subjective in the sense that they depend on the background knowledge of
    the observer. The background knowledge may include models, records of
    previous occurrences of the “probabilistic” matter, pure
    intuition, faith, and anything else you can think of.
    “Frequentist” probability, on which significance statistics are
    based, takes into account only records of previous occurrences (but see
    below for a problem with even this).

This post has clarified your meanings considerably. So much so that I
want to make a recommendation, which is that we change the name of the
axis that goes from “uncertain” to “certain” from
“probability” to “belief.” I saw hints of this in
reading other sources about Bayesian probability, but your discussion
makes it the clearest: all probabilities are subjective. The progression
in the “White Swan” example shows that it is the state of
belief, not anything about the objective probable occurrence of white
swans, that changes as successive observations are made. This also takes
care of Mike Acree’s lovely critique in which he points out (in effect)
that the objective probability that the next swan will be white, derived
simply from observations, is the same as the probability that it will
have a head.
The a priori probabilities of which you speak are what we are led
to expect on the basis of theories about the world, or past experiences
with it. The a posteriori probabilities are what lead us to change
our predictions, if they do not agree with predictions based on the a priori
expectations. What you’re describing is one version of
experimental science.

The frequentist approach is the simplest of theories, which says only
that what has happened before will happen again. As the cumulative record
of observed frequencies of various possible occurrences grows, the
predictions about which one will happen next change, with, presumably,
some steady-state distribution of belief being reached, the distribution
we have been calling the probability distribution. This distribution has
nothing to do with the objective truth, but reflects only our
expectations.

This leads right into your second point:

  2. All probabilities are
    conditional. That’s part of the background knowledge that you bring to
    bear.

Simply change “background knowledge” to “expectations”
and this says what I said above (and what you said first). Whether these
expectations are based on superstition, faith, or impeccable scientific
model-building (which produces expectations that we elevate to the status
of knowledge), what we believe depends on what we expect. All that
matters is that we have some degree of belief in each possible
prediction, and revise those degrees of belief as evidence
accumulates.

In the
“frequentist” view of probability, you look at N situations,
and say that on M of them the event E happened, so the probability of
event E happening the next time is M/N. But that’s not legitimate. You
should at the very least say “conditional on the N situations all
having the same characteristics insofar as event E is concerned, and on
the next occurrence of the situation it will also have those
characteristics.” In physical fact, no two of the N situations had
all the same characteristics. At the very least, the universe had a
different age when each occurred.

The frequentist theory is that what has happened before is most likely to
happen again: there is nothing new under the sun. If M out of N
occurrences come out a particular way, the next M out of N will come out
that same way. It isn’t necessary that your expectation reflect the
actual ratio of M to N – if you keep track of the data on paper and use
a hand calculator to define your expectation, M/N would be the
appropriate expectation. But if you do the calculation in your head and
unconsciously, all we can say is that your degree of belief, or your
subjective prediction, will be some function of M and N. I think the
experimental data obtained on this question are pretty clear that
subjective expectations in most people do not conform to the strict
“M/N” hypothesis, whether frequentist or Bayesian.

But now that’s not a problem, because all beliefs are subjective and
conditional on theory-based or experience-based expectations.

What you are really saying is
that the characteristics you, the observer and the generator of this
probability measure, take into account are the same so far as you are
aware in their influence on E, where the different events E also share
all those characteristics that are of concern to you. You, the observer,
are creating the sets of situations and events, selecting which
characteristics of each are relevant to the proportion M/N, and you, the
frequentist creator of the probability estimate, are saying that the next
event that you will designate as being of the category E will occur with
probability p = M/N when that same situation arises. You put a confidence
bound on that, because you know that different base probability
distributions could have all given rise to that observed proportion
M/N.

Well, some observers may know that, but I don’t think many do. That’s the
long way of saying that people have different theories that lead them to
predict differently. They may or may not know that their theories are not
like other theories other people use, or that you could use instead.

What you can say about the
“White Swan” problem is that if the conditions remain the same,
the likelihood that the next swan you see will be white increases the
more swans you have seen without seeing a non-white one.

I think we can now say this in a way that doesn’t imply the objective
existence of some “likelihood” of seeing a white swan. The
actual likelihood is irrelevant; the black swan, which is waiting in the
wings, could come onstage next at any time by the whim of the stage
manager. All that matters is the subjective expectation, which is related
to actual occurrences according to the way each individual is
organized.

Your long series of calculations of degrees of belief ends with

p(H1|D) = 0

p(H2|D) = 0.0064

p(H3|D) = 0.053

p(H4|D) = 0.941

Which is not very different from the probabilities we got using equal
prior probabilities.

But it is probably different from the expectations of anyone who does not
use either of those particular mathematical methods of generating
expectations. In effect, the Bayesian system is a model which is used to
generate expectations about the relative degrees of belief a person will
arrive at on the basis of successive subjective observations. While it
creates a plausible proposal, experiments have shown that no one model
can explain everyone’s way of adjusting expectations on the basis of
successive observations.

For example: If a single acceptable observation of a person behaving as a
stimulus-response system were to occur, I would have to revise my
expectations about the behavior of all organisms rather drastically; much
as I would have to revise a large number of my expectations if a single
swan paddled by without its head. Mike Acree referred to this kind of
effect on expectations as depending on whether the variation under
discussion is superficial or fundamental: whether we are basing
expectations on things like color and number of feathers which are not
directly important in the swan’s life, or on things like a head or eyes
or a stomach without which, most of us would easily agree, the swan could
not live. Those opinions, too, are subjective beliefs, but they are of a
different order of belief than expectations based on nothing but
frequency. Some joker could be deliberately diverting all the black swans
from the stretch of river we are observing, but that, we can be sure,
does not apply to the idea of swans without heads.

It’s coincidental but instructive that David Goldstein’s post about QEEG
showed up at the same time as yours. Here we have a prime example of the
frequentist theory of knowledge. In this method, one records a large
number of EEG samplings of scalp currents and looks for coherences among them in
many frequency bands, producing millions of observations in a few
minutes. Then one compares the data on the basis of clinical judgment of
subjects as to whether they are “normal” or
“abnormal” in their behavior. If a person shows
“abnormal” patterns, that person is judged to have something
wrong with him or her. What is wrong is unknown, but treatments can be
assigned based on whether they helped other people in the past to move
from clinically “abnormal” to clinically “normal.”
Thus the diagnosis and treatment can be automated without the slightest
understanding of how the brain works or how human behavior is organized,
or of whether the treatment actually leaves the person better off than
before.

The fact that all these judgements and actions are based on subjective
impressions of normality and abnormality is very effectively concealed by
the impressive instrumentation and massive snowdrifts of data. But one
can think of some questions about the advisability of this approach.
Should a child diagnosed with ADHD be forced with drugs or other
treatments to stop showing those behaviors? Or should we find out what
the child is trying to accomplish and offer help in accomplishing it?
Maybe what’s wrong is not in the child, but the QEEG approach necessarily
ignores that possibility.

I now see the Bayesian concept of probability as the only plausible one.
All probabilities, all degrees of belief, are conditional on what we
already accept as true. The question is, now that we know this, what we
should do about it.

Best,

Bill P.

[From Bill Powers (2009.01.04.0812 MST)]
I now see the Bayesian concept of probability as the only plausible one. All probabilities, all degrees of belief, are conditional on what we already accept as true. The question is, now that we know this, what we should do about it.

For anyone wanting to know more about Bayes' theorem, there's an excellent introduction here:

An Intuitive Explanation of Bayes' Theorem
http://yudkowsky.net/rational/bayes

The other material on that site is also well worth reading, but be warned, there's a vast amount of it!

···

--
Richard Kennaway, jrk@cmp.uea.ac.uk
School of Computing Sciences,
University of East Anglia, Norwich NR4 7TJ, U.K.

[Martin Taylor 2009.01.04.14.05]

Richard Kennaway wrote:

[From Bill Powers (2009.01.04.0812 MST)]
I now see the Bayesian concept of probability as the only plausible one. All probabilities, all degrees of belief, are conditional on what we already accept as true. The question is, now that we know this, what we should do about it.

For anyone wanting to know more about Bayes' theorem, there's an excellent introduction here:

An Intuitive Explanation of Bayes' Theorem
http://yudkowsky.net/rational/bayes

I'm sorry, but I don't think it's an excellent introduction, for two reasons:

(1) It gives the impression that if you are to use Bayes' theorem the hypothesis universe is restricted to H and ~H, a binary choice. That impression is enhanced when the form of Bayes' theorem is finally given late in the essay, since the denominator is given as p(X|A)*p(A) + p(X|~A)*p(~A), implying that only two hypotheses can ever be considered (the general form is sketched below).

(2) It deals only in "objective" frequentist probabilities, and gives the impression that there could be "correct" prior probabilities for the binary choices.

Both of these problems could lead a naive reader into quite serious misunderstanding of the nature and power of Bayesian analysis.
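
For comparison, the general form puts a sum over the whole hypothesis set in the denominator, so nothing restricts the analysis to H and ~H. A minimal Python sketch (not taken from the essay):

def bayes_update(priors, likelihoods):
    # priors:      p(H_i) over any number of hypotheses, summing to 1
    # likelihoods: p(D | H_i) for the same hypotheses
    # The denominator p(D) = sum_i p(D|H_i)*p(H_i) generalizes
    # the two-hypothesis form p(X|A)*p(A) + p(X|~A)*p(~A).
    joint = [p * l for p, l in zip(priors, likelihoods)]
    p_data = sum(joint)
    return [j / p_data for j in joint]

# e.g. three hypotheses rather than a binary choice:
print(bayes_update([0.2, 0.5, 0.3], [0.9, 0.1, 0.4]))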

Martin

[From Bill Powers (2009.01.04.1210 MST)]

For anyone wanting to know more about Bayes' theorem, there's an excellent introduction here:

An Intuitive Explanation of Bayes' Theorem
http://yudkowsky.net/rational/bayes

How nice, I guessed reasonably right. It's scientific method. The variable farthest to the right, however, should not be called "reality." It's another model.

Inadvertently, Yudkowsky confirms my claim that people don't reason using Bayesian logic, unless they specifically learn to do so. If they did, why would anyone come up with wrong answers to his questions? While they use what I call their logic levels to get the right answers, the same levels can be organized to reason in other ways. I've seen that exercise about mammography before in other forms, with the same basic explanation involving the ratio of false positives to true positives, but without Bayes. So I would have arrived at the right conclusion using my logic level without knowing Bayesian logic, but for effectively the same reasons. The people who arrive at the wrong conclusion use their logic levels to do that, too, but they're organized differently.
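
For readers who have not seen the mammography exercise, its structure is the one described here: compare true positives with all positives. A minimal Python sketch with purely illustrative screening numbers (not necessarily the ones Yudkowsky uses):

prevalence = 0.01       # p(condition)               -- illustrative only
sensitivity = 0.80      # p(positive | condition)    -- illustrative only
false_positive = 0.10   # p(positive | no condition) -- illustrative only

true_pos = prevalence * sensitivity
false_pos = (1 - prevalence) * false_positive
print(true_pos / (true_pos + false_pos))   # ~0.075: most positives are false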

Also, I notice that even Bayesian logic seems to require dealing in qualitative events or propositions rather than quantitative measurements. If I propose that F/M = A, and observe that F/M = 10 and A = 9.999, what is the value of p(F/M = 10 and A = 9.999 | F/M = A)? It's zero: A can't be equal to both 10 and 9.999 at the same time.

It simply doesn't make sense to use logical arguments about a quantitative relationship: wrong level of perception. When variables have an infinite number of possible values (or even just an unreasonably large number of them), we use not logic but algebra. And algebra is not about probabilities or beliefs.

Best,

Bill P.


[Martin Taylor 2009.01.04.14.13]

[From Bill Powers (2009.01.04.0812 MST)]

Martin Taylor 2009.01.03.22.54 --

2) All probabilities are conditional. That's part of the background knowledge that you bring to bear.

Simply change "backround knowledge" to "expectations" and this says what I said above (and what you said first).

In my language, "expectations" would be the prior probability distribution over the hypotheses, whereas "background knowledge" is what leads to those expectations.

I'm not keen on revising the words conventionally used, much as you were not keen on finding a new word to represent an internal signal called "perception" in PCT. "Probability" as necessarily subjective is as closely related to "probability" in everyday parlance as are the technical and everyday use of "perception". "Belief", to me, has a somewhat different connotation. Maybe it's because I've used the terms for 50 years, but "prior" and "posterior" probabilities seem to have a nice precise connotation (limits between 0 and 1.0, the total of all mutually exclusive probabilities summing to 1.0, and so forth), as does "likelihood" (likelihoods don't necessarily sum to 1.0). Perhaps you can map "belief" onto probability by normalizing. I don't know.

What you can say about the "White Swan" problem is that if the conditions remain the same, the likelihood that the next swan you see will be white increases the more swans you have seen without seeing a non-white one.

I think we can now say this in a way that doesn't imply the objective existence of some "liklihood" of seeing a white swan.

Sorry. I'm so used to thinking only of subjective probabilities that I tend to omit the word "subjective". A likelihood is the probability of the hypothesis given the data, so all likelihoods are subjective. You really can't mix subjective and the rather surreal "objective" quantities in the same formula.

The actual likelihood is irrelevant; the black swan, which is waiting in the wings, could come onstage next at any time by the whim of the stage manager.

Yes, and I will consider that in the next episode. At the moment, all that we need to know is that the observer gives no credence to the hypothesis that there is a stage manager. Once you admit the stage manager hypothesis, it is totally consistent with all possible data. Only an extremely low prior (based on faith or models) can discredit the stage manager (personal God, guardian angel) hypothesis. One has to say "OK, but I just don't believe things work according to the whim of a stage manager". The problem is that it's a useless hypothesis, if one wants to use models and relations among observations to predict future events such as the sun coming up tomorrow. It invalidates all predictions, and yet, in the past, many such predictions have panned out, so we expect them to pan out tomorrow, as well. My subjective probability that the sun will rise tomorrow is very near 1.0, but I cannot dispute anyone who asserts that there is a stage manager who is capable of making it not happen.

In effect, the Bayesian system is a model which is used to generate expectations about the relative degrees of belief a person will arrive at on the basis of successive subjective observations. While it creates a plausible proposal, experiments have shown that no one model can explain everyone's way of adjusting expectations on the basis of successive observations.

True. But I think you misstate what the Bayesian system is. It isn't a model of what people do, but a model of the best they could do. It's a mathematical ideal, and in few, if any, situations do humans perform in a mathematically ideal way. I'd reword your first sentence here to substitute "could arrive at" for "will arrive at". A Bayesian analysis of experimental results, carried out correctly, can reach the ideal, but most people don't do that in everyday operation -- nor do most analyses of experimental results get carried out correctly (using appropriate conditionals in making generalizations, for example).

On the other hand, in a lot of situations, when something affects the performance of the ideal in some way, people very often are affected in the same way. In psychoacoustics, for example, the ability of a person to detect a signal (say "yes" when the signal exists and "no" when it doesn't) as measured by d' has a well defined ideal value for any particular defined signal and noise, provided that the prior information available to the ideal and to the human is the same. A well trained observer may get within 6 db of the ideal, and a highly trained one within 3 db. A more skilled reasoner is likely to come closer to the Bayesian ideal than is an unpracticed one. Sherlock Holmes can make wrong deductions from the given data, but for the same number of wrong deductions he will make more correct ones than will most people. Even he, however, cannot do better than a Bayesian analyst working with the same data and hypothesis universe.
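
For readers unfamiliar with d': it is conventionally obtained from the hit and false-alarm rates through the inverse normal CDF. A minimal Python sketch (the rates below are made up purely for illustration):

from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    # d' = z(hit rate) - z(false-alarm rate), z being the inverse normal CDF
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

print(round(d_prime(0.84, 0.16), 2))   # ~1.99 for these illustrative rates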

If the noise power is changed in a psychoacoustic detection experiment, it changes the performance of the ideal and of the human by close to the same amount (in any experiments I have done). Likewise if we change the prior knowledge of the subject (and the ideal) about the exact waveform of the signal, that changes the performance of the ideal, and of the human, quite drastically and by the same amount (in the human's case we do that by providing the signal and noise in one ear, and the would-be signal simultaneously in the other ear, whether it is truly embedded in the noise or not).

So, even if people don't actually behave in an ideal way, it's not a bad thing to understand the mathematically ideal limit to what they could do. Personally, I would not be very surprised if it turned out that the peripheral neural structures had evolved to function in a near ideal way (within, say, 3db) when it comes to relating low-level perceptions to states of the external environment. (That speculation was at the base of my comments on neural spike timing effects last month).

Martin

[Martin Taylor 2009.01.04.15.16]

[From Bill Powers (2009.01.04.1210 MST)]

Also, I notice that even Bayesian logic seems to require dealing in qualitative events or propositions rather than quantitative measurements. If I propose that F/M = A, and observe that F/M = 10 and A = 9.999, what is the value of p(F/M = 10 and A = 9.999 | F/M = A)? It's zero: A can't be equal to both 10 and 9.999 at the same time.

What were the prior beliefs and the hypotheses in this example? Is it a hypothesis that F/M = A, and are you wanting to test that hypothesis? If so, and if your observables F/M and A were observed with infinite precision, then indeed your observations have totally discredited the hypothesis F/M = A, in exactly the same way as the observation of the first white swan discredited the hypothesis "no swans are white". How is that a problem?

Of course, if your observations were not made with infinite precision, the story is a bit different, isn't it?

Martin

[From Bill Powers (2009.01.05.0942 MST)]

Martin Taylor 2009.01.04.14.13 –

In my language,
“expectations” would be the prior probability distribution over
the hypotheses, whereas “background knowledge” is what leads to
those expectations.

OK. Both are established prior to testing the hypothesis.

I’m not keen on revising the
words conventionally used, much as you were not keen on finding a new
word to represent an internal signal called “perception” in
PCT. “Probability” as necessarily subjective is as closely
related to “probability” in everyday parlance as are the
technical and everyday use of “perception”.

I’m not sure of that. Don’t quantum physicists typically speak as if the
uncertainties involved in quantum phenomena are really there in nature,
rather than in the observer’s mind? Or do you exclude that from
“everyday” use? Don’t gamblers usually speak of probabilities
as if they were really there in the cards or the dice? Is a 30% chance of
showers tomorrow spoken of as if it’s in the forecaster’s perceptions –
or the viewer’s?

“Belief”, to me,
has a somewhat different connotation. Maybe it’s because I’ve used the
terms for 50 years, but “prior” and “posterior”
probabilities seem to have a nice precise connotation (limits between 0
and 1.0, the total of all mutually exclusive probabilities summing to
1.0, and so forth), as does “likelihood” (likelihoods don’t
necessarily sum to 1.0). Perhaps you can map “belief” onto
probability by normalizing. I don’t know.

OK, but when you start defining belief and likelihood so they can be
calculated from a set of definitions, it seems to me that this makes them
something other than subjective. Is a Gaussian probability distribution
subjective? Or is it Gaussian for everybody, whether they believe it is
or not?

What you can say about the
“White Swan” problem is that if the conditions remain the same,
the likelihood that the next swan you see will be white increases the
more swans you have seen without seeing a non-white one.

I think we can now say this in a way that doesn’t imply the objective
existence of some “liklihood” of seeing a white swan.

Sorry. I’m so used to thinking only of subjective probabilities that I
tend to omit the word “subjective”. A likelihood is the
probability of the hypothesis given the data,

… which means “the degree to which you will believe the hypothesis
given that you believe what the data appear to tell you?” The
shorthand way of putting it suggests strongly that there is an actual
probability of the hypothesis independent of the observer’s belief in it,
and that the data have meaning independent of the observer’s
interpretation. I think it’s better to use the long form of the
description in cases where a different interpretation is common and we
can expect it to be applied – or where the difference is the subject of
the conversation.

so all likelihoods are
subjective. You really can’t mix subjective and the rather surreal
“objective” quantities in the same formula.

Yes, but you just did it above – as far as some stranger reading your
words would be concerned. When you say " the probability of the
hypothesis given the data," you imply that there is in fact a
probability of the hypothesis, and that the data are simply
“given” to anyone who observes them.

The actual likelihood is
irrelevant; the black swan, which is waiting in the wings, could come
onstage next at any time by the whim of the stage manager.

Yes, and I will consider that in the next episode. At the moment, all
that we need to know is that the observer gives no credence to the
hypothesis that there is a stage manager. Once you admit the stage
manager hypothesis …

My words sent you off on a wild swan chase. I didn’t mean to emphasize
the “whim” part of that – the idea of someone behind the
scenes trying to fool us. I agree that that is a pointless assumption. I
meant only that the stage manager, following the rules of the script, is
holding the black swan aside for the proper time, at which time it will
appear. The black swan already exists although we haven’t seen it yet.
“Whim” was the wrong word. Perhaps, like a superenergetic
cosmic ray, it will appear during only one observing period out of a
million. But appear it will, and it will show that not all cosmic rays
are “normal,” even though a given observer has seen only normal
ones in his lifetime. Its existence is not uncertain; only the time when
it will be observed is.

In effect, the Bayesian
system is a model which is used to generate expectations about the
relative degrees of belief a person will arrive at on the basis of
successive subjective observations. While it creates a plausible
proposal, experiments have shown that no one model can explain everyone’s
way of adjusting expectations on the basis of successive
observations.

True. But I think you misstate what the Bayesian system is. It isn’t a
model of what people do, but a model of the best they could
do.

Yes, I was really trying to make that point. It is, as you say, an
idealized method, invented by someone and adopted by others but not a
“natural” aspect inherent in the brain’s organization.

On the other hand, in a lot of
situations, when something affects the performance of the ideal in some
way, people very often are affected in the same way.

In psychoacoustics, for
example, the ability of a person to detect a signal (say “yes”
when the signal exists and “no” when it doesn’t) as measured by
d’ has a well defined ideal value for any particular defined signal and
noise, provided that the prior information available to the ideal and to
the human is the same.

But if you say “the same prior information” you’re simply
assuming a human being who “correctly” interprets prior
observations in the “correct” way, so you’re just comparing two
ideal computations, one made by you and one made by the subject. How did
the subject do it? You don’t know. Maybe the ones who do it best have
figured out some of the principles of Bayesian logic.

A well trained observer
may get within 6 db of the ideal, and a highly trained one within 3 db.

What kind of db are those? If you mean 10 Log(x)[base 10], 3 db is a
factor of two in the measure of x, and 6 db is a factor of 4, neither of
which is very close. The problem with experiments like these is that they
create low-probability perceptions, and lead to interpreting the
results as if they apply to all perceptions – most of which are very
high-probability perceptions. An accuracy of 3 db may be astonishingly
(or expectedly) good for a very noisy observation, but it’s terrible for
an observation made under normal conditions.
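
Spelled out, the conversion being used here, as a small Python sketch (nothing assumed beyond the 10*log10 definition):

def db_to_ratio(db):
    # ratio x such that 10 * log10(x) = db
    return 10 ** (db / 10)

print(db_to_ratio(3))   # ~2.0  (a factor of two)
print(db_to_ratio(6))   # ~3.98 (roughly a factor of four)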

A more skilled reasoner is
likely to come closer to the Bayesian ideal than is an unpracticed one.

Well, yes – that’s probably how Bayes got his insight, isn’t it, by
becoming more skilled? But I reason in ways quite compatible with
Bayesian reasoning just by considering the underlying relationships. The
probability that John is Pope can be computed only if you know the
probability that a given person is Catholic and remember that the Pope
has to be Catholic. You’re really asking for the probability p[(John is
Catholic) AND (John is Pope)]. You can reason that way without
(knowingly) applying Bayes’ Theorem. And also without realizing that the
need for such reasoning can be concealed in a complicated
problem.

Sherlock Holmes can make wrong
deductions from the given data, but for the same number of wrong
deductions he will make more correct ones than will most people. Even he,
however, cannot do better than a Bayesian analyst working with the same
data and hypothesis universe.

That is true, to the extent that both live in a universe where it is
difficult to know what the data are. But I don’t think Sherlock is any
better at distinguishing between his morning egg and his morning bacon
than is any person randomly chosen off the street – unless both are
forced to observe their breakfast through a fog at twilight from a
distance of 100 feet. Sherlock will notice the significance of a faint
odor of burnt grease and a flash of fuzzy yellow where the ignorant
wherryman (whatever that is) doesn’t.

Neither do I believe that Sherlock’s superiority is a matter of
statistical variations in his conclusions being smaller than those of
other people. His mean conclusion is better than the mean conclusion of
other people. If you see what I mean.

So, even if people don’t
actually behave in an ideal way, it’s not a bad thing to understand the
mathematically ideal limit to what they could do.

We part company when you assume that it is difficult most of the time to
behave in an ideal way because of large uncertainties of observation. All
the examples and experiments you cite assume (or impose) a high
noise-level on the observations, so instead of p(A|B) being 0.99, it is
0.3. I will readily admit that Bayesian statistics are appropriate in
cases where low probabilities are the norm. But I don’t admit that such
cases are predominant or even important in ordinary behavior – more than
occasionally.

Personally, I would not be
very surprised if it turned out that the peripheral neural structures had
evolved to function in a near ideal way (within, say, 3db) when it comes
to relating low-level perceptions to states of the external environment.
(That speculation was at the base of my comments on neural spike timing
effects last month).

I think that normally we get a LOT closer than 3 db. You couldn’t hum the
right note when the choir director blows into the pitch pipe if you could
perceive the difference in pitch only within 3 db, a factor of 2. Most
people can get within a fraction of a half-tone, which is a fraction of
the twelfth root of 2 of an octave, or a fraction of 1/4 db. If a
carpenter could measure distances only within 3 db, the house he is
building would be a random pile of sticks. Nobody would survive driving a
car.
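
The semitone figure checks out under the same 10*log10 convention; a one-line Python check:

import math

semitone_db = 10 * math.log10(2 ** (1 / 12))   # one half-tone, as a ratio, in db
print(round(semitone_db, 3))                   # ~0.251, i.e. about 1/4 db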

When you investigate mostly cases in which signal-to-noise ratios are
low, you may tend to forget about all the other, far more frequent, cases
in which it is high. It’s like looking at a coarse half-tone picture with
a magnifying glass. All you see are the dots. It’s hard to see that it is
even a picture of something, until you put the magnifier aside and back
up a little. Then you see a perfectly clear photo of a person who is
without the slightest doubt your wife or someone who looks exactly like
her.

This is another way to see my reluctance to adopt your statistical
approach to perception. I think you are considering only a small part of
the whole dynamic range of perception; you examine the signal with enough
magnification to make uncertainties noticeable and measurable, or you
simply magnify uncertainties beyond their normal magnitude. Most of the
dynamic range is outside the field of observation. While we do sometimes
have to stumble through darkened rooms and blindly pick yellow M&Ms
out of a jar and try to hear what someone is saying while rock music is
blasting out of a speaker next to us, most of the time no such
difficulties exist. Then the uncertainties shrink into minor measurement
variations with very little range relative to the mean, and the random
differences become so small as to make no difference. Of course the
mathematics of uncertainty is still valid, but only in the statistical
sense of confidence levels; valid, but unimportant. The keys on which I
am typing are made of jiggling molecules, but that doesn’t make the keys
hard to find or press.

Best,

Bill P.

[From Bill Powers (2009.01.05.1303 MST)]

Martin Taylor 2009.01.04.15.16 --

Also, I notice that even Bayesian logic seems to require dealing in qualitative events or propositions rather than quantitative measurements. If I propose that F/M = A, and observe that F/M = 10 and A = 9.999, what is the value of p(F/M = 10 and A = 9.999 | F/M = A)? It's zero: A can't be equal to both 10 and 9.999 at the same time.

What were the prior beliefs and the hypotheses in this example? Is it a hypothesis that F/M = A, and are you wanting to test that hypothesis? If so, and if your observables F/M and A were observed with infinite precision, then indeed your observations have totally discredited the hypothesis F/M = A, in exactly the same way as the observation of the first white swan discredited
the hypothesis "no swans are white". How is that a problem?

Evidently my example didn't make that clear. I was trying to show that if you treat continuous measurement variables like logic variables, you come up with absurd conclusions.

What is the probability that we will observe the conjunction of F/M = 10 (one observation) and A = 9.999 (a second observation), given that past experience has shown that F/M = A? The right answer is that A = 9.999 is essentially the same as A = 10, and the "given" condition is simply an idealization of a physical law, so the probability of the stated condition is not zero. But if you treat this as a problem in logic, then clearly it is impossible for F/M to equal 10 at the same time that we observe A to be equal to 9.999, given the generalization that F = MA, because for that to be true, 9.999 would have to be equal to 10, which it is not (no matter how many 9s there are).

Of course, if your observations were not made with infinite precision, the story is a bit different, isn't it?

Of course. But the means of the observations will never exactly support the premise that F/M = A, so it is highly unlikely for the probability to be other than zero if we treat the statements as logical conditions. All this would be obviated if the "given" condition were stated as, for example, F/M = A +/- 5%.

Now the given condition is no longer a logical variable but a quantitative statement with a specified range of uncertainty. If we know the exact distribution of the uncertainty, we can judge the probability that the left-hand term is true given the right-hand term; that probability will vary continuously with the difference between the observed value of F/M and the observed value of A. It can no longer be determined by the Bayesian calculation. If A is 9.9 we will get one probability; if it is 9.6 we will get a different probability. More exactly, we would have to know the distributions of all the stated equalities, but I think the point is clear. The rules governing continuous relationships are different from the rules governing logical conditions.
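
One way to put that concretely: once the "given" condition carries a stated error distribution, the hypothesis F/M = A assigns the observed discrepancy a likelihood that varies smoothly with its size, rather than the all-or-nothing verdict of the logical reading. A minimal Python sketch, assuming (purely for illustration) a Gaussian error on the discrepancy with a standard deviation of 0.1:

from statistics import NormalDist

f_over_m = 10.0
a = 9.999
sigma = 0.1   # assumed spread of measurement error on (F/M - A); illustrative

# Likelihood of the observed discrepancy under the hypothesis F/M = A
likelihood = NormalDist(0.0, sigma).pdf(f_over_m - a)
print(likelihood)   # ~3.99, comfortably nonzero; a discrepancy of 0.4 would
                    # score far lower, and one of 1.0 essentially zero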

Best,

Bill P.

[Martin Taylor 2009.01.05.13.47]

[From Bill Powers (2009.01.05.0942 MST)]

Martin Taylor 2009.01.04.14.13 –

In my language,
“expectations” would be the prior probability distribution over
the hypotheses, whereas “background knowledge” is what leads to
those expectations.

OK. Both are established prior to testing the hypothesis.

I’m not keen on revising the
words conventionally used, much as you were not keen on finding a new
word to represent an internal signal called “perception” in
PCT. “Probability” as necessarily subjective is as closely
related to “probability” in everyday parlance as are the
technical and everyday use of “perception”.

I’m not sure of that. Don’t quantum physicists typically speak as if the
uncertainties involved in quantum phenomena are really there in nature,
rather than in the observer’s mind? Or do you exclude that from
“everyday” use? Don’t gamblers usually speak of probabilities
as if they were really there in the cards or the dice? Is a 30% chance of
showers tomorrow spoken of as if it’s in the forecaster’s perceptions –
or the viewer’s?

Lots of people talk about probabilities as though they were really out
there. That’s true of many perceptions, isn’t it? We don’t know whether
our perceptions match real reality for probabilities any more than they
do for anything else. Maybe they do, maybe they don’t. When a
forecaster says there’s a 30% chance of showers tomorrow, that
presumably is what he believes is the limit of his understanding and
what he wants us to believe if we base our estimate only on his word.
If he knew all of the physical conditions that applied to producing the
weather, his words would say exactly when and where each shower would
occur, wouldn’t they?

When gamblers think of probabilities as being “in the cards”, they are
working from a blend of models and past experience in developing their
subjective probabilities, are they not?

As for the quantum physicists, isn’t there an argument even within the
fraternity about the “reality” of the probabilities? Even there, there
are two sorts of probability to be considered: the probability that X
will be observed under condition Y, and the model of quantum physics
that asserts an intrinsic probability to lurk within the systems of the
very small. The first is a subjective probability, whose value may
involve the second if the person believes that model.

“Belief”, to me,
has a somewhat different connotation. Maybe it’s because I’ve used the
terms for 50 years, but “prior” and “posterior”
probabilities seem to have a nice precise connotation (limits between 0
and 1.0, the total of all mutually exclusive probabilities summing to
1.0, and so forth), as does “likelihood” (likelihoods don’t
necessarily sum to 1.0). Perhaps you can map “belief” onto
probability by normalizing. I don’t know.

OK, but when you start defining belief and likelihood so they can be
calculated from a set of definitions, it seems to me that this makes
them
something other than subjective. Is a Gaussian probability distribution
subjective? Or is it Gaussian for everybody, whether they believe it is
or not?

This is a bit like saying that 2 + 2 = 4 is a subjective opinion. It
is, but in most people’s minds it ties in with so much else that the
whole web of relationships is consistent enough to make it seem like an
objective reality. You are talking here about the subjective or
objective reality of mathematical operations. A Gaussian probability
distribution is a mathematical statement. Whether it applies in any
situation is another question. Whether mathematical operations exist
outside the mind is yet another, one that I don’t want to contemplate.

Let me use an example I like. Does the question “What is the
probability that the next stranger I see will be wearing brown shoes?”
make any sense to you? Is there such a thing as an “objective
probability” answer to that question? If so, what is the population of
“next stranger I see”, and how do I find the probability that a random
member of that population wears brown shoes? There is a definite
subjective probability answer, different perhaps for you and for me,
but it’s not a silly question if it is asked about your subjective
probability.

One of the definitions of probability as a mathematical concept is that
it is limited to a range of zero and unity. It would be quite possible
to define a related concept that ranged from zero to infinity, or from
minus to plus infinity. The zero to unity range is easy to work with,
and people usually don’t have much trouble with concepts that can map
onto it, such as 50-50 or “ten to one”. If I say I’d give ten to one on
something happening, I wouldn’t object if someone told me that this
meant I had assigned it a probability of 0.91.
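
That ten-to-one example is just the usual mapping from odds onto the zero-to-one scale; a two-line sketch (the function name below is mine, not standard terminology):

    def odds_to_probability(chances_for, chances_against):
        # "ten to one on" means 10 chances for against 1 chance against
        return chances_for / (chances_for + chances_against)

    print(odds_to_probability(10, 1))  # 0.909..., i.e. roughly 0.91
    print(odds_to_probability(1, 1))   # 0.5, the 50-50 case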

Probability has some requirements if it is to be a useful concept, such
as that if certainty is taken to be probability 1.0, then if a set of
possibilities is exhaustive in that one and only one of them must be
true, then their probabilities must sum to 1.0 exactly. That must be
just as true for subjective probability as for “objective” probability.
You can’t be certain that the sky will be cloudless or overcast when
you first look out tomorrow, but you can say there’s a 50-50 chance of
either (assuming that there are no possible intermediate cloudiness
states). It’s convenient also to arrange the probability scale
mathematically so that if B never happens except when A has happened, and
when A happens B happens p(B|A) of the time, then the probability of B
happening will be p(A)p(B|A). That’s a normal intuition about
probability, so why not scale it so that it works? The others of what
Jaynes calls “Desiderata” are of the same kind. They give expression to
what a lot of people think of as being natural properties of something
one would call “probability”.
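
That product rule can be checked with a few lines of simulation. The two probabilities below are made-up numbers chosen only for illustration:

    import random

    random.seed(1)
    p_A, p_B_given_A = 0.3, 0.5   # assumed values, for illustration only
    trials = 100_000

    # B never happens except when A has happened; when A happens,
    # B happens p(B|A) of the time.
    b_count = sum(1 for _ in range(trials)
                  if random.random() < p_A and random.random() < p_B_given_A)

    print(b_count / trials)   # comes out near p(A) * p(B|A) = 0.15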

So, it’s not true that “belief” and “likelihood” can be calculated from
a set of definitions, so much as that the properties of probability as
a mathematical concept are devised so as to agree with what most people
would think of as natural properties of probability, and those
properties in turn make it possible to do the calculations.

What you can say
about the
“White Swan” problem is that if the conditions remain the same,
the likelihood that the next swan you see will be white increases the
more swans you have seen without seeing a non-white one.

I think we can now say this in a way that doesn’t imply the objective
existence of some “likelihood” of seeing a white swan.

Sorry. I’m so used to thinking only of subjective probabilities that I
tend to omit the word “subjective”. A likelihood is the
probability of the hypothesis given the data,

… which means “the degree to which you will believe the hypothesis
given that you believe what the data appear to tell you?” The
shorthand way of putting it suggests strongly that there is an actual
probability of the hypothesis independent of the observer’s belief in
it,
and that the data have meaning independent of the observer’s
interpretation.

I think I accept your first sentence, but I don’t see how your second
relates to it. Maybe the second episode message will help.

I
meant only that the stage manager, following the rules of the script,
is
holding the black swan aside for the proper time, at which time it will
appear. The black swan already exists although we haven’t seen it yet.
“Whim” was the wrong word. Perhaps, like a superenergetic
cosmic ray, it will appear during only one observing period out of a
million. But appear it will, and it will show that not all cosmic rays
are “normal,” even though a given observer has seen only normal
ones in his lifetime. Its existence is not uncertain; only the time
when
it will be observed is.

Oh, but its existence IS uncertain to the person counting swans, even
if it is not to the stage manager. That uncertainty is why no amount of
counting swans that are white will ever lead to a probability 1.0 for
the hypothesis “All swans are white”, using the kind of analysis I
illustrated…
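
One way to see that numerically is a toy two-hypothesis version of the swan calculation. The 50-50 prior and the one-in-a-million rate below are assumptions for illustration, not the numbers from the first episode:

    # Posterior for "all swans are white" after seeing n white swans in a row,
    # against the alternative that one swan in a million is non-white.
    def posterior_all_white(n_white_seen, prior=0.5, rare_rate=1e-6):
        like_all_white = 1.0                        # every observed swan is white
        like_rare = (1.0 - rare_rate) ** n_white_seen
        num = prior * like_all_white
        return num / (num + (1.0 - prior) * like_rare)

    for n in (0, 1_000, 100_000, 1_000_000):
        print(n, posterior_all_white(n))   # climbs toward 1.0, never reaches it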

In effect, the
Bayesian
system is a model which is used to generate expectations about the
relative degrees of belief a person will arrive at on the basis of
successive subjective observations. While it creates a plausible
proposal, experiments have shown that no one model can explain
everyone’s
way of adjusting expectations on the basis of successive
observations.

True. But I think you misstate what the Bayesian system is. It isn’t a
model of what people do, but a model of the best they could
do.

Yes, I was really trying to make that point. It is, as you say, an
idealized method, invented by someone and adopted by others but not a
“natural” aspect inherent in the brain’s organization.

Correct. Neither is the Ideal Observer of psychoacoustics inherent in
someone’s brain. It’s the best that any mechanism, biological or
designed, could do under the specified circumstances. It doesn’t matter
how the organism or the mechanism works.

On the other hand, in a
lot of
situations, when something affects the performance of the ideal in some
way, people very often are affected in the same way.

In psychoacoustics, for
example, the ability of a person to detect a signal (say “yes”
when the signal exists and “no” when it doesn’t) as measured by
d’ has a well defined ideal value for any particular defined signal and
noise, provided that the prior information available to the ideal and
to
the human is the same.

But if you say “the same prior information” you’re simply
assuming a human being who “correctly” interprets prior
observations in the “correct” way, so you’re just comparing two
ideal computations, one made by you and one made by the subject.

No I’m not. I’m saying that no matter how the subject does it, the
result can never be better than the ideal, any more than a mechanical
engine can get more than an ideal engine out of a given temperature
drop. Ideal is ideal, a limiting possibility.

How did
the subject do it? You don’t know. Maybe the ones who do it best have
figured out some of the principles of Bayesian logic.

As my own subject in such experiments, I don’t know how the subject
does it. For sure, it’s not done with any conscious computation! You
just learn to hear something that turns out to be what you are supposed
to be hearing.

A well trained observer
may get within 6 db of the ideal, and a highly trained one within 3 db.

What kind of db are those? If you mean 10 Log(x)[base 10], 3 db is a
factor of two in the measure of x, and 6 db is a factor of 4, neither
of
which is very close.

Close is in the eye of the observer. 3 db is, for many people, about the
minimum difference that will allow them to say something is louder or
softer than something else, though trained subjects under good
conditions can probably do 1 db. What the ideal can do depends on the
length of the listening interval and on the information it has about
such things as the signal frequency. When you are measuring energy, 3db
is indeed a doubling of the energy. The point is that it takes a great
deal of training (like two months of 3 hours a day) for people to get
that close to the ideal when detecting a signal, and it takes some
training to come to within 6 db. When you are comparing with the ideal,
3 db is reasonably used to mean “half as precise” or the appropriate
analogue in other domains.
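
For reference, the decibel figures being traded here all come from the 10 log10(ratio) convention quoted above; a quick check in Python:

    from math import log10

    def db(ratio):
        return 10.0 * log10(ratio)

    print(db(2.0))   # about 3.01 db: a factor of two in energy
    print(db(4.0))   # about 6.02 db: a factor of four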

The problem with experiments like these is that they
create low-probability perceptions, and lead to interpreting
the
results as if they apply to all perceptions – most of which are very
high-probability perceptions. An accuracy of 3 db may be astonishingly
(or expectedly) good for a very noisy observation, but it’s terrible
for
an observation made under normal conditions.

I think you completely misunderstand. If the ideal finds a detection
easy, so will the human. If the ideal finds it difficult, so will the
human. In both cases, the human seems usually to be about the same
number of db worse than the ideal, the actual number depending on the
human’s listening ability. I have not the slightest idea what “an
accuracy of 3 db” could mean. Maybe you could explain that usage.

Sherlock Holmes can make
wrong
deductions from the given data, but for the same number of wrong
deductions he will make more correct ones than will most people. Even
he,
however, cannot do better than a Bayesian analyst working with the same
data and hypothesis universe.

That is true, to the extent that both live in a universe where it is
difficult to know what the data are.

No. It’s simply true. No qualification. No exemptions.

So, even if people don’t
actually behave in an ideal way, it’s not a bad thing to understand the
mathematically ideal limit to what they could do.

We part company when you assume that it is difficult most of the time
to
behave in an ideal way because of large uncertainties of observation.

Again, you completely misunderstand. It is you who insert “large
uncertainties of observation” into the discussion. Get rid of them. It
doesn’t matter in the least how large the uncertainties are. The ideal
is the ideal, and no human or machine can do better.

I will readily admit that Bayesian statistics are
appropriate in
cases where low probabilities are the norm. But I don’t admit that such
cases are predominant or even important in ordinary behavior – more
than
occasionally.

Could you explain how the mathematical relations depend on the
probabilities being low?

Personally, I would not
be
very surprised if it turned out that the peripheral neural structures
had
evolved to function in a near ideal way (within, say, 3db) when it
comes
to relating low-level perceptions to states of the external
environment.
(That speculation was at the base of my comments on neural spike timing
effects last month).

I think that normally we get a LOT closer than 3 db.

Demonstration, or reference, using the appropriate ideal performance
for the situation?

You couldn’t hum the
right note when the choir director blows into the pitch pipe if you
could
perceive the difference in pitch only within 3 db, a factor of 2. Most
people can get within a fraction of a half-tone, which is a fraction of
the twelfth root of 2 of an octave, or a fraction of 1/4 db.

I’m not sure what the ideal is for this example, but it has to depend
on the length of the sample and its SNR. To make the argument, you
would have to know the ideal for the experimental situation, and then
see if people could get within twice the ideal discriminable interval
(which I think would be a reasonable interpretation of a 3 db
difference, even though energy really doesn’t apply here).

If a
carpenter could measure distances only within 3 db, the house he is
building would be a random pile of sticks. Nobody would survive driving
a
car.

Oh, what ARE you talking about? What is the ideal for the carpenter
situation? What is it for the car driver? To reword those two
questions: What is the best that the best possible machine could do in
their situation? Then, “Do they come within being half as precise as
that ideal (3 db)?”

This is another way to see my reluctance to adopt your statistical
approach to perception. I think you are considering only a small part
of
the whole dynamic range of perception;

You seem to want to insist on this, for reasons unclear to me, rather
in the way Rick wants to insist that I want to extract individual
control parameters from group survey data. It’s as though for you there
is some threshold. A quantity less than epsilon is identically zero, it
seems, where epsilon is defined in whatever way suits your whim of the
moment. I don’t think of zero that way. The analyses work without
artificial thresholds. The interesting question is whether people work
with thresholds.

Anyway, the only claim I make is that no biological or mechanical
entity can outperform a properly conceived ideal mechanism, and that in
the situations where comparisons have been possible, people’s sensory
systems seem usually to mimic the behaviour of the ideal, but
consistently fall a little short – how short depends on training in
the situation.

···

As an aside, you might like the story of a couple of experiments done
by W. P. Tanner (Spike) in the late 50s when he was introducing signal
detection theory to psychologists, from radar engineering. In one
experiment, he was measuring frequency discrimination, simply whether a
tone moved up or down in pitch (so I imagine he had computed the ideal
observer for this condition). All his subjects except two learned to do
this quite well in one or two sessions. He assumed the two were not
tone deaf, because their native language was a tonal one, in which
pitch shifts made differences in meaning. They could not converse
properly in their native language without controlling their perceived
pitch shifts. Spike therefore assumed that their sensory systems could
discriminate up from down, but at some level they were not interpreting
it the same way when the tone shifts were divorced from voice. He kept
them in the experiment for many days (always letting them know after
each response what was the correct answer). Eventually, there came a
day for each of them when they cottoned on to what they were supposed
to be listening for, at which point their answers shifted from pure
chance to 100% correct within minutes. You have to learn to perceive
something consciously, even if you normally control it quite precisely.

The second experiment has something of the same flavour. In this one,
Tanner designed a binaural sound presentation that was unlike anything
anyone ever heard in the natural world. In each ear he played a ramp
tone lasting (I think) 100 msec – in other words, a “bip”. In one ear
the ramp rose, and in the other it dropped, so that the total energy
level was constant throughout the duration of the bip, but the apparent
placement swept from one side to the other. In this bip he placed a
short pulse. During this pulse, the energy level dropped to zero in one
ear, but went up to the same total energy level in the other ear. The
pulse might come early or late in the “bip”. Nobody could discriminate
whether it came early or late when they started the experiment, and
most people spent many days before they were able to discriminate the
difference. When they did, their scores went from chance to 100% over a
couple of sessions, and the difference between early and late was a
simple obvious perception. One subject took 44 days before this
happened, but it did happen for each of his subjects.

I draw no moral from this, but I thought it might interest you because
of the way people went from being quite unable to discriminate
differences that in the one experiment they must have been using and in
the other they had never heard before, to being able to discriminate
those differences as easily as discriminating red from blue.


The second episode of the Bayesian discussion has been a bit delayed
because I want to include some computed curves. I had bought a program
called iMathGeo to do this, but it had a bug, which the author has
apparently fixed this morning.

Martin

[Martin Taylor 2009.01.05.15.35]

[From Bill Powers (2009.01.05.1303 MST)]

Martin Taylor 2009.01.04.15.16 --

Also, I notice that even Bayesian logic seems to require dealing in qualitative events or propositions rather than quantitative measurements. If I propose that F/M = A, and observe that F/M = 10 and A = 9.9990, what is the value of p(F/M = 10 and A = 9.999 | F/M = A)? It's zero: A can't be equal to both 10 and 9.999 at the same time.

Of course, if your observations were not made with infinite precision, the story is a bit different, isn't it?

Of course. But the means of the observations will never exactly support the premise that F/M = A, so it is highly unlikely for the probability to be other than zero if we treat the statements as logical conditions.

That's why Bayesian Analysis does not apply to logical analysis. There's no probability assigned to the truth or falsity of A && B when A and B are individually known to be true or false.

All this would be obviated if the "given" condition were stated as, for example, F/M = A +/- 5%.

Now the given condition is no longer a logical variable but a quantitative statement with a specified range of uncertainty. If we know the exact distribution of the uncertainty, we can judge the probability that the left-hand term is true given the right-hand term; that probability will vary continuously with the difference between the observed value of F/M and the observed value of A. It can no longer be determined by the Bayesian calculation.

Why do you say that? This is precisely the situation where you CAN use the Bayesian calculations. It's when you have non-probabilistic logical statements that probability analyses are inappropriate.

If A is 9.9 we will get one probability; if it is 9.6 we will get a different probability. More exactly, we would have to know the distributions of all the stated equalities, but I think the point is clear.

The "distributions of all the stated equalities" perhaps should be restated as "the expected distributions of the observation errors of the observed quantities". The point is indeed clear. It is the point I was making when I said "Of course, if your observations were not made with infinite precision, the story is a bit different, isn't it?".

The rules governing continuous relationships are different from the rules governing logical conditions.

Yes, they are, but that's hardly relevant. The rules governing road traffic are different from the rules governing house heating, too.

Anyway, the start of the "second episode" of my Bayesian tutorial covers precisely this point of continuous distributions, in part because of the misleading essay for which Richard K gave the link, but in part because I had intended it when I set out the basic concepts, by sampling from the continuous hypothesis distribution in the calculations included in the first episode. And now that I seem to have the program that will draw the curves in question, I might be able to finish it tonight.

Martin

[From Mike Acree (2009.01.05.1309 PST)]

Martin Taylor 2009.01.04.14.13--

the Bayesian system . . . isn't a model of what people do, but a model
of the best they could do. It's a mathematical ideal

I don't know whether it's worth repeating, but this is a conclusion I
argued against in my paper. The problem, I contended, with taking
Bayesian theory as a model of inference is that it forces beliefs to
behave as chances. It constrains them to a scale with certainty at both
ends (once 0 and 1 are interpreted epistemically, which is what
inferential application does), and to an additive rule for their
combination (probabilities over defined alternatives must sum to 1). I
see Bayes' Theorem as a perfectly legitimate formula for solving certain
kinds of arithmetic problems. Once we get past the
bookbag-and-pokerchip exercises, however, I think the assumptions
required for proper application are met much less often than is widely
assumed. And I wouldn't be inclined to say, even in legitimate
applications, that Bayes' Theorem constituted a model of our thinking,
any more than I would say that matrix algebra modeled our thinking about
spatial transformation.

Of the large literature on the subject, I might mention just Gerd
Gigerenzer and David Murray's _Cognition as Intuitive Statistics_. In
addition to their standard critique, they raise the possibility that the
frequentist Neyman-Pearson approach may provide the better model for
thinking. The discrete jumps from one hypothesis to another
necessitated by its decision theory may at least be closer in a
descriptive sense to actual human reasoning than the gradual revision of
probabilities modeled in the Bayesian approach. Could any PCT
proponents say they had followed an incremental, Bayesian path in
reaching PCT, gradually revising their probabilities that it was true?

Bill Powers (2009.01.04.0812 MST) --

I now see the Bayesian concept of probability as the only plausible
one. All probabilities, all degrees of belief, are conditional on what
we already accept as true.

Keynes, a Bayesian before it was called that, was perhaps the first
modern probability theorist (his 1905 dissertation, published in 1921)
to argue that all probabilities were conditional, though he accords
priority for that idea to Ludwig M. Kahle, for a book published in 1735,
before Bayes himself. But I don't think the idea that all probabilities
are at least implicitly conditional need commit one to a Bayesian
concept of probability.

Martin Taylor 2009.01.05.13.47--

We don't know whether our perceptions match real reality for
probabilities any more than they do for anything else.

John Venn, in _The Logic of Chance_ (1888, p. 157n), offers this
illustration of the ambiguity in the metaphysical status of probability:
"The best example I can recall of the distinction between judging from
the subjective and the objective side, in such cases as these, occurred
once in a railway train. I met a timid old lady who was in much fear of
accidents. I endeavoured to soothe her on the usual statistical ground
of the extreme rarity of such events. She listened patiently, and then
replied, 'Yes, Sir, that is all very well; but I don't see how the real
danger will be a bit the less because I don't believe in it.'" The
story is also told of Niels Bohr that a reporter questioned him once
about the horseshoe hanging over his door. Bohr assured the reporter
that of course he didn't believe in such good-luck charms, but he had
heard that it helped even if one didn't believe.

Mike

[From Bill Powers (2009.01.05.1518 MST)]

[Sorry, I accidentally sent this before it was finished]

Martin Taylor 2009.01.05.13.47 –

I’m not keen on revising the
words conventionally used, much as you were not keen on finding a new
word to represent an internal signal called “perception” in
PCT. “Probability” as necessarily subjective is as closely
related to “probability” in everyday parlance as are the
technical and everyday use of “perception”.

I’m not sure of that. Don’t quantum physicists typically speak as if the
uncertainties involved in quantum phenomena are really there in nature,
rather than in the observer’s mind? Or do you exclude that from
“everyday” use? Don’t gamblers usually speak of probabilities
as if they were really there in the cards or the dice? Is a 30% chance of
showers tomorrow spoken of as if it’s in the forecaster’s perceptions –
or the viewer’s?

Lots of people talk about probabilities as though they were really out
there. That’s true of many perceptions, isn’t it? We don’t know whether
our perceptions match real reality for probabilities any more than they
do for anything else.

OK, so which usage is the one you mean when you refer to the meanings
“conventionally used?” I assumed you meant the everyday usage,
not the formal usage restricted to a few people. But you seem to mean the
latter.

How often you and I argue because of using words differently, or not
making clear which of several meanings is intended!

… which means “the degree
to which you will believe the hypothesis given that you believe what the
data appear to tell you?” The shorthand way of putting it suggests
strongly that there is an actual probability of the hypothesis
independent of the observer’s belief in it, and that the data have
meaning independent of the observer’s interpretation.

I think I accept your first sentence, but I don’t see how your second
relates to it. Maybe the second episode message will help.

I meant only that the
stage manager, following the rules of the script, is holding the black
swan aside for the proper time, at which time it will appear. The black
swan already exists although we haven’t seen it yet. “Whim” was
the wrong word. Perhaps, like a superenergetic cosmic ray, it will appear
during only one observing period out of a million. But appear it will,
and it will show that not all cosmic rays are “normal,” even
though a given observer has seen only normal ones in his lifetime. Its
existence is not uncertain; only the time when it will be observed
is.

Oh, but its existence IS uncertain to the person counting swans, even if
it is not to the stage manager.

Yes, that is what I’m getting at. You seem to be putting me on the
opposite side of the argument from the one I’m taking. The
“shorthand” way of stating the proposition is the one you gave:
“A likelihood is the probability of the hypothesis given the
data”. That way of saying it would suggest to all but the anointed
few that the hypothesis itself HAS a probability of being true, given the
self-evident meaning of the data. I want to see “likelihood”
expanded so it doesn’t sound like something that has an existence of its
own, and “data” so it doesn’t sound as if data mean the same
thing to everyone.

My bungled point about the stage manager was to show that subjective
probability does not determine what happens; if the black swan is going to
appear, it is already upstream floating toward the point of discovery,
unaffected by the expectations of the observer. The mere fact that the
observer has never seen a black one before is an accident; a different
observer may have done so. So subjective probability is truly subjective
and person-specific.

I still don’t see what we are really discussing there, but the following
brings out something very clearly:

True. But I think you misstate
what the Bayesian system is. It isn’t a

model of what people do, but a model of the best they could
do.

Yes, I was really trying to make that point. It is, as you say, an
idealized method, invented by someone and adopted by others but not a
“natural” aspect inherent in the brain’s
organization.

This is the start of a comedy of misunderstanding. I said “idealized
method,” meaning a mathematical form that is an idealization of some
natural form to which the idealized one is fit. You took this to mean
“ideal” in a different sense:

Correct. Neither is the Ideal
Observer of psychoacoustics inherent in someone’s brain. It’s the best
that any mechanism, biological or designed, could do under the specified
circumstances.

So now when you say “ideal” you’re thinking “the best that
can be achieved,” and I’m thinking “a convenient approximation
to the actual form.”

For example, I say

But if you say “the same
prior information” you’re simply assuming a human being who
“correctly” interprets prior observations in the
“correct” way, so you’re just comparing two ideal computations,
one made by you and one made by the subject.

and you say

No I’m not. I’m saying that no
matter how the subject does it, the result can never be better than the
ideal, any more than a mechanical engine can get more than an ideal
engine out of a given temperature drop. Ideal is ideal, a limiting
possibility.

What I was talking about was the idealization involved in saying that the
Bayesian calculation reveals the best possible performance. A much better
performance may be possible. A gambler can do much better than chance at
predicting the probability that a given horse will win, given the track
conditions, if he has discovered that the probability known to the bookie
comes from a delayed broadcast, while the one known to the gambler comes
from a phone call from a confederate at the finish line – five minutes
ago. The assumptions, in other words, on which the Bayesian analysis
depends, which lead to the idealization that the experimenter uses, may
not be the ones the subject is using, so the subject is operating on the
basis of a different idealization. Note that this idealization is in the
sense of which model of the process is being used, not in the sense of
the best possible performance under a fixed and known model. When you say
“no matter how the subject does it,” you are not considering
(and couldn’t consider) all the possible ways it can be done. You mean
“all the ways it could be done under the conditions we are tacitly
accepting in the background.”

Different subject:

A well trained observer
may get within 6 db of the ideal, and a highly trained one within 3 db.

What kind of db are those? If you mean 10 Log(x)[base 10], 3 db is a
factor of two in the measure of x, and 6 db is a factor of 4, neither of
which is very close.

Close is in the eye of the observer. 3 db is, for many people, about the
minimum difference that will allow them to say something is louder or
softer than something else, though trained subjects under good conditions
can probably do 1 db.

That makes no sense under my definition of a decibel, which is the one
used in electronics. People can’t tell if the level of a sound has
increased until it doubles? Perhaps you’re using a different definition
of a decibel.

What the ideal can do depends on
the length of the listening interval and on the information it has about
such things as the signal frequency. When you are measuring energy, 3db
is indeed a doubling of the energy. The point is that it takes a great
deal of training (like two months of 3 hours a day) for people to get
that close to the ideal when detecting a signal, and it takes some
training to come to within 6 db. When you are comparing with the ideal, 3
db is reasonably used to mean “half as precise” or the
appropriate analogue in other domains.

Well, I suppose this is what you really mean, but there are lots of other
perceptions beside sound intensity that vary along a continuum, and
untrained people can certainly detect much smaller differences than 1
decibel in many of them, defined as I define it – 10 log(x2/x1). Look at
the length of two sticks; anyone can see if a stick is longer or shorter
than a foot ruler within, say, a quarter of an inch, which converted to
decibels is 10 log (12.25/12) or 0.09 decibel. Speaking of pitch, the
ratio of A# to A is 466 to 440, a difference of about 6% or, on the decibel
scale, about 0.25 decibel. I would guess that almost anyone could hear that
difference. Perhaps the case you’re talking about, perception of sound
intensity, is special.
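
The arithmetic in that paragraph checks out under the same 10 log10 convention; a quick verification in Python:

    from math import log10

    def db(ratio):
        return 10.0 * log10(ratio)

    print(db(12.25 / 12.0))    # about 0.09 db: a quarter inch against a foot
    print(db(466.0 / 440.0))   # about 0.25 db: A# against A
    print(db(2 ** (1 / 12)))   # about 0.25 db: one equal-tempered semitone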

The problem with experiments
like these is that they create low-probability perceptions, and
lead to interpreting the results as if they apply to all perceptions –
most of which are very high-probability perceptions. An accuracy of 3 db
may be astonishingly (or expectedly) good for a very noisy observation,
but it’s terrible for an observation made under normal
conditions.

I think you completely misunderstand. If the ideal finds a detection
easy, so will the human. If the ideal finds it difficult, so will the
human. In both cases, the human seems usually to be about the same number
of db worse than the ideal, the actual number depending on the human’s
listening ability. I have not the slightest idea what “an accuracy
of 3 db” could mean. Maybe you could explain that
usage.

It’s the ratio of the estimate to the actual value, converted to
decibels, which is just a log scale.

Sherlock Holmes can make wrong
deductions from the given data, but for the same number of wrong
deductions he will make more correct ones than will most people. Even he,
however, cannot do better than a Bayesian analyst working with the same
data and hypothesis universe.

That is true, to the extent that both live in a universe where it is
difficult to know what the data are.

No. It’s simply true. No qualification. No exemptions.

Again, the confusion over “ideal.” Sherlock uses a DIFFERENT
METHOD of deduction from what an amateur uses, not the same method as the
amateur only with less uncertainty. The underlying principle of deduction
that Sherlock uses is superior to the one the amateur uses, and that is
why he gets the right answer when the amateur doesn’t, even given the
same data under the same conditions. That’s the difference I’ve been
trying to talk about; a difference in method, not in statistical
precision.

So, even if people don’t
actually behave in an ideal way, it’s not a bad thing to understand the
mathematically ideal limit to what they could do.

We part company when you assume that it is difficult most of the time to
behave in an ideal way because of large uncertainties of observation.

Again, you completely misunderstand. It is you who insert “large
uncertainties of observation” into the discussion. Get rid of them.
It doesn’t matter in the least how large the uncertainties are. The ideal
is the ideal, and no human or machine can do better.

Well, it does matter if you think of my meaning of ideal rather than
yours. Under your definition – the best that can be done under the
existing conditions – you are of course correct.

I will readily admit that
Bayesian statistics are appropriate in cases where low probabilities are
the norm. But I don’t admit that such cases are predominant or even
important in ordinary behavior – more than
occasionally.

Could you explain how the mathematical relations depend on the
probabilities being low?

They don’t, but if the probabilities are very high, the mathematics
becomes superfluous. The signal-to-noise ratio is then very high, meaning
that there is no need to calculate whether the probability of being right
is 0.99 rather than 0.995. That makes no significant difference in
perception or control. If the noise level is high, then the mathematics
becomes relevant because there are differences in probability that make a
difference.

Personally, I would not be
very surprised if it turned out that the peripheral neural structures had
evolved to function in a near ideal way (within, say, 3db) when it comes
to relating low-level perceptions to states of the external environment.
(That speculation was at the base of my comments on neural spike timing
effects last month).

I think that normally we get a LOT closer than 3 db.

Demonstration, or reference, using the appropriate ideal performance for
the situation?

You opened the door by saying “peripheral neural structures,”
which lets us consider all the senses. What about the senses that tell us
where on the retina an image of an object is relative to the image of
another object? The relative size of two objects? What about the phase
difference between two channels of hearing a bat’s chirp being compared
with each other? What about a carpenter measuring to cut an 8-foot length
of board? I could go on for an hour describing all the actions and
products of action generated by human beings which are under exquisitely
precise control which would be completely impossible if the uncertainties
in perception were anything like a factor of two of the mean.

You couldn’t hum the right note
when the choir director blows into the pitch pipe if you could perceive
the difference in pitch only within 3 db, a factor of 2. Most people can
get within a fraction of a half-tone, which is a fraction of the twelfth
root of 2 of an octave, or a fraction of 1/4 db.

I’m not sure what the ideal is for this example, but it has to depend on
the length of the sample and its SNR.

It doesn’t matter what the SNR is, because it’s high enough to permit
matching the tone within a very small fraction of a factor of 2:
certainly better than 5% because that much error would put you an entire
half-tone off, and you wouldn’t be in the choir long.

To make the argument, you
would have to know the ideal for the experimental situation, and then see
if people could get within twice the ideal discriminable interval (which
I think would be a reasonable interpretation of a 3 db difference, even
though energy really doesn’t apply here).

Again the difference in “ideals”.

If a carpenter could measure
distances only within 3 db, the house he is building would be a random
pile of sticks. Nobody would survive driving a car.

Oh, what ARE you talking about? What is the ideal for the carpenter
situation?

The ideal is for the carpenter to construct the house with an rms error
of less than 1/4 inch. Things like moldings can be adjusted to cover up
errors of that size. You’re asking what is the best that a carpenter
could possibly do given the existing uncertainties of measurement. I’d
say the uncertainties of measurement are so small that there’s no point
in asking what they are, other than to say that they’re a quarter inch or
less (usually less). For the Bayesian calculation to be needed, you have
to have uncertainties that are large enough to bother someone. Why else
go to the trouble? Most of the experimental data concerning probabilities
that I know about are taken with huge uncertainties deliberately
introduced so they make a big difference in performance. But in real
life, that doesn’t happen very often.

I’m really running down on this subject. It seems to me that I’m trying
to make a very simple and obvious point, but not having much success. The
discussion keeps developing side-issues that branch and proliferate while
the simple underlying idea gets obscured. The simple underlying idea that
I see is that most perceptions are essentially noise-free, so calculating
probability is a waste of time: all that matters is the mean and a small
standard deviation that represents minor errors of performance. Under
conditions of high noise or very small signal, both of which I have
encountered in my career as a designer of low-light-level television
systems, the noise does matter and has to be taken into account. I had,
at one time, a need to calculate the probability of detecting a moving
satellite right at the limit of detection, with a signal-to-noise ratio
of 1/10 or so. I might have done better if I had known Bayesian
statistics, but the rough-and-ready estimate proved close enough to the
observations for the purposes at hand – convincing the Air Force that
not one of the submitted proposals would meet the specifications set for
the requested system. Later on, the general in charge went ahead and
ordered one of the systems anyway, on what basis I can’t imagine. Sure
enough, the system didn’t work.

So I’m not against statistics where it’s needed.

Best,

Bill P.

···

What is it for the car
driver? To reword those two questions: What is the best that the best
possible machine could do in their situation? Then, “Do they come
within being half as precise as that ideal (3 db)?”

This is another way to see my reluctance to adopt your statistical
approach to perception. I think you are considering only a small part of
the whole dynamic range of perception;

You seem to want to insist on this, for reasons unclear to me, rather in
the way Rick wants to insist that I want to extract individual control
parameters from group survey data. It’s as though for you there is some
threshold. A quantity less than epsilon is identically zero, it seems,
where epsilon is defined in whatever way suits your whim of the moment. I
don’t think of zero that way. The analyses work without artificial
thresholds. The interesting question is whether people work with
thresholds.

Anyway, the only claim I make is that no biological or mechanical entity
can outperform a properly conceived ideal mechanism, and that in the
situations where comparisons have been possible, people’s sensory systems
seem usually to mimic the behaviour of the ideal, but consistently fall a
little short – how short depends on training in the situation.


As an aside, you might like the story of a couple of experiments done by
W. P. Tanner (Spike) in the late 50s when he was introducing signal
detection theory to psychologists, from radar engineering. In one
experiment, he was measuring frequency discrimination, simply whether a
tone moved up or down in pitch (so I imagine he had computed the ideal
observer for this condition). All his subjects except two learned to do
this quite well in one or two sessions. He assumed the two were not tone
deaf, because their native language was a tonal one, in which pitch
shifts made differences in meaning. They could not converse properly in
their native language without controlling their perceived pitch shifts.
Spike therefore assumed that their sensory systems could discriminate up
from down, but at some level they were not interpreting it the same way
when the tone shifts were divorced from voice. He kept them in the
experiment for many days (always letting them know after each response
what was the correct answer). Eventually, there came a day for each of
them when they cottoned on to what they were supposed to be listening
for, at which point their answers shifted from pure chance to 100%
correct within minutes. You have to learn to perceive something
consciously, even if you normally control it quite precisely.

The second experiment has something of the same flavour. In this one,
Tanner designed a binaural sound presentation that was unlike anything
anyone ever heard in the natural world. In each ear he played a ramp tone
lasting (I think) 100 msec – in other words, a “bip”. In one
ear the ramp rose, and in the other it dropped, so that the total energy
level was constant throughout the duration of the bip, but the apparent
placement swept from one side to the other. In this bip he placed a
short pulse. During this pulse, the energy level dropped to zero in one
ear, but went up to the same total energy level in the other ear. The
pulse might come early or late in the “bip”. Nobody could
discriminate whether it came early or late when they started the
experiment, and most people spent many days before they were able to
discriminate the difference. When they did, their scores went from chance
to 100% over a couple of sessions, and the difference between early and
late was a simple obvious perception. One subject took 44 days before
this happened, but it did happen for each of his subjects.

I draw no moral from this, but I thought it might interest you because of
the way people went from being quite unable to discriminate differences
that in the one experiment they must have been using and in the other
they had never heard before, to being able to discriminate those
differences as easily as discriminating red from blue.


The second episode of the Bayesian discussion has been a bit delayed
because I want to include some computed curves. I had bought a program
called iMathGeo to do this, but it had a bug, which the author has
apparently fixed this morning.

Martin


[Martin Taylor 2009.01.05.17.58]


[From Mike Acree (2009.01.05.1309 PST)]
Martin Taylor 2009.01.04.14.13--

  the Bayesian system . . . isn't a model of what people do, but a model
  of the best they could do. It's a mathematical ideal

I don't know whether it's worth repeating, but this is a conclusion I
argued against in my paper.

Did you? I’m sorry, I missed that. And on re-reading your paper after
seeing your message, I missed it again. Could you tell me a page
reference?

The problem, I contended, with taking
Bayesian theory as a model of inference is that it forces beliefs to
behave as chances.

Are “chances” the same as “probabilities”?

And I wouldn't be inclined to say, even in legitimate
applications, that Bayes' Theorem constituted a model of our thinking,
any more than I would say that matrix algebra modeled our thinking about
spatial transformation.

Nor would I. Nor have I.

But I don't think the idea that all probabilities
are at least implicitly conditional need commit one to a Bayesian
concept of probability.

No. Those two concepts are not independent, but I think the implication
goes in the other direction. If by “Bayesian concept of probability”
you mean “subjective probability” then it’s hard to avoid the notion
that all probabilities are conditional. If you start from the idea that
all probabilities are conditional, it in no way prevents you from
taking a frequentist attitude to probability. Of course, if you mean
something else by “Bayesian concept of probability”, what I said makes
no sense, and I don’t know what you are referring to.

Martin

[From Mike Acree (2009.01.06.0930 PST)]

Martin Taylor 2009.01.05.17.58--

  [From Mike Acree (2009.01.05.1309 PST)]

  Martin Taylor 2009.01.04.14.13--

    the Bayesian system . . . isn't a model of what people do, but a model
    of the best they could do. It's a mathematical ideal

  I don't know whether it's worth repeating, but this is a conclusion I
  argued against in my paper.

  Did you? I'm sorry, I missed that. And on re-reading your paper after
  seeing your message, I missed it again. Could you tell me a page
  reference?

You're right, I didn't say so in so many words.

  The problem, I contended, with taking
  Bayesian theory as a model of inference is that it forces beliefs to
  behave as chances.

  Are "chances" the same as "probabilities"?

Following Poisson and others, I was using "chances" in that context to
try to make clear that I was referring to aleatory probabilities. I did
argue in the paper (_very_ briefly, and not originally) that Bayesianism
provides a poor descriptive model of human reasoning; and I also at
least suggested, in my discussion of nonadditive probability (pp. 7-8),
that it doesn't provide a prescriptive model, either. One of the
reasons I gave was that a scale with certainty at both ends is unsuited
to contexts, rather common in practice, where we have little evidence
bearing on a question one way or the other. In fact, Shafer's book, _A
Mathematical Theory of Evidence_, shows that situations where additive
(aleatory, or frequency) probabilities apply are those with infinite
contradictory weights of evidence (as he defines weight). I went on, in
my brief mention of Pearl's work (p. 22), to express my doubt that human
inference in general can be formalized. I adverted there only to the
problem of causality, but I would say that the treachery of everyday
language also constitutes a sufficient reason for doubt. I
gave earlier the example of Ernest Adams' modus tollens:

  If it rained, it did not rain hard.
  It did rain hard.
  Therefore. . . .

Pearl gives this example:

  If the ground is wet, then it rained last night.
  If the sprinkler was on, then the ground is wet.
  Therefore. . . .

Sandra Harding points out the invalidity of the classical "All men are
mortal" syllogism: If we substitute Xanthippe for Socrates, we see that
"man" is being used in two different senses.

Or consider the difference between "A butcher is like a surgeon" and "A
surgeon is like a butcher." And so on. Followers of the CSGNet are
familiar enough with the problems of language.

  And I wouldn't be inclined to say, even in legitimate
  applications, that Bayes' Theorem constituted a model of our thinking,
  any more than I would say that matrix algebra modeled our thinking about
  spatial transformation.

  Nor would I. Nor have I.

Then (speaking of the linguistic barriers to formalizing thought) I'm
seriously misunderstanding you when you say

    the Bayesian system . . . isn't a model of what people do, but a model
    of the best they could do. It's a mathematical ideal

Mike

[Martin Taylor 2009.01.06.16.48]

[From Mike Acree (2009.01.06.0930 PST)]
I did
argue in the paper (_very_ briefly, and not originally) that Bayesianism
provides a poor descriptive model of human reasoning;

That, we can agree on.

and I also at
least suggested, in my discussion of nonadditive probability (pp. 7-8),
that it doesn't provide a prescriptive model, either.

On rereading those pages and a little of the surround, I disagree with
the way you look at it. The core quote is, I think: "In the
Anglo-American legal tradition, a trial starts from the presumption of
innocence. How can we represent this as a probability? If we say the
probability of guilt is 0, that means guilt is impossible. Anything
times 0 is 0, and no amount of evidence would ever overturn that
proposition. If we say ½, then we are saying the accused is as likely
to be guilty as innocent, which is hardly a presumption of innocence."

The way I see it, the presumption of innocence does not mean that we
start by having a strong prior probability that the defendant is
innocent, or have any other particular opinion. A prior of 0.5 seems
quite reasonable, given that the effect of choice of priors tends to
get washed out by data. What the presumption of innocence means to me
is that we will not make a decision to call him guilty until we have
strong evidence that he is. In other words, the presumption of
innocence is not a prior probability, but (in the language of signal
detection theory) a “decision criterion”.
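
To make that concrete, here is a minimal numerical sketch (with invented
numbers, purely for illustration): start from an indifferent prior of
0.5, update it with whatever evidence arrives, and declare guilt only if
the posterior clears a deliberately severe criterion.

def posterior_guilt(prior_p, likelihood_ratios):
    """Update a prior probability of guilt with a sequence of likelihood
    ratios P(evidence | guilty) / P(evidence | innocent)."""
    odds = prior_p / (1.0 - prior_p)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1.0 + odds)

REASONABLE_DOUBT = 0.99            # the decision criterion, set by policy

evidence = [4.0, 2.5, 10.0]        # invented likelihood ratios for exhibits
p = posterior_guilt(0.5, evidence)
verdict = "guilty" if p > REASONABLE_DOUBT else "not guilty"
print(round(p, 3), verdict)        # 0.99 guilty -- just over the criterion

On this reading, the presumption of innocence lives in REASONABLE_DOUBT,
not in the prior.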

One of the
reasons I gave was that a scale with certainty at both ends is unsuited
to contexts, rather common in practice, where we have little evidence
bearing on a question one way or the other.

I don’t see why such a scale is unsuited to situations where we have
little evidence. If you have little evidence one way or the other, your
subjective probability won’t be near either end of the scale.

I think you are mixing up the need to make a decision with the strength
of the evidence. In a Bayesian sense, you have two hypotheses: (1)
Defendant is guilty, (2) Defendant is innocent. You separately have two
possible courses of action (1) Defendant is declared guilty, (2)
Defendant is declared not guilty, which is different from being
declared innocent. In Canada we had a case last year in which a wrongly
convicted man asked the court to declare him innocent, but the court
said that was impossible, because there is no such decision in law.
However, they could declare him not guilty, and recommend he be awarded
a substantial sum as compensation for the time he spent in jail. (In
Scottish jurisprudence, you have a third possible course of action (3)
Case is declared “not proven”).

In general, if you have little evidence to discriminate between the two
(or three) hypotheses, you won’t have a prior probability near either
end of the scale, whether the scale goes from zero to unity or from
plus to minus infinity. When you must make a decision to choose one
hypothesis or the other, you make it based on how whatever evidence you
have has affected your prior probability, as well as on how the
decision might (in imagination) influence other variables you control
– typically called costs and benefits of each possible decision.

The court’s problem is one of signal detection, in which there is a
statutory requirement for the decision criterion to be strongly
weighted toward the “not guilty” decision. In signal detection theory
as it is used in psychoacoustics, the decision criterion is called
“Beta”. In some experiments Beta is manipulated by various disturbances
to variables that appear to be in the control loop for which the output
is the subject’s response to observations about which she perceives
herself to be uncertain. I expect to go into this in a future episode
of my Bayesian notes.

I have no problem with any kind of rescaling of “probability” that
transforms the 0 to 1 scale into a scale that goes from plus to minus
infinity (as we do with z-scores in some analyses). The formulae will
change, but that can be useful. In Bayesian analysis, we often use log
likelihoods rather than zero-to-one likelihoods. We do that because it
allows us to sum the gain from successive increments of data rather
than multiplying, as must be done with the raw likelihoods. It’s a
convenience. The log likelihood scale goes from zero to negative
infinity. Unity likelihood is zero log-likelihood, and zero likelihood
becomes negative infinity log-likelihood. If some other transformation
is convenient for other purposes, such as to help someone to understand
what’s going on, that’s no problem, either.
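
As a small illustration of that convenience (invented numbers), summing
log likelihoods over successive observations gives the same result as
multiplying the raw likelihoods, while avoiding numerical underflow on
long data streams:

from math import log, exp, prod

raw_likelihoods = [0.8, 0.3, 0.6, 0.9]     # per-observation likelihoods

product_form = prod(raw_likelihoods)
sum_form = exp(sum(log(p) for p in raw_likelihoods))

assert abs(product_form - sum_form) < 1e-12
print(product_form)   # 0.1296 either way; a zero likelihood would map to
                      # negative infinity on the log scale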

    And I wouldn't be inclined to say, even in legitimate
    applications, that Bayes' Theorem constituted a model of our thinking,
    any more than I would say that matrix algebra modeled our thinking about
    spatial transformation.

    Nor would I. Nor have I.

  Then (speaking of the linguistic barriers to formalizing thought) I'm
  seriously misunderstanding you when you say

    the Bayesian system . . . isn't a model of what people do, but a model
    of the best they could do. It's a mathematical ideal

How do you understand it when I say that the Carnot cycle is a
mathematical ideal of a heat engine, an ideal that makes explicit a
limit that any heat engine, no matter how constructed, cannot improve
upon? Do you understand me to be advocating that real engines should be
built so as to be thermodynamically reversible? Or to be saying that
they actually are?

Martin

PS. Bill P [From Bill Powers (2009.01.05.1518 MST)] queried my claim
that untrained observers need about a 3 dB increment in an acoustic
signal before being able to tell it is louder, whereas trained
observers need about 1 dB. His comment made me wonder whether I was
right or had misremembered it and mixed it up with something else. I
can confirm the 1 dB for trained observers listening to moderate-level
narrow-band signals at high signal-to-noise ratio, but it’s very hard to
find published data for untrained observers, because when you have
gathered enough data for one individual to allow a reasonable estimate,
the observer has become at least partly trained, and if you take only
early data for a lot of individuals you just get a schmear that tells
you very little about anything. So I can’t support my 3 dB claim. I
suspect that, like a lot of what an experienced practitioner “knows,” it
is “lab lore” and not supported by clean data. Or else it’s just a bad
remembery. :-)
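
For concreteness, the decibel figures translate into intensity ratios
via the standard conversion, power ratio = 10**(dB/10); this is just
arithmetic, not new data:

for db in (1.0, 3.0):
    ratio = 10 ** (db / 10.0)
    print(f"{db} dB -> power ratio {ratio:.2f}")
# 1.0 dB -> power ratio 1.26 (about a 26% increase in power)
# 3.0 dB -> power ratio 2.00 (roughly a doubling)

So the claim amounts to saying that trained listeners can hear about a
26% increase in power, while untrained listeners may need the power
roughly doubled.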

[From Bill Powers (2009.01.06.1557 MST)]

Martin Taylor 2009.01.06.16.48 –

The way I see it, the
presumption of innocence does not mean that we start by having a strong
prior probability that the defendant is innocent, or have any other
particular opinion. A prior of 0.5 seems quite reasonable, given that the
effect of choice of priors tends to get washed out by data. What the
presumption of innocence means to me is that we will not make a decision
to call him guilty until we have strong evidence that he is. In other
words, the presumption of innocence is not a prior probability, but (in
the language of signal detection theory) a “decision
criterion”.

I have to side with you on this. Innocence of a crime is not a condition
that needs to be supported with evidence. What has to be proven is that
the defendant did what is charged. In the absence of such proof there is
no case.

Presumption of innocence does not mean starting with a probability of
innocence of 1.00 and then revising it in the light of evidence. The a
priori condition of innocence is irrelevant; it’s not taken into account.
The jury determines the probability of guilt given conditions like
“defendant had a loaded gun” and “one shot had been fired from the gun”
and “defendant had opportunity and motive” and “someone reports seeing
the defendant shooting the victim AND is a truthful person.” If the
probability of guilt is less than a threshold (reasonable doubt), the
defendant is declared not guilty; the prosecution failed to make its
case. The Scottish verdict is actually superfluous, because the court
never declares the defendant innocent: the charges are simply dropped
and, save for damages to the falsely accused, it is as if they had never
been brought. I suppose the Scottish verdict leaves open avenues for
further prosecution that “not guilty” does not allow.
All that said, there is another subject that we have all been
overlooking. I was jogged by reading a student paper that Warren Mansell
asked me to look over. In this paper, the following passage appears:

···

=========================================================================
In everyday life, we must constantly make probabilistic inferences
concerning the meaning of what is happening around us, for example when
the bus pulls up at the bus stop to take one to one’s destination; if
the bus is displaying its route number it is most probable the bus will
be travelling to where one expects. If, however, the display reads an
unfamiliar destination number, but there is a piece of paper in the
front window with the expected destination number, it is most likely
the display machine is broken.
=========================================================================

I don’t deny that logical inferences take place, but not every logical
inference that might be made is made. Some people do more interpreting of
experience than others do. Yet they all do something when the bus
arrives. Mostly, they line up and get on it. If it doesn’t end up where
they expected, they complain and find a way to reach their destinations
anyway. They don’t have to think at all about probability.

So it’s simply not true that because it is possible to draw probabilistic
inferences from observations, we “must” do so in order to
behave. I ran into this same sort of problem in conversations about
choice and decisions. On my way to work, I come to a T-junction where I
can turn left or turn right. My destination lies to the left. Do I
have to make a decision as to whether I should turn left or right? If I
were in conflict or didn’t know which way my destination lay, the answer
would be yes: I would have to make a choice, and it would be a little
hard because I lack information or am in conflict. But if I know the
direction of the
destination, why would I even consider turning away from it? I wouldn’t.
There is no decision to be made. I know the way and I go that
way.

Control systems do not have to consider probabilities or make decisions
or predict anything. They compare a perception to the given reference
signal, and on the basis of the error, they act. If there is a
significant amount of noise in the perception, they simply act a little
erratically, as if there is a random disturbance acting on the controlled
variable. All this stuff about Bayesian probability applies only if, at
the level where we behave according to rules of logic or mathematics or
culture, we perform certain calculations and act on the basis of their
results. We don’t have to do that, but we can if we want to.
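
A toy simulation illustrates the point (a generic proportional
controller with an integrating output and invented gains, not any
particular published model): the loop computes no probabilities, and
noise added to the perceptual signal shows up only as a small erratic
wobble in the controlled variable.

import random

random.seed(1)
reference, gain, dt = 10.0, 5.0, 0.1
disturbance = 2.0                  # constant external push on the variable
output = 0.0
controlled = 0.0

for step in range(100):
    perception = controlled + random.gauss(0.0, 0.3)   # noisy perception
    error = reference - perception
    output += gain * error * dt    # integrating output function
    controlled = output + disturbance
    if step % 20 == 0:
        print(step, round(controlled, 2))

With these made-up numbers the controlled variable settles near the
reference of 10.0 within a couple of dozen steps and thereafter jitters
slightly, on the order of the perceptual noise, which is just the
“acting a little erratically” described above.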

After we have behaved, it is of course possible to examine the details
and come up with explanations that show how the behavior would have been
produced if it had been the result of decision-making or cultural
influences or probability calculations. However, unless we can prove that
it WAS generated in that way, there is no reason to believe those
explanations. It’s perfectly possible, indeed likely, that there are ways
other than calculating to arrive at behavior organized in the optimal way
according to Bayesian analysis or signal-processing principles or
decision theory. That doesn’t mean that these concepts have anything to
do with the way behavior works. The mistake is in offering an explanation
without having a good Bayesian reason for thinking it applies.

Best,

Bill P.


[From Mike Acree (2009.01.06.1607 PST)]

[Martin Taylor 2009.01.06.16.48]--

  In general, if you have little evidence to discriminate between the two
  (or three) hypotheses, you won't have a prior probability near either
  end of the scale, whether the scale goes from zero to unity or from
  plus to minus infinity. . . .

  I have no problem with any kind of rescaling of "probability" that
  transforms the 0 to 1 scale into a scale that goes from plus to minus
  infinity

I agree that such rescalings are immaterial; Jeffreys pointed out long
ago that we could equally well measure (subjective) probabilities on a
scale that was the logarithm of what we currently use. But the
nonadditive "probabilities" of Bernoulli, Dempster, and Shafer are
something else. The additive, frequency scale is unable to distinguish
the cases (a) a hypothesis and its contrary are supported by strong
evidence which is evenly balanced (e.g., testimony of conflicting
authorities) and (b) there is no evidence bearing on the question one
way or the other. I think you are saying that discrimination doesn't
matter to you. (I should add that I don't see much practical value in
attempts to develop a calculus of the Dempster-Shafer nonadditive
probabilities.)
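
A tiny illustration of the distinction (invented numbers, loosely in the
Dempster-Shafer style, not taken from the paper): a single additive
probability near 0.5 looks the same in both cases, whereas a
belief/plausibility pair keeps them apart.

# Case (a): strong evidence on both sides, evenly balanced.
balanced = {"belief_H": 0.45, "belief_not_H": 0.45}

# Case (b): essentially no evidence bearing on H either way.
ignorant = {"belief_H": 0.0, "belief_not_H": 0.0}

for name, m in (("balanced", balanced), ("ignorant", ignorant)):
    plausibility_H = 1.0 - m["belief_not_H"]   # what the evidence leaves open
    print(name, (m["belief_H"], plausibility_H))
# balanced -> (0.45, 0.55): narrow interval, lots of conflicting evidence
# ignorant -> (0.0, 1.0):   wide-open interval, no evidence at all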

  How do you understand it when I say that the Carnot cycle is a
  mathematical ideal of a heat engine, an ideal that makes explicit a
  limit that any heat engine, no matter how constructed, cannot improve
  upon? Do you understand me to be advocating that real engines should be
  built so as to be thermodynamically reversible? Or to be saying that
  they actually are?

The former; you had used the words "mathematical ideal":

  the Bayesian system . . . isn't a model of what people
  do, but a model of the best they could do. It's a mathematical ideal

I had responded:

  And I wouldn't be inclined to say, even in legitimate
  applications, that Bayes' Theorem constituted a model of our
  thinking, any more than I would say that matrix algebra modeled our
  thinking about spatial transformation.
  
To which you responded:

  Nor would I. Nor have I.
        
So I understand you to be saying (a) the Bayesian system is a model of
human thinking, in the sense of a mathematical ideal, and (b) that you
would never say any such thing. When I denied that Bayesianism was a
model of human thinking, I was using "model" in the sense you say you
intended. I can only infer that you think I tacitly switched to the
empirical meaning of "model"?

Mike