[From Bill Powers (2009.03.06.1035 MST)]
[Martin Taylor 2009.03.06.00.04]
[WTP] I usually wait until I’m
pretty sure, and devote my efforts before then to making the decision
unnecessary just yet. The main thing is that I know what the outcome is
quite likely to be if I decide prematurely: wrong. If you have only a 60%
chance of choosing right, you might as well not choose, because your
estimate of the chances is probably off by a large amount, too.
[MMT] That’s your privilege, of
course, but if it happened to be a tiger rather than your friend who made
that leaf rustle, you’d be dead by then.
Made-up positive examples don’t prove your case. For every one of those,
there are many negative examples. The conditions, remember, are that it
is almost equally likely that your response will be wrong: if there’s
only one possible place the tiger could be when you hear a leaf rustle,
then by all means run for your life in a direction away from the tiger,
assuming you’re pretty sure what direction that is. But if running could
equally well take you toward the tiger, it doesn’t matter whether you act
or not: you have a fifty-fifty (or 60-40 or 40-60) chance of being dead
if you act, even sooner than if you did nothing. You may take some comfort
from knowing that you at least did something, but we’re talking
about cases in which you will do the right thing only about half of the
time, and in which doing the wrong thing is very bad.
Think about your everyday
life. Do you never have to make choices by deadlines, whether or not you
have enough data to be quite sure of the best choice?
I think of a way to put off the deadline, because I know that just acting
in order to be acting is futile, and uses up resources I might need. I
have much more faith in my ability to handle whatever comes up than I do
in my ability to correctly forecast the future and have the appropriate
action all ready. For one thing, having the appropriate action all ready
can be a terrible handicap if what happens is different from what you
expected. That can be much worse than doing nothing but remaining
alert.
Do you never vote (I assume you
never have much really solid information about most of the candidates for
office)?
Yes, of course, and it’s been a long time since I had a hard time
deciding. When I know nothing about candidates on a ballot, I just skip
that item. Why bother, when half of the time I will make a choice that
turns out exactly wrong?
One of the few things I
remember from my short days as a student of Industrial Engineering was
the professor pointing out that “Inaction can be extreme
action”.
Sure, it can be. But is it usually? You can always make up a scenario
under which any proposal would be correct. But that doesn’t tell us what
sort of policy to follow in general. In general, I would say that
inaction has many advantages over action when you’re not reasonably sure
of what the right action would be.
and anyway the probability
of a correct guess is so close to 1 by then that it can’t really be
measured in any reasonable number of trials.
That was precisely what made me interested in the Schouten study in the
first place. It is indeed extremely difficult to determine the
detectability or discriminability of obvious things like bright lights or
the locations of well separated things. I saw the Schouten technique as a
possible way to get at these very detectable or very discriminable
signals, which orthodox techniques cannot touch for the reason you
mention.
I need to ask again what the actual observed variable was in these
experiments. Wasn’t it a count of correct and incorrect responses at each
delay time? Isn’t that why you need so many trials when the probability
of a wrong response gets down to 0.003? That’s the probability at 3
sigma. As I’ve been understanding this, the measure of d’ is derived from
the measured probabilities of getting a correct answer, which is Nc/N,
where Nc is the number of correct choices and N is the total number of
choices at a given delay. Since you have to measure those probabilities,
you’re not gaining any advantage by the Schouten approach.
It seems that d’ is derived from those probability measurements, not the
other way around. Given the probability of a correct choice, we can use
the tables to find out how many standard deviations over the noise level
correspond to that probability, assuming a Gaussian distribution of the
noise. From this we can deduce what the underlying noise level is. This
seems to fit the data, since the probabilities you mention that go with
different values of d’ approach certainty very rapidly with each added
increment in d’. The straight lines in Schouten show that the
distribution is Gaussian.
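As an illustration of that derivation (a minimal sketch in Python, assuming
the standard equal-variance Gaussian model for a two-alternative task, in
which Pc = Phi(d'/sqrt(2)); scipy's norm.ppf plays the role of the tables):

# Sketch: d' derived from the measured proportion correct, Nc/N,
# under the equal-variance Gaussian two-alternative model (assumed).
from scipy.stats import norm

def d_prime_from_counts(n_correct, n_total):
    """d' implied by Nc/N, using Pc = Phi(d'/sqrt(2))."""
    p_correct = n_correct / n_total
    return 2 ** 0.5 * norm.ppf(p_correct)

# E.g., 85% correct (as at the 270-msec delay) implies d' of about 1.47.
print(d_prime_from_counts(85, 100))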
Well, you see what such a
process can do in the Schouten data, when you look at the graph for about
270 msec delay. By that time, the "processor" has half the
information it has by 320 msec. It is able to get (from memory) about 85%
correct button presses.
This estimate of how much information exists at each delay depends very
strongly on how you think this detection works. What you describe can
also be imitated by a series resistor terminated by a capacitor connected
to ground, followed by a threshold detector. To characterize the voltage
on the capacitor at a given time as “information” is simply a
matter of what computation you choose to use. What you call
“processing” is simply a flow of current into the capacitor.
This is why I ask what model you’re assuming. If we’re just talking about
an R-C circuit, the concept of information is overcomplicating something
very simple. You can think of a computer accumulating digital information
over time and reasoning out what it means on the basis of the information
so far received, or you can just think of a capacitor charging up. The
latter could be much closer to what is actually happening; all that
“processing” would then be a figment of the imagination that
adds complexity without increasing our understanding.
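To make the comparison concrete, here is a minimal sketch (toy numbers of my
own, not a fitted model) of the R-C alternative: a step input charging a
capacitor through a resistor, with Gaussian noise and a fixed threshold. The
"accumulating information" is nothing but the voltage rising toward its
asymptote:

import numpy as np

# Toy R-C "detector": a unit step charges the capacitor; a threshold
# detector fires when the noisy voltage first crosses a fixed level.
rng = np.random.default_rng(0)
dt = 0.001             # seconds per step
tau = 0.1              # R*C time constant, seconds (assumed)
threshold = 0.5        # detector threshold (assumed)
noise_sd = 0.05        # noise on the voltage (assumed)

v = 0.0
for i in range(1000):
    v += (1.0 - v) * dt / tau            # current flowing into the capacitor
    if v + rng.normal(0.0, noise_sd) > threshold:
        print(f"detected at {i * dt * 1000:.0f} msec")
        break

Run forward, the percentage of correct detections at each delay falls out of
the physics, with no computation of information anywhere in the circuit.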
More generally, now knowing
something of PCT, I would equate “processor” with the complex
of control loops that interact to enable the subject to perform in the
experiment as the experimenter asks.
I would equate it with something much simpler, since this experiment is
focused on just one perceptual input function. The other complexities are
there in the background, but the experiment is not measuring
them.
“Four and years our”
doesn’t help me at all toward “Four score and seven years ago our
forefathers…”
However, it might help you perceive that neither “To be or not to
be” nor “We shall fight on the beaches” was the quote from
which some words had been obscured by noise bursts.
But there is nothing in that stream of words that tells me how much, if
anything, is missing from it. The original sentence could be “four
and twenty blackbirds, a verse from the years of our childhood.”
Information is not interchangeable, so it is not as though we can get half of
the way to understanding given ANY half of the bits. When we actually have half of
the information, we don’t know that, since we have no idea how much we
have missed or how much is yet to come. “On the nose” does not
give us half of the information in “Mary kissed John on the
nose” or “Mary socked John on the nose” or “That is
right on the nose.”
Possibly if someone
mentioned “forefathers”, that might even be enough to cue you
into recognizing which words you missed.
Or that could completely mislead you if the original sentence was another
one I mentioned. Information theory has nothing to do with meaning; it’s
just a way of characterizing channel capacity, no matter what message is
being carried by the channel. “John hit Mary” has the same
number of bits of information as “Mary hit John” but the two
clearly do not mean the same thing.
Shannon did determine that
the redundancy of English is about 50%, meaning that in a long passage
with a random 50% of the words substituted by gaps or blackouts (so you
know where they were), you will be able to correctly replace most of the
missing words. But that really has nothing to do with the case at hand,
which is the rate of gain of information over time. Your example might
have been a slightly better analogy if you had said “Four score and
seven” rather than “Four an years our”.
How would that help us arrive at the rest of the sentence, which is
“… is a quotation from Lincoln”? “Half of the
information,” as you use the term, implies that any half of the bits
will do, in or out of sequence, with large gaps or small.
While the concept of channel
capacity is relevant to analog systems as well as digital ones, its
meaning is rather different when amplitude measures matter as much as
durations and times of occurrence. In the Schouten case above, the
amplitude of the perceptual response to the light matters because the
part of the curve that is near the noise level does not distinguish
between members of a family of curves having the same measured initial
slope, and the channel capacity that actually exists can’t be calculated
without knowing which member of that family is present.
I would prefer to continue this kind of discussion if I was assured that
you understood the analysis that allows us to equate d’^2 with
information. It’s on page 13-14 of my Bayesian_Seminar_2.pdf. Shannon’s
expression
, which you mention from the wiki page, is central to the argument. Given
that, the equation of d’^2 and information follows from the fact
that d’^2 = 2E/No under the same “ideal observer”
conditions that apply to Shannon’s formula, where E is the signal energy
and No the noise power per unit bandwidth.
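(To keep the algebra on the table: as I read it, the relation invoked is
Shannon’s C = W log2(1 + S/(No*W)) together with the ideal-observer result
d’^2 = 2E/No. For a signal of energy E observed for time T, so that S = E/T,
the information in the whole observation is

C*T = W*T * log2(1 + E/(No*W*T)),

and in the weak-signal limit this tends to

C*T -> (E/No) * log2(e) = (d’^2 / 2) * log2(e) bits.

That, as I understand it, is the arithmetic.)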
I don’t think you understand the point I am making. The actual channel
capacity can be calculated from the measured bandwidth. I showed that the
same model will yield the same information content as you measure it even
when the actual channel capacity varies enormously. That is because the
1/e rise time of the perceptual signal can generate initial rise rates of
the signal that are indistinguishable from one another even if the 1 -
e^-kt waveforms have greatly different values of k, which determines
the bandwidth. If you’re only looking at the first 5% or less of the
total rise, the exponential differs from a straight line by at most 1
part in 400, which would never be detected in this experiment.
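A quick numerical check of that claim (a sketch with assumed rate constants;
the deviation depends only on the product k*t, so it comes out the same for
every k):

import numpy as np

# Compare 1 - e^(-k*t) with its straight-line (initial-slope) fit over
# the first 5% of the rise, for widely different rate constants k.
for k in (1.0, 10.0, 100.0):                # assumed values of k, 1/sec
    t_end = -np.log(0.95) / k               # time at which 5% of the rise is done
    t = np.linspace(0.0, t_end, 100)
    exact = 1.0 - np.exp(-k * t)
    line = k * t                            # straight line through the origin
    print(k, np.max(np.abs(line - exact)))  # ~0.0013 of full scale, every time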
I was hoping you would look at my argument and realize for yourself that
this is a large loophole in the Schouten experiment. The conclusions you
draw about the first 130 milliseconds apply only to that first 130
milliseconds. They say nothing about the actual channel capacity of the
perceptual input function you’re trying to characterize. The actual
channel capacity depends on how the perceptual signal behaves even after
the contribution of noise has become only a small part of the total
signal. If you were to apply a sine-wave disturbance to the light
intensity, you would find that the perceptual signal would have a much
smaller amplitude and a much larger phase lag than you would expect from
the channel capacity you measure over 130 milliseconds after a step-rise
in intensity. That is because most of the perceptual signal variation
would occur when the signal is far above the noise level and represents
the light intensity very accurately (save for nonlinearities). Your
approach does not consider that the signal contains any information after
the second or third sigma of excess over the noise level. Relative to the
assigned task, which is merely to report “detected”, that may
be true. But it has nothing to do with the channel capacity of that input
function, which is a physical property that doesn’t depend on what the
signal is or what use is made of it.
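Here is the kind of discrepancy I mean (a sketch with assumed numbers): two
members of the family A*(1 - e^(-k*t)) with the same measured initial slope
A*k, and hence the same early rise, respond very differently to a sine
disturbance, since the gain is A*k/sqrt(k^2 + w^2) and the phase lag is
atan(w/k):

import numpy as np

# Steady-state sine response of a first-order lag whose step response is
# A*(1 - e^(-k*t)). The initial slope A*k is held fixed across the family.
f = 2.0                            # disturbance frequency, Hz (assumed)
w = 2.0 * np.pi * f
slope = 10.0                       # shared initial rise rate A*k (assumed)
for k in (1.0, 100.0):             # assumed rate constants, 1/sec
    A = slope / k                  # amplitude chosen to keep the slope fixed
    gain = A * k / np.sqrt(k ** 2 + w ** 2)
    lag_deg = np.degrees(np.arctan(w / k))
    print(f"k={k}: gain={gain:.3f}, phase lag={lag_deg:.0f} deg")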
I would be much more interested in an argument that forces you to
equate d'^2 with information, and that pins down a definition of
information that is not simply the equation by which you calculate it. I
just don’t think that “Your girlfriend is 18 years old”
contains as much information as “Your girlfriend is 17 years
old,” if by “information” you mean “important
implications.” If you’re just counting bits, that’s fine: sometimes
we might want to know how many bits per second a given channel can
transmit, even not knowing which bits they will be. But that kind of
knowledge is a far cry from what we normally mean by information, which
is what the string of bits conveys by way of meaning. Shannon hijacked
the term, but I don’t admit that he got away with it. If he had just
called the units “Shannons” we would be much better off. The
connection with the term “information” is entirely
gratuitous.
It is certainly not the case
that an analogue circuit could “work effectively only on the first
130 milliseconds of the signal”, especially when that part of the
signal gives no information about what the signal does after reaching
that level.
No, but it seems to be the case that at least some of the human sensory
systems have that limitation, at least with static signal bursts such as
our tones and light flashes.
But you’re finding out only that the signal-to-noise ratio is lowest in
the first 130 milliseconds, not what the control system using that signal
does on the basis of its changing (and accurately-perceived) amplitude
after the uncertainty has dropped essentially to zero. You’re thinking
that the only thing such a signal could be used for is to trigger a
simple act, a “response.” But if, instead of a button, we had
provided the subject with two potentiometers to turn, and told the
subject to keep both lights below some particular brightness (perhaps
compared with another light), the first 130 milliseconds wouldn’t matter
much because most of the control action would be taking place after that.
Yet the perceptual input function could be the same physical one measured
in the Schouten experiment.
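A minimal simulation of that alternative task (all numbers assumed; just an
integrating control loop acting on a perceptual signal that is well above the
noise for nearly the whole run):

import numpy as np

# Toy control loop for the potentiometer version of the task: the "subject"
# integrates the error between a reference brightness and the (noisy)
# perceived brightness. Control action continues long past 130 msec.
rng = np.random.default_rng(1)
dt, gain = 0.001, 5.0              # step size, sec; output gain (assumed)
reference = 0.3                    # keep perceived brightness near this level
disturbance = 1.0                  # the apparatus drives the light bright
output = 0.0
for i in range(2000):
    perception = disturbance + output + rng.normal(0.0, 0.02)
    error = reference - perception
    output += gain * error * dt    # integrating output function
    if i % 500 == 0:
        print(f"{i * dt * 1000:.0f} msec: perception = {perception:.2f}")

The first 130 msec contribute almost nothing here; the quality of control
depends on how accurately the signal tracks the light long after detection
has become certain.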
Garner found the same
thing in the 1950s, when measuring channel capacities was all the rage.
He was looking at information integration in hearing. He found the same
as we did a decade later, that there is a break at around 130 msec
between complete information integration and a following slower
rise.
My objection is simply that if you use the data to calculate
“shannons,” that is all you will get: an arbitrary measure of a
physical process, or part of it. It’s like deciding to measure something
in units of quarts per log(parsec). Sure, you can probably come up with a
formula for converting a physical measurement into those units, but
what’s the justification? That’s what I see as missing from these
applications of information theory that go far beyond channel capacity.
Is the fact that you CAN compute information capacity any indication that
you should? Or that what you’re computing has anything to do with any
independent definition of “information?”
Best,
Bill P.