[Martin Taylor 2009.01.10.11.00]
[From Bill Powers (2009.01.10.0508 MST)]
[Mike Acree (2009.01.09.2219 PST)]
[Martin Taylor 2009.01.08.10.59] –
A fundamental question: E.T. Jaynes (Probability Theory: The Logic of Science, Cambridge U.P., 2003) lists a set of desiderata that should apply to any measure of plausibility, and goes on to show that logically these desiderata lead to the standard (but not frequentist or objective) measure of probability. Jaynes is arguing strongly for a Bayesian approach, and by this point in the book has introduced a hypothetical robot that discovers the plausibility of a hypothesis based on available data.
I think you guys are on the verge of something, but we have to work out what it is. Partly because of just mulling over this whole subject for a long time, my brain has started involuntarily putting ideas together. This morning I awoke with some amazing series of thoughts, of which I hope a glimmer will still remain now that I’m back from dreamland. See what you make of this.
Thanks for this. It’s really good to have someone come at it from a
different viewpoint. It’s a useful disturbance.
First, the Bayesian approach. The probability of A is the probability of B given C. Isn’t this just extending two-variable probabilities to three dimensions? Considering A, B, and C together, we have eight possible combinations of true and false, each one of which can be written as a Bayesian conditional probability – can’t it?
Yes and no. I think I covered both in Episode 3 [Martin Taylor
2009.01.06.09.56], so I’ll be brief here.
First “No”. C, as a conditional, has no degree of belief. For the
purpose of figuring out whether B follows from A, C is just one of
possibly many conceivable background conditions.
You are interested in astrophysics, so I will use as an example the
calculations cosmologists do when they figure out what the Universe
would be like if one of the fundamental constants had a different value
from the value they measure. Nobody believes there is any possibility
that the constant might take on or have had in the past a different
value. It’s a conditional for the rest of the computation. The same is
true for conditionals in a Bayesian analysis.
Going back to Sherlock Holmes: Imagine Sherlock finding a footprint
outside a window. He has two hypotheses: (1) the print was made
innocently the previous day, (2) it was the print of an intruder in the
early morning. Conditional (a): it rained heavily in the late evening
until midnight – result, hypothesis 2 is much the likeliest;
Conditional (b): last night was dry – result, the two hypotheses are
indistinguishable; Conditional (c): there was light drizzle through the
night – result, hypothesis 2 is more likely than hypothesis 1, but
hypothesis 1 is still reasonably possible.
Now “Yes”. Sherlock doesn’t know how much rain there was and when it
fell. He observes other evidence of rain such as wet leaves (which
might be dew) or asks a local. Perhaps he makes a test footprint and
compares its crispness with that of the questionable footprint. These
are evidence that allows him to test hypotheses about C. At that point,
Sherlock does have a two-dimensional set of hypotheses {H1&Ca;
H1&Cb; H1&Cc; H2&Ca; H2&Cb; H2&Cc}.
To assess probabilities associated with those joint hypotheses Sherlock
has a whole lifetime of conditionals in which he believes, or at least
is willing to take as given for the purposes of this investigation. For
example consider datum “wet leaves”: Hypothesis Cb is rather
discredited, while Hypotheses Ca and Cc remain viable. Add datum
“puddles”, and Hypothesis Ca becomes more likely than Cc. Add datum
“Locals say it didn’t rain much overnight”, and Cc regains some
credibility relative to Ca. These, then, can be used to influence the
relative probability of H1 and H2, since P(H&C) = P(H|C)P(C).
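Martin’s updating of belief in the conditionals can be sketched numerically. This is only a minimal illustration of the logic, not anything Sherlock (or Jaynes’s robot) literally computes; all the priors and likelihoods below are invented for the example.

```python
# Hedged sketch: update belief in conditionals Ca, Cb, Cc as data arrive,
# via Bayes' rule P(C|D) proportional to P(D|C) * P(C), then combine with
# the hypotheses using P(H&C) = P(H|C) * P(C). All numbers are invented.

def update(belief, likelihood):
    """One Bayesian step: multiply by the likelihood and renormalize."""
    posterior = {c: belief[c] * likelihood[c] for c in belief}
    z = sum(posterior.values())
    return {c: p / z for c, p in posterior.items()}

belief = {"Ca": 1 / 3, "Cb": 1 / 3, "Cc": 1 / 3}             # no evidence yet

belief = update(belief, {"Ca": 0.9, "Cb": 0.1, "Cc": 0.7})   # datum: wet leaves
belief = update(belief, {"Ca": 0.8, "Cb": 0.05, "Cc": 0.3})  # datum: puddles
belief = update(belief, {"Ca": 0.3, "Cb": 0.4, "Cc": 0.8})   # datum: locals' report

# P(H|C) under each conditional (also invented, following the text:
# heavy rain favours H2; a dry night makes them indistinguishable)
p_h_given_c = {"Ca": {"H1": 0.1, "H2": 0.9},
               "Cb": {"H1": 0.5, "H2": 0.5},
               "Cc": {"H1": 0.35, "H2": 0.65}}

# Marginal credibility of each hypothesis: P(H) = sum over C of P(H|C) P(C)
p_h = {h: sum(p_h_given_c[c][h] * belief[c] for c in belief)
       for h in ("H1", "H2")}
```

Each datum shifts the belief the way the text describes: “wet leaves” discredits Cb, “puddles” favours Ca over Cc, and the locals’ report lets Cc regain some ground relative to Ca.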
Ah, but now Sherlock thinks of a new Hypothesis (3): Someone in the
house is faking the presence of an intruder by making the footprint.
With conditionals Ca, Cb, and Cc, H3 cannot be distinguished from H1 or
H2, but under a new conditional Cd “Somebody sprayed water in the area
to make it look as though the footprint was made in the early morning”,
H3 would become appreciably more credible than either H1 or H2.
So now Sherlock looks for evidence as to whether the “wet leaves” and
“puddles” data affect the credibility of the conditionals, making them
into hypotheses. He notes that leaves outside this area are not wet,
and there are few puddles elsewhere. His mental models of rain make
that situation unlikely if Ca or Cc were true, but those data are
consistent with Cb & Cd. Now he can go back to the original
question, and finds P(H3|Cb,Cd) substantially exceeds P(H1|Cb,Cd) and
P(H2|Cb,Cd), whereas P(H2|Ca) >> P(H1 or H3|Ca) and P(H1|Cc) ~=
P(H2|Cc) ~= P(H3|Cc). It becomes most important to Sherlock to discover
which of the conditionals is most likely to be true.
The point of “Yes and No” is that when a conditional is used AS a
conditional, it is taken to be absolutely true. If it is true, it leads
to certain conclusions about the hypotheses given the data. If it is
not true, then all bets are off about the hypotheses, just as in any
other “If A then B” statement.
When there is a question about whether some conditional C is true, then
that has to be taken into account in determining how strongly to
believe one hypothesis or another, given only the other background
conditionals. If there is a question, then it is always possible to
make a new test with the conditional not-C, though the universe of
not-C is often so broad as to make the new test rather indiscriminate.
In Sherlock’s case, for example, if the “wet leaves” were NOT wet with
water, what could they have been wet with, and how would that affect
Sherlock’s conclusions? Quite possibly it would leave all the three
hypotheses equally credible.
So why not generalize? I suppose that’s already been done. Bayesian probability is just the first step toward multidimensional probability. Maybe.
I don’t think it is “multidimensional probability” so much as scalar
probability determined over a multidimensional space of hypotheses. You
can, of course, have a vector of probabilities, such as “P(rain this
afternoon, P(rain this evening), P(rain tonight), P(rain tomorrow
morning)…” but that’s not quite the same thing as a multidimensional
probability in the sense you suggest.
The main idea this morning was very simple: degrees of knowledge. What is probability? Mike, you can perhaps put this in terms of kinds of probability (apostolic? No, that’s not it), but to me it’s simply a question of what we think we know by various means.
Isn’t that what all probability is?
When we know nothing, we simply have faith: we decide what we want to believe, and make it true. We believe it as long as doing so is pleasing to us, or more pleasing than believing it’s not true… I’d call this the lowest degree of knowledge. At least it’s based on something that matters to us: intrinsic error, or lack of it. It’s not completely arbitrary.
The next degree of knowledge is the bare detection of a regularity.
Something happens. Something happens. Something happens – ah, the
same thing happens again.
Here is a BIG issue. What is “the same thing”?
“The same thing” is a categoric perception, is it not? When you have a
category-level perception, you can’t say much about the values of the
lower-level perceptions that contribute to the category, except that
they fall within whatever ranges are appropriate for the category. For
category “bird”, you can’t even say that “ability to fly” is a part of
the perception, for example – or maybe you can, for your personal
perception of the category “bird”. What is “the same thing” to you may
well not be “the same thing” to me. For each of us two instances of
“the same thing” at the category level are not necessarily the same at
the lower levels of perception.
But suppose it were (asserting a conditional): then you are beginning
to arrive at the development of statistical evidence for a probability
estimate. If you have a model that suggests how this “same thing” comes
to occur, that also is evidence (non-statistical, most probably) toward
your probability estimate.
Now there is a single thing that happens more than once. There it is again. “It” has happened again. It is happening, again and again. Now it has repetition, duration through time. And now it has stopped: I am remembering it, but no longer sensing it.
And now there it is again. It is happening. And now it has stopped again.
It is not happening. And now it is happening. Not happening. Happening.
Now it is happening periodically: the happenings are happening in groups and the groups become a series of “its” and something new appears: the alternation through time, the frequency of occurrence.
A new category-level perception.
Then something else happens, different from the first thing. Sometimes B happens, sometimes A happens. But if A stops happening, we discover eventually, B stops happening, too. And when A starts up again, so does B. It is not the case, we decide, that A is happening and B is not happening, and if B is happening, A must have happened first. A is causing B.
A logic-level (program level?) perception – the beginning of a model.
So knowledge gradually develops. At this point, A and B are simply different. They have no meaning other than themselves, and other than the fact that each one is not the other one.
We can now start to do statistics. Statistics is not about meanings; it is about occurrences of anything. It doesn’t matter WHAT is occurring; all that matters is THAT it occurred, and that it-1 is different from it-2. No other relationship between the it-s matters. There is no question of “plausibility.”
True. “Plausibility” enters here only (so far as I can see) in whether
it-1 did occur when the lower-level perceptions are near the edges of
the ranges that are appropriate for the category it-1. Once you have
perceived it-1, you have perceived it, and not it-2. The decision has
been made.
We can count the total number of its and also the number of times each it occurs. Whatever the ratio of the two numbers is, we can say we expect future occurrences to be in the same ratio, according to the latest observations. If they occur in the ratio of 101 to 303, we can say we expect occurrence A to happen 1/3 as often as occurrence B, or for more occurrences, we expect them to occur in the ratios nA:nB:nC and so on. Without any further kind of distinction, that is all we can say; the occurrences will happen in that ratio, but we can’t say which will happen next because no concept of a repeatable temporal sequence has yet been invented.
Yes, with the caveat that we would expect them to occur in something
near the same proportion in the long run, not exactly the same
proportion, and that we might be prepared to put a 3 to 1 bet on the
next event being A, both being expressions of subjective probability,
based in this case only on statistical evidence.
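The ratio-counting idea in the 101-to-303 example is simple enough to make concrete. A sketch, with the counts taken from the text:

```python
from collections import Counter

# Expectation from bare counts: the occurrences are meaningless tokens;
# only THAT they occurred matters, so the observed ratio is the whole
# estimate of what comes next.
observed = ["A"] * 101 + ["B"] * 303
counts = Counter(observed)
total = sum(counts.values())

p_a = counts["A"] / total   # 101/404
p_b = counts["B"] / total   # 303/404
ratio = p_a / p_b           # A expected to happen 1/3 as often as B
```

This is also the 3-to-1 bet: odds of p_b to p_a on the next event being B, with Martin’s caveat that the long-run proportion is expected to be near, not exactly at, this ratio.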
Next, we observe that A and B occur as follows:
ABBAABAABBBAABAABBBBBAAABABBAAA.
That doesn’t provide any new knowledge, because we still can’t say what
will happen next, A or B. But we may come to realize that this is
happening:
ABBAABAABBBAABAABBBBBAAABABBAAA.
ABBAABAABBBAABAABBBBBAAABABBAAA.
ABBAABAABBBABAAABBBBBAAABABBAAA.
ABBAABAABBBAABAABBBBBAAABABBAAA.
ABBAABAABBBAABAABBBBBAAABABBAAA.
Now we are back to the start: something happens. Something happens.
Something happens. Eventually, the same thing is happening again and
again. Now it’s a bigger thing, but it’s still a thing.
It’s another category, with variation in the perceptions that
contribute to it, themselves being categories. This is a sequence
category! Maybe, to me, or to Sherlock, the displaced B in the third
one is a critical clue to this being a different thing, not the same
thing as the others. Why did Mr Arbuthnot catch the 7:30 that morning
rather than the usual 8:10?
Notice, however, that this thing has parts, and the parts are the same in every occurrence. If we see ABBAAB we now know what the rest of this thing will be.
No we don’t. It’s another “White Swan”. You know what the rest will be
if it actually is another of “the same thing”, and if “the same thing”
category is tightly defined to allow only exact repetition. What you
may perceive is another instance of “the same thing coming up”. In your
example, you don’t know whether what follows is going to be AABBBAAB or
AABBBABA, even if you have already perceived the category
ABBAABAABBBAABAABBBBBAAABABBAAA.
No statistics are involved. This thing occurs, or it does not, and there is nothing in between.
The latter, agreed, is true of categories. One perceives that an
instance of a category occurs, or not. But this does not follow from,
or lead to “No statistics are involved”. In a “White Swan” situation
where the perceptions are sequences, the more often “the same” start
has led to “the same” continuation, the more reasonable is the
development of the category perception of the whole once the start has
been perceived (at lower perceptual levels). You certainly wouldn’t
perceive that category the first time that sequence had occurred. The
second time, if the start were distinct enough from other patterns you
have seen, you might say “I’ve seen this before”, much as one does the
second time one hears a new piece of music. A category perception is a
decision. In the Bayesian sense, one perceives a category when the
evidence for it exceeds the evidence for a different category and for
“not a category I’ve seen” by a sufficient margin.
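Martin’s “perceive a category when the evidence for it sufficiently exceeds the alternatives” can be sketched as sequential Bayesian evidence accumulation. The candidate categories, the per-symbol error rate, and the decision margin below are all invented for illustration; nothing here claims to model real perceptual input functions.

```python
# Sketch: a category is 'perceived' once its posterior exceeds every
# rival by a fixed odds margin, symbol by symbol.
CATEGORIES = [
    "ABBAABAABBBAABAABBBBBAAABABBAAA",  # the repeated sequence
    "ABBAABAABBBABAAABBBBBAAABABBAAA",  # the displaced-B variant
    "ABBBABAABBBAABAABBBBBAAABABBAAA",  # another sequence starting with "A"
]
P_ERR = 0.05    # assumed chance of misperceiving any one symbol
MARGIN = 10.0   # posterior odds required before 'deciding'

def perceive(symbols):
    """Return (category, n) once one category's posterior odds over the
    runner-up exceed MARGIN after n symbols, else (None, len(symbols))."""
    post = {c: 1.0 / len(CATEGORIES) for c in CATEGORIES}
    for i, sym in enumerate(symbols):
        for c in CATEGORIES:
            match = i < len(c) and c[i] == sym
            post[c] *= (1.0 - P_ERR) if match else P_ERR
        z = sum(post.values())
        post = {c: p / z for c, p in post.items()}
        best, second = sorted(post.values(), reverse=True)[:2]
        if best > MARGIN * second:
            return max(post, key=post.get), i + 1
    return None, len(symbols)
```

The decision waits exactly as long as the alternatives force it to: the first two candidates share an 11-symbol prefix, so no decision can fire until the displaced B is reached. Feeding the detector a string whose first letter is misread (an “H” for an “A”, as comes up later in the thread) still yields the same category, because every candidate is penalized equally by the bad first symbol.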
We now have a logical variable. We have a pattern.
It’s interesting that we have statistics made of logical variables whose states are not statistical. It’s also interesting that we can see repetitions of the same logical variables when they are not, in fact, the same logical variables (see the third repetition above). When we see enough of the pattern, we don’t have to see it all to make up our minds what it is – that is, to cease to feel uncertain about what it is.
True, except for the “states are not statistical”. They are the results
of decisions (category perceptions) based on statistical history.
As soon as we see a pattern, statistics is no longer necessary.
True, the decision has been made when the category has been perceived.
The elements of the pattern tell us which pattern we’re looking at, and we don’t need to see what patterns came before it or after it. If we see N elements of the temporal pattern, we KNOW what element N+1 will be;
There’s no possibility ABBAABAABBBAA might be followed by CZVXY? None at all? Zero?
no probability is calculated and none is needed.
Are you sure? I don’t imagine anything like actual Bayesian analysis is
done during the development of the category perception. Nevertheless,
the same conceptual structure can be applied.
At what point in the sequence did you perceive that the sequence was in fact a member of the category? Was it at “A”? How many sequence categories are in your perceptual dictionary starting with “A”?
Considering “Is category ABBAABAABBBAABAABBBBBAAABABBAAA present”, it
makes a difference if ABBAABAABBBAABAABBBBBAAABABBAAA is the only
category you have known to start with “A” or if you can perceive a few
dozen all of which start with “A”. You can’t usefully have all those
dozens of category detectors telling you at the same time that their
category is present, all with 100% certainty.
So maybe it was at “AB” or, as you said above, at “AABBBAAB”. Why would
the category perceptual input function wait just that long and no
longer before you perceived the existence of the category? Could it
possibly be that it took this long before the probability of that
category sufficiently exceeded the probability of the others and of “no
category I know”?
OK, backing off to the first “A”: assuming you had several different
sequence categories that start with “A”, at least you have eliminated
the categories that don’t start with “A”. Or have you? Sometimes you
might make a mistake with a category. You perceive something as
belonging to one category, and later see it as belonging to another.
Taking this “A” to be a letter rather than a metaphor for an instance,
suppose the letters are handwritten capitals, and sometimes you mistake
an H for an A or vice-versa. Probabilistically, it’s more likely that
you perceive category “A” when the sequence is
ABBAABAABBBAABAABBBBBAAABABBAAA, but sometimes you might see
HBBAABAABBBAABAABBBBBAAABABBAAA instead. On such an occasion, do you,
or do you not, perceive the category that you label
“ABBAABAABBBAABAABBBBBAAABABBAAA”? Probably you do. If there aren’t
many categories that you have developed with the following sequence
BBAABAABBB, you are likely to perceive the sequence category
ABBAABAABBBAABAABBBBBAAABABBAAA despite having perceived the first
letter to be “H”.
It’s all probabilistic. You can’t guarantee that
HBBAABAABBBAABAABBBBBAAABABBAAA isn’t a category with a quite different
meaning than ABBAABAABBBAABAABBBBBAAABABBAAA, even if you do know that
sometimes you see “A” as “H” and what you thought was an “H” was
“really(!)” “A”.
That’s the next level of knowledge. In the above example, note that when we see the repeating sequence of elements, we still don’t know whether the next sequence will be the same one, but within the parts of each sequence, we do know. Or think we do. Now we are using the distinction between A and B to generate meaning: AB is different from BA, whereas AA is not different from AA. If we see A we know that means we are not seeing B.
We come to a new concept, “meaning”. That’s a really slippery one. I’m sure we could have quite a long and inconclusive thread about it.
So statistics is used when we are trying to find regularities in occurrences of elements that are meaningless in themselves and are unrelated to any other elements along any dimension except occurrence. This is why statistical equations do not have to say what real-world variables are indicated by the symbols in the equations. It doesn’t matter what the real-world variables are, because their properties (other than existence) are irrelevant.
True.
This tells us something about information theory. Information theory is cast in terms of probabilities. The amount of information in a message can be calculated, given the possible number of different messages, without knowing what any of the messages means. This tells us that whatever information is in the formal sense, it is not information in the common-language sense; whatever the technical meaning of “message” is, it is not what we normally mean: that is, it is not ABOUT anything.
That is what most people say. It’s something with which I profoundly
disagree, basing my disagreement directly on Shannon. It is too much of
a leap to say that because statistics, including information measures,
can be computed in the absence of meaning that therefore the use of
statistics (even frequentist, “objective” statistics) implies the
absence of meaning. Some men do not wear black hats, therefore a person
wearing a black hat is not a man, I suppose.
We can calculate the information in a string of gibberish just as readily as the information in a poem.
Can you? That’s news to me. A reference would be nice.
That calculation will not reveal the difference in
meaning content.
On the other hand, the meaning content profoundly affects the
difference in information content, as does your background
understanding of the poet’s aesthetic tendencies. So, from the Bayesian
approach, the information does relate closely to the meaning. The
numerical value, of course, does not. That’s just a scalar number, so
it can’t be expected to.
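For what it’s worth, the kind of calculation Bill alludes to is the zero-order Shannon estimate: bits per character from single-letter frequencies, computable for any string whatever its meaning. A sketch (the sample strings are mine):

```python
from collections import Counter
from math import log2

def entropy_per_char(text):
    """Zero-order Shannon entropy estimate, in bits per character,
    from single-character frequencies; blind to meaning and to order."""
    n = len(text)
    return -sum((k / n) * log2(k / n) for k in Counter(text).values())

poem = "shall i compare thee to a summers day"
gibberish = "xq zvj kfw ynp gd htr lbc ms ao eiu"
```

Both calls return a number, and nothing in the number says which string is the poem. Martin’s caveat stands, though: a first-order frequency count ignores exactly the inter-letter and inter-word dependencies that carry most of a text’s structure.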
This suggests that perhaps the choice of the term
information in information theory was unfortunate.
No, it was used because it refers precisely to how much meaning you can
get out of a message ABOUT something. That’s very close to the everyday
use of the term, at least as close as PCT “perception” is to everyday
“perception”.
It is certainly not
concerned with what we normally think of as information, which is
meaning.
Oh, I do love the way you use “certainly” in your messages to signal
assertions of which you seem to be unsure, but that you wish your
audience to think are not to be questioned.
The meaning in any particular circumstance depends on the
perceptual/conceptual structure affected by the message. The quantity
of meaning that could have been passed by the message to a particular
receiver cannot be determined from the message itself. What can be
determined independently of meaning is “channel capacity”, a limit on
how rapidly information about A could reach B through the channel.
There is no reason to assume that a message high in information content has any meaning, nor that a message with only 1 bit of information in it has a minimal amount of meaning (one if by day, two if by night).
That is true. There are four possibilities:
(1) High information content, much meaning
(2) High information content, little meaning
(3) Low information content, much meaning
(4) Low information content, little meaning
You correctly assert that possibilities 2 and 3 cannot be dismissed reasonably. But then neither can 1 and 4. All remain possible. And reasonable.
This brings us to the next level of knowledge, which is knowledge
specific to meanings.
In my way of looking at it, that’s all that “knowledge” can be. But
then, as I said earlier, the concept of “meaning” is abominably
slippery.
I haven’t got so far into this, but perhaps others can carry it on from here. Where we end up at this level is with theories: organized patterns of meanings which purport to explain the temporal sequences and other kinds of relationships among things that we observe. The simplest sort of theory is “What has happened before will happen again,” and the most advanced, perhaps, consists of models that simulate an unobservable reality to produce happenings that can be checked against experience. Theories are partly statistical, and models or simulations are not statistical at all. A model behaves only and exactly as it is organized to behave, and can never do anything different from that unless you specifically include a generator of randomness in it – and then you will still be specifying exactly what it affects.
You can specify precisely what a model will do only if you can also
specify its inputs precisely. If you can specify its inputs, no
considerations of probability or information can be applied to its
behaviour. If, however, it is to behave in an unpredictable world
(simulated or otherwise), then its actions become probabilistic, and
information-theoretic approaches are viable. This applies most
specifically to control systems that have no advance knowledge of
either their reference values or the disturbances to their perceptions.
For example, any limit on the capacity of the channel from the senses to the perceptual input function affects the quality of control for high-bandwidth disturbances.
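That last point can be illustrated with a toy simulation: a single integrating control loop whose perceptual signal passes through a quantizer standing in for a limited-capacity channel. Everything here (the gains, the sinusoidal disturbance, quantization as a stand-in for capacity) is an assumption of the sketch, not a claim about real PCT models.

```python
import math

def rms_error(quant_step, gain=50.0, dt=0.001, duration=5.0):
    """Run a one-level control loop against a 1 Hz sinusoidal disturbance.
    The perception is quantized to quant_step, a crude stand-in for a
    limited-capacity channel from the senses to the input function.
    Returns the RMS value of the controlled quantity (reference = 0)."""
    output, sum_sq, n = 0.0, 0.0, int(duration / dt)
    for i in range(n):
        t = i * dt
        disturbance = math.sin(2.0 * math.pi * t)
        qv = output + disturbance                         # controlled quantity
        perception = round(qv / quant_step) * quant_step  # quantized channel
        error = 0.0 - perception                          # reference is zero
        output += gain * error * dt                       # integrating output
        sum_sq += qv * qv
    return math.sqrt(sum_sq / n)

fine = rms_error(quant_step=0.001)   # high-capacity channel
coarse = rms_error(quant_step=0.5)   # low-capacity channel
```

With the coarse quantizer the loop cannot see errors smaller than a quarter step, so control against the same disturbance degrades, which is the qualitative point about channel capacity.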
I would like to suggest another meaning of “meaning”, which is “a
change in the value of a controlled perception”. This may not seem
reasonable on the face of it, but I think it can be argued. “Meaning”
is, to me, those aspects of the world that influence how your actions
can influence your perceptions. Changes that happen in the world that
have no relation to controlled perceptions may yet be perceived, but do
they have any meaning for you? What is the meaning of sunrise, even a
beautiful one, if it does not affect your actions (meaning that it does
not disturb your controlled perceptions)?
Thanks for the disturbance!
Martin