



[Martin Taylor
2013.01.01.12.40]
As you can see by the date stamp, I started this on New Year's
Day, expecting it to be quite short and to be finished that day.
I was wrong on both counts.
This message is a follow-up to [Martin Taylor 2012.12.08.11.32]
and the precursor of what I hope will be messages more suited to
Richard Kennaway’s wish for a mathematical discussion of
information in control. It is intended as part 1 of 2. Part 1
deals with the concepts and algebra of uncertainty, information,
and mutual information. Part 2 applies the concepts to control.
Since writing [Martin Taylor 2012.12.08.11.32], which showed the
close link between control and measurement, and followed the
information “flow” around the loop for a particular control
system, I have been trying to write a tutorial message that
would offer a mathematically accurate description of the general
case while being intuitively clear about what is going on. Each
of four rewrites from scratch has become long, full of
equations, and hardly intuitive to anyone not well versed in
information theory. So now I am trying to separate the
objectives. This message attempts to make information-theoretic
concepts more intuitive with few equations beyond the basic ones
that provide definitions. It is nevertheless very long, but I
have avoided most equations in an attempt to make the principles
intelligible. I hope it can be easily followed, and that it is
worth reading carefully.
My hope is that introducing these concepts with no reference to
control will make it easier to follow the arguments in later
messages that will apply the concepts and the few equations to
control.
This tutorial message has several sections, as follows:
Basic background
concepts
definitions
uncertainty
comments on "uncertainty"
information
Information as change in uncertainty
mutual information
Time and rate
Uncertainty
Information
Information and history
Some worked out example cases
1. A sequence like "ababababa..."
2. A sequence like "abacacabacababacab...."
3. A sequence in which neighbour is likely to follow
neighbour
4. A perfect adder to which the successive input values are
independent
5. A perfect adder to which the successive input values are
not independent
6. Same as (5) except that there is a gap in the history –
the most recent N input samples are unknown.
[Because I don't know whether subscripts or Greek will come out
in everyone's mail reader, I use "_i" for subscript "i", and
Sum_i for "Sum over i". "p(v)" means "the probability of v being
the case", so p(x_i) means "the probability that X is in state i"
and p_x_i(v) means "the probability that v is the case, given
that X is in state i", which can also be written p(v|x_i).]
------------Basic equations and concepts: definitions-------
1. "Probability" and "Uncertainty"
In information theory analysis, "uncertainty" has a precise
meaning, which overlaps the everyday meaning in much the same
way as the meaning of “perception” in PCT overlaps the everyday
meaning of that word. In everyday language, one is uncertain
about something if one has some idea about its possible states
– for example the weather might be sunny, cloudy, rainy, snowy,
hot, … – but does not know in which of these states the
weather is, was, or will be now or at some other time of
interest. One may be uncertain about which Pope Gregory
instituted the Gregorian calendar, while feeling it more likely
that it was Gregory VII or VIII than that it was Gregory I or
II. One may be uncertain whether tomorrow’s Dow will be up or
down, while believing strongly that one of these is more
probable than that the market will be closed because of some
disaster.
If one knows nothing about the possibilities for the thing in
question (for example, “how high is the gelmorphyry of the
seventh frachillistator at the winter solstice?”) one seldom
thinks of oneself as “uncertain” about it. More probably, one
thinks of oneself as “ignorant” about it. However, “ignorance”
can be considered as one extreme of uncertainty, with
“certainty” as the other extreme. “Ignorance” could be seen as
“infinite uncertainty”, while “certainty” is “zero uncertainty”.
Neither extreme condition is exactly true of anything we
normally encounter, though sometimes we can come pretty close.
Usually, when one is uncertain about something, one has an idea
which states are more likely to be the case and which are
unlikely. For example, next July 1 in Berlin the weather is
unlikely to be snowy, and likely to be hot, and it is fairly
likely to be wet. It is rather unlikely that the Pope is
secretly married. It is very likely that the Apollo program
landed men on the moon and brought them home. One assigns
relative probabilities to the different possible states. These
probabilities may not be numerically exact, or consciously
considered as numbers, but if you were asked if you would take a
5:1 bet on it snowing in Berlin next July 1 (you pay $5 if it
does snow and receive $1 if it doesn’t snow), you would probably
take the bet. If you would take the bet, your probability for it
to snow is below 1/6 (about 0.17), the break-even point for a
5:1 bet.
Whether "uncertainty" is subjective or objective depends on how
you treat probability. If you think of probability as a limit of
the proportion of times one thing happens out of the number of
times it might have happened, then for you, uncertainty is
likely to be an objective measure. I do not accept that approach
to probability, for many reasons on which I will not waste space
in this message. For me, “probability” is a subjective measure,
and is always conditional on some other fact, so that there
really is no absolute probability of X being in state “i”.
Typically, for the kind of things of interest here, when the
frequentist “objective” version is applicable, the two versions
of probability converge to the same value or very close to the
same value. Indeed, subjective probability may be quite strongly
influenced by how often we experience one outcome or another in
what we take to be “the same” conditions. In other cases, such
as coin tossing, we may never have actually counted how many
heads or tails have come up in our experience, but the shape of
the coin and the mechanics of tossing make it unlikely that
either side will predominate, so our subjective probability is
likely to be close to 0.5 for both “heads” and “tails”. As
always, there are conditionals: the coin is fair, the coin
tosser is not skilled in making it fall the way she wants, it
does not land and stay on edge, and so forth.
In order to understand the mathematics of information and
uncertainty, there is no need to go into the deep philosophical
issues that underlie the concept of probability. All that is
necessary is that however probabilities are determined, if you
have a set of mutually exclusive states of something, and those
states together cover all the possible states, their
probabilities must sum to 1.0 exactly. Sometimes inclusivity is
achieved by including a state labelled “something else” or
“miscellaneous”, but this can be accommodated. If the states
form a continuum, as they would if you were to measure the
height of a person, probabilities are replaced by probability
density and the sum by an integral. The integral over all
possibilities must still sum to 1.0 exactly.
-------definitions-----
First, a couple of notations for the equations I do have to use:
I use underbar to mean subscript, as in x_i for x subscript i.
p(x_i) by itself means "Given what we know of X, this is the
probability for us that X is (was, will be) in state i" or, if
you are a frequentist, “Over the long term, the proportion of
the times that X will be (was) in state i is p(x_i)”.
I mostly use capital letters (X) to represent systems of
possible states and the corresponding lower-case letters to
represent individual states (x for a particular but unspecified
state, x_i for the i’th state).
p(x_i|A) means "If A is true, then this is our probability that
X is in state i" or "Over the long term, when A is true, the
proportion of the time that X will be in state i is p(x_i|A)". In
typeset text, sometimes this is notated in a more awkward way,
using A as a subscript, which I would have to write as p_A(x_i).
I will not make any significant distinction between the
subjective and objective approaches to probability, but will
often use one or the other form of wording interchangeably (such
as “if we know the state of X, then …”, which you can equally
read as “If X is in a specific state, then …”). Depending on
which approach you like, “probability”, and hence “uncertainty”,
is either subjective or objective. It is the reader’s choice.
I use three main capital letter symbols apart from the Xs and Ys
that represent different systems. U( ) means “Uncertainty”, I( )
means Information, and M( : ) means mutual information between
whatever is symbolized on the two sides of the colon mark. The
parentheses should make it clear when, say, “U” means a system U
with states u_i and when it means “Uncertainty”.
-----definition of "uncertainty"----
If there is a system X, which could be a variable, a collection
of possibilities, or the like, that can be in distinguishable
states x_i, with probabilities p(x_i), the uncertainty of X is
_defined_ by two formulae, one applicable to discrete
collections or variables:
U(X) = - Sum_i (p(x_i) log(p(x_i)))
the other for the case in which "i" is a continuous variable. If
“i” is continuous, p(x_i) becomes a probability density and the
sum becomes an integral
U(X) = - Integral_i (p(x_i) log(p(x_i)) di).
That's it. No more, no less. Those two equations define
“uncertainty” in the discrete and the continuous case.
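[If you like to check such things by computer, here is a minimal
sketch of the discrete formula in Python. The distributions are
arbitrary illustrations of mine, not part of the argument.]

    import math

    def uncertainty(probs):
        # U(X) = -Sum_i (p(x_i) * log2(p(x_i))), in bits
        assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1.0"
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(uncertainty([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: four equiprobable states
    print(uncertainty([1.0]))                     # 0.0 bits: complete certainty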
(Aside: Von Neumann persuaded Shannon to stop calling U(X)
“uncertainty” and to call it “Entropy” because the equations are
the same. This renaming was unfortunate, as it has caused
confusion between the concepts of uncertainty and entropy ever
since.)
If X is a continuous variable, the measure of uncertainty
depends on the choice of unit for "i". If, for example, X has a
Gaussian distribution with standard deviation s, and we write H
for log(sqrt(2*pi*e)), the uncertainty of X is H + log(s). The
standard deviation is measured in whatever units are convenient,
so the absolute measure of uncertainty is undetermined. But
whatever the unit of measure, if the standard deviation is
doubled, the uncertainty is increased by 1 bit. Since we usually
are concerned not with the absolute uncertainty of a system, but
with changes in uncertainty as a consequence of some observation
or event, the calculations give well-defined results regardless
of the unit of measure. When the absolute uncertainty is of
interest, one convenient unit of measure is the resolution limit
of the measuring or observing instrument, since this determines
how much information can be obtained by measurement or
observation.
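[A quick numerical check of that doubling claim, in the same
vein; the unit of s is arbitrary, as the text says:]

    import math

    def gaussian_uncertainty(s):
        # H + log(s), with H = log2(sqrt(2*pi*e))
        return math.log2(math.sqrt(2 * math.pi * math.e)) + math.log2(s)

    # doubling the standard deviation adds exactly 1 bit, whatever the unit
    print(gaussian_uncertainty(2.0) - gaussian_uncertainty(1.0))  # 1.0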
There are a few other technical issues with the uncertainty of
continuous variables, but calculations using uncertainty work
out pretty much the same whether the variable is discrete or
continuous. Unless it matters, in what follows I will not
distinguish between discrete and continuous variables.
--------comments on "uncertainty"-----
Why is -Sum(p*log(p)) a natural measure of "uncertainty"?
Shannon, in the 1948 paper that introduced information
theory to the wider world, proved that this is the only function
(apart from a constant factor) that satisfies three natural criteria:
(1) When the probabilities vary continuously, the uncertainty
should also vary continuously.
(2) The uncertainty should be a maximum when all the
possibilities are equally probable.
(3) If the set is partitioned into two subsets, the uncertainty
of the whole should be the uncertainty of which subset is the
case, plus the weighted sum of the uncertainties within the two
parts, the weights being the total probabilities of the
individual parts. Saying this in the form of an equation, if X
has possibilities x_1, x_2, … x_n and they are collected into
two sets X1 and X2, then
U(X) = U(partition) + p(X1)*U(X1) + p(X2)*U(X2),
where p(Xk) is the sum of the probabilities of the possibilities
collected into set Xk, U(Xk) is the uncertainty within set Xk,
and U(partition) = -p(X1)*log(p(X1)) - p(X2)*log(p(X2)).
One important consequence of the defining equation U(X) =
-Sum_i(p(x_i)*log(p(x_i))) is that the uncertainty U(X) is never
negative, and ranges between zero (when one of the p(x_i) is 1.0
and the rest are zero) and log(N) (when there are N equally
probable states of X). The logarithm can be to any base, but
base 2 is conventionally used. When base 2 is used, the
uncertainty is measured in “bits”. The reason for using base 2
is convenience. If you can determine the actual state through a
series of yes-no questions to which every answer is equally
likely to be yes or no, the uncertainty in bits is given by the
number of questions.
I will use the term "global uncertainty of X" for the case when
nothing is assumed or known about X other than its long-term
probability distribution over its possible states. This is what
is meant by “U(X)” in what follows, unless some other
conditional is specified in words.
In certain contexts, uncertainty is a conserved quantity. For
example, if X and Y are independent, the joint uncertainty of X
and Y, symbolised as U(X,Y), is
U(X,Y) = U(X) + U(Y)
where U(X,Y) is the uncertainty obtained from the defining
equation if each possible combination of X state and Y state
(x_i, y_j) is taken to be a state of the combined system. The
concept can be extended to any number of independent variables:
U(X, Y, Z, W....) = U(X) + U(Y) + U(Z) + U(W) +...
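[A sketch of the additivity, again with made-up distributions:]

    import math

    def uncertainty(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    px = [0.5, 0.3, 0.2]
    py = [0.25, 0.75]
    pxy = [p * q for p in px for q in py]  # independence: p(x_i,y_j) = p(x_i)*p(y_j)
    print(uncertainty(pxy))                   # equals ...
    print(uncertainty(px) + uncertainty(py))  # ... U(X) + U(Y)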
Using the subjective approach to probability, "independent"
means “independent given what is so far known”. It is not clear
how the term can legitimately be used in a frequentist sense,
because any finite sample will show some apparent relationships
among the systems, even if an infinite set of hypothetical
trials would theoretically reduce the apparent relationships to
zero. A frequentist might say that there is no mechanism that
connects the two systems, which means they must be independent.
However, to take that into account is to concede the
subjectivist position, since the frequentist is saying that *as
far as he knows*, there is no mechanism. Even if
there was a mechanism, to consider it would lead the frequentist
into the subjectivist heresy.
Uncertainty is always "about" something. It involves the
probabilities of the different states of some system X, and
probabilities mean nothing from inside the system in question.
They are about X as seen from some other place or time. To say
this is equivalent to saying that all probabilities are
conditional. They depend on some kind of prior assumption, and
taking a viewpoint is one such assumption.
Consider this example, to which I will return later: Joe half
expects that Bob will want a meeting, but does not know when or
where that is to be if it happens at all. What is Joe’s
uncertainty about the meeting?
Joe thinks it 50-50 that Bob will want a meeting. p(meeting) =
0.5
If there is to be a meeting, Joe thinks it might be at Joe's
office or at Bob’s office, again with a 50-50 chance.
If there is to be a meeting, Joe expects it to be scheduled for
some exact hour, 10, 11, 2, or 3.
For Joe, these considerations lead to several mutually exclusive
possibilities:
No meeting, p=0.5
Bob's office, p = p(meeting)*p(Bob's office | meeting) = 0.5*0.5
= 0.25
Joe's office p = p(meeting)*p(Joe's office | meeting) = 0.5*0.5
= 0.25
Joe has 1 bit of uncertainty about whether there is to be a
meeting, and with probability 0.5, another bit about where it is
to be. Overall, his uncertainty about whether and where the
meeting is to be is
U(if, where) = -0.5*log(0.5) - 2*(0.25*log(0.25))
             = -0.5*(-1) - 0.5*(-2) = 1.5 bits
(the first term is the "no meeting" possibility; the other two
are "meeting at Bob's office" and "meeting at Joe's office")
Another way to get the same result is to use a pattern that will
turn up quite often:
U(if, where) = U(if) + U(where|if)
             = U(if) + p(no meeting)*U(where|no meeting)
                     + p(meeting)*U(where|meeting)
             = 1 + 0.5*0 + 0.5*1 = 1.5 bits
Joe is also uncertain about the time of the meeting, if there is
one. It could be at 10, 11, 2, 3 regardless of where the meeting
is to be. That represents 2 bits of uncertainty, but those two
bits apply only if there is to be a meeting.
U(time|meeting) = 2 bits
U(time) = p(meeting)*U(time|meeting) + p(no meeting)*U(time|no
meeting) = 0.5*2 + 0.5*0 = 1 bit.
U(if, where, time) = 2.5 bits
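[The same 2.5 bits comes straight out of the defining formula
applied to the full set of mutually exclusive outcomes; a sketch:]

    import math

    def uncertainty(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # "no meeting" (p = 0.5), plus 2 offices x 4 times,
    # each with p = 0.5 * 0.5 * 0.25 = 0.0625
    outcomes = [0.5] + [0.0625] * 8
    print(uncertainty(outcomes))   # 2.5 bits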
Compare this with Joe's uncertainty about the meeting if he knew
there would be a meeting, but did not know whether it included
Alice. Then we would have three mutually independent things Joe
would be uncertain about:
U(Alice) = 1 bit
U(office) = 1 bit
U(time) = 2 bits
Joe's uncertainty about the meeting would be 4 bits.
Notice what happened here. We added a dimension of possible
variation independent of the dimensions we had already, but
eliminated the dimension of whether there would be a meeting.
That dimension (“if”) interacted with the other two, because if
there was to be no meeting, there could be no uncertainty about
its time and place. Adding an independent dimension of variation
adds to the uncertainty; it does not multiply the uncertainty.
U(meeting_configuration) = U(Alice) + U(office) + U(time)
------
2. "Information" and related concepts
"Information" always means change in uncertainty. It is a
differential quantity, whereas “uncertainty” is an absolute
quantity. I(X) = delta(U(X)).
A change in the uncertainty about X may be because of the
introduction of some fact such as the value of a related
variable, or because of some event that changes the probability
distribution over the states of X. If it is sunny today, it is
more likely to be sunny tomorrow than if today had happened to
be rainy.
To continue the "meeting" example, imagine that Joe received a
call from Bob saying that the meeting was on and in his office,
but not specifying the time. Joe’s uncertainty would be 2 bits
after the phone call, but was 2.5 bits before the call. He would
have gained 0.5 bit of information about the meeting. If the
call had said the meeting was off, Joe would have no further
uncertainty about it, and would have gained 2.5 bits. His
expected gain from a call about whether the meeting was off or
on, and its location if it was on, would be 0.5*0.5 + 0.5*2.5 =
1.5 bits, which is the uncertainty we computed by a different
route above for U(if, where).
Often the "event" is an observation. If you observe a variable
quantity whose value you did not know exactly, your uncertainty
about it is likely (but not guaranteed) to be lower after the
observation than before. Again, thinking of Joe’s meeting, if
the call had simply told Joe the meeting was on, without
specifying the location, Joe’s uncertainty about the meeting
would have increased from 2.5 bits to 3 bits (1 bit for where
the meeting is to be held and 2 bits for when). In that case his
information gain would have been negative. However, averaged
over all the possible messages and taking account of their
probabilities, the information gain from a message or
observation is always zero or positive, even though there may be
individual occasions when it is negative.
The information gained about X from an observation is:
I_observation(X) = U_before-observation(X) -
U_after-observation(X)
---------
Mutual Information
The Mutual information between X and Y is the reduction in the
joint uncertainty of two systems X and Y due to the fact that X
and Y are related, which means that knowledge of the state of
one reduces uncertainty about the state of the other. If the
value of Y is observed, even imprecisely, more is known about X
than before the observation of Y.
When two variables are correlated, fixing the value of one may
reduce the variance of the other. Likewise, we define “mutual
information” as the amount by which uncertainty of one is
reduced by knowledge of the state of the other. If U(X|Y) means
the uncertainty of X when you know the value of Y, averaged over
all the possible values of Y according to their probability
distribution, then the mutual information between X and Y is
M(X:Y) = M(Y:X) = U(X) - U(X|Y) = U(Y) - U(Y|X)
Here is another generic equation for mutual information, using
the reduction in the joint uncertainty of the two systems as
compared to the sum of their uncertainties:
M(X:Y) = U(X) + U(Y) - U(X,Y)
from which it is obvious that M(X:Y) = M(Y:X) and that M(X:Y)
cannot exceed the lesser of the two uncertainties U(X) and U(Y).
If X is a single-valued function of Y (meaning that if you know
the value of Y, you can thereby know the value of X) then M(X:Y)
= U(X).
M(X:Y) is never negative, though there may be specific values
of one of the variables for which the uncertainty of the other
is increased over its average value, as happened when Joe was
told the meeting was on without being told where, when, or who
was attending.
Going back again to Joe's office meeting, suppose Joe knows
there is to be a meeting, but does not know where or when. It
might be at Bob’s office or at Bob’s house. If it is at the
office, it might be at 10 or at 2. If at the house, it might be
at 6 (with a dinner invitation) or at 9 (after dinner).
U(where) = 1 bit
U(when) = 2 bits
U(when) + U(where) = 3 bits
But U(when, where) is not 3 bits, because there are actually
only four equiprobable combinations of place and time, 10 and 2
at the office, 6 and 9 at the house, so U(where, when) = 2 bits.
Therefore M(where:when) = 1 bit.
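[A sketch of that computation from the joint distribution:]

    import math

    def uncertainty(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # the four equiprobable (where, when) combinations
    joint = {("office", 10): 0.25, ("office", 2): 0.25,
             ("house", 6): 0.25, ("house", 9): 0.25}

    def marginal(index):
        m = {}
        for key, p in joint.items():
            m[key[index]] = m.get(key[index], 0.0) + p
        return list(m.values())

    u_where = uncertainty(marginal(0))            # 1 bit
    u_when = uncertainty(marginal(1))             # 2 bits
    u_joint = uncertainty(list(joint.values()))   # 2 bits
    print(u_where + u_when - u_joint)             # M(where:when) = 1 bit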
-----time and rates----
Since "information" is a difference quantity, the concept of
“information rate” becomes useful. It is the rate at which
uncertainty changes as a result of ongoing conditions or events.
Consider I(X|observation), the information gained about X as the
result of an observation. Imagine that a series of observations
are made at regular intervals. If X does not change state over
time, then each successive observation is likely to further
reduce U(X). Because of the definition of “information” as
“change in uncertainty”, the observer gains information at a
rate that is the average reduction in uncertainty per
observation divided by the time between observations.
If X is a discrete system, there is a limit to how much
information can be gained about X by observing it. If the global
uncertainty of X (what you believe about X before observing it)
is U(X), no amount of observation can reduce that uncertainty
below zero – the value of U(X) when you know the state of X
exactly. Accordingly, the information gained by continued
observation of a static quantity plateaus at the global
uncertainty of that variable – the amount that the observer did
not know about it before starting the series of observations.
(Figure 1)
This is one place where the difference between continuous and
discrete variables matters. If the system X has a finite number,
N, of distinguishable states, its uncertainty cannot be greater
than log(N), so log(N) is the maximum amount of information that
could be obtained about it by continued or repeated observation.
But if the states of X form a continuum, it has an infinite
number of states, meaning that in theory there is no limit to
the amount of information about it that could be gained by
continued observation. In practice, of course, the resolution of
the observing system imposes a limit. Precision is never
infinite, even if the limit is imposed by quantum uncertainty.
What about the situation in which X is or may be changing? If a
changing X is observed once, the uncertainty about X becomes a
function of time since the observation. X may change state
continuously or abruptly, slowly or fast. However it may change,
the “information rate” of X is defined as dU(X)/dt provided that
the observation timing is independent of anything related to a
particular state or state change of X.
-----------
[Aside: Here we have another example of the kind of
complementarity between two measures that is at the heart of the
Heisenberg Uncertainty Principle of quantum mechanics. If you
know exactly where something is, you know nothing about its
velocity, and vice versa. The rate at which you get information
from observing something can tell you about both its value and
the rate of change of value, but the more closely you define one
of these, the less you can know about the other. The information
rate determines the size of the "precision*velocity" cell.]
-----------
If an observation is made at time t0 and never thereafter,
immediately after the observation at time t0+e (“e” represents
“epsilon”, a vanishingly small quantity) the uncertainty of X is
U_(t0+e)(X). From that moment, U_t(X) increases over time. The
rate at which U_t(X) increases depends, in the casual
tautological sense, on how fast the value of X changes. If the
variable is a waveform of bandwidth W, this can be made a bit
more precise. The actual rate depends on the spectrum of the
waveform, but for a white noise (the worst case) the rate is
dU/dt = W*log(2*pi*e*N), where N is the noise power (which again
depends on the choice of units).
However, just as continued observation cannot yield more
information than the initial uncertainty of X, so continued
external influences or other cause of variation cannot increase
U_t(X) beyond its global value (its value determined from the
long-term statistics of X).
(Figure 2)
----------
3. Information and history
In this section, we are interested in the mutual information
between the preceding history and the following observation,
which is symbolized M(X:Xhistory). In everyday language, we want
to know how much we can and cannot know about what we will
observe from what we have observed.
Mutual information need not be just between the states of system
X and the states of a different system Y. It can be between the
states of system X at two different times, which we can call t0
and t1. If t0 and t1 are well separated, the state of X at time
t0 tells you nothing about its state at time t1. This would be
the case at any time after Tc in Figure 2. But if t1 follows t0
very closely (i.e. near the left axis in Figure 2), it is very
probable that the state of X has not changed very much, meaning
that the Mutual Information between the two states is large.
They are highly redundant, which is another way of saying that
the mutual information between two things is close to its
maximum possible value.
Suppose X is the weather. If it is sunny at 10 o'clock, it is
likely to be sunny at one second past ten, or one minute past
ten. The probability of it remaining sunny is still pretty high
at 11, and still moderately high at ten tomorrow. But by this
time next week, the observation that it is sunny now will not
tell you very much more about whether it will be sunny then than
will a book about the local climate at this time of year.
As the example suggests, M(X_t:X_(t-tau)) -- the complement of
U_(t-tau)(X|observation at t0) shown in Figure 2-- is a function
of tau that usually decreases with increasing tau. I say
“usually”, because it is not true of periodic or nearly periodic
systems. The temperature today is likely to be nearer the
temperature this time last year than to the temperature at some
randomly chosen date through the year. For now we will ignore
such cyclic variation, and treat M(X_t:X_(t-tau)) as if it
declined with increasing tau, until, at some value of tau, it
reaches zero.
------footnote-------
For a signal of bandwidth W, M(X_t:X_(t-tau)) reaches zero at
tau = 1/2W. No physical signal is absolutely restricted to a
precisely defined bandwidth, but the approximation can be pretty
close in some cases. Good audio sampling is much faster than
1/2W because the physical signals of musical instruments have
appreciable power at frequencies well above the listener's
hearing bandwidth.
-----end footnote-----
Suppose a signal is sampled at regular intervals separated by
tau seconds, with tau small enough that knowledge of the signal
value at one sample reduces the uncertainty of the state of the
next and previous samples. The probability distribution of the
next sample is constrained by the values of the previous
samples. One might think of the value of the sample as being
that of a perception, but that is not necessary.
(Figure 3)
Figure 3 suggests how knowledge of the past values may constrain
the probability distribution of the next sample value when the
inter-sample time, tau, is short compared to the rate of change
of the variable.
The first panel suggests the uncertainty of the next sample if
one knows nothing of the history of the variable. If the
variable is from the set of possible states of X, its
uncertainty is U(X). The second and third panels show how the
uncertainty is reduced if the values of the previous one or two
samples are known. In the middle panel, one possible value of
the previous sample is shown, and in the third panel the values
of the two previous samples are taken into account.
The figure is drawn to show the changes in a slowly changing
waveform, but that is just one possibility. It could equally
easily represent the probabilities in a sequence of letters in a
text. For example, the left panel might represent the
possibility of observing any letter of the alphabet, the middle
the distribution of the next character after observing a “t”,
and the right panel the distribution of the following character
after observing “th”. Since we assume that the statistics of X
are stationary (i.e. do not depend on when you sample them,
provided that choice of the sampling moment is independent of
anything happening in X), M(X0:X1) = M(X1:X2) = M(X2:X3)…
We want to know the uncertainty of X at a given moment t0 given
its entire previous history. If X is a slowly varying waveform,
its uncertainty knowing the previous history right up to t0 is
zero, which is uninteresting. In that case we have to ask about
its history only up to some time t0-tau. We will actually
consider the history to consist of successive samples separated
by tau seconds. In the case of a discrete system such as the
letters of a text, no such restriction is required, but by
making the restriction on the sampling of the continuous
waveform we can treat both kinds of system similarly.
Formally, one can write
U(X0|History) = U(X0) - M(X0:History)
but that is not much use unless one can calculate the mutual
information between the next sample value x0 and the history of
system X. It is possible again to write a general formula, but
its actual calculation depends on the sequential statistics of
X.
The History of X can be written as a sequence of samples
counting back from the present: X1, X2, X3, … and
M(X0:History) can be written M(X0:X1, X2, X3,…)
M(X0:History) = M(X0:X1) + M(X0:X2|X1) + M(X0:X3|X1,X2) +
M(X0:X4|X1,X2,X3) + …
In words, the mutual information between the next sample and the
history can be partitioned into the sum of an infinite series of
mutual information elements, each one being the amount one
sample contributes to the total mutual information, given the
specific values of all the samples between it and the most
recent sample.
To make this concrete, let's consider a few specific examples.
--------
Example 1: X is a sequence of symbols "a" and "b", that
alternate “…ababab…”
If one looks at an arbitrary moment, one has a 0.5 chance of
the next sample being a or b: U(X0) = 1 bit. If one has seen one
sample, which might have been either a or b, there is no
uncertainty about the next sample: U(X0|X1) = 0, which gives
M(X0:X1) = U(X0) = 1 bit.
What then of M(X0:X2|X1)? Remember that M(A:B) cannot exceed the
lesser of U(A) and U(B). In this case, U(X2|X1) = 0 because if
X1 was a, X2 must have been b, and vice versa. So M(X0:X2|X1) = 0
and all subsequent members of the series are also zero.
One can extend this example to any series or waveform that is
known to be strictly periodic with a known period.
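[These quantities can also be estimated by simply counting pairs
in the sequence, using M(X0:X1) = U(X0) + U(X1) - U(X0,X1). A
sketch; the two marginals are equal here, so one count serves
for both:]

    from collections import Counter
    import math

    def uncertainty(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    seq = "ab" * 500                       # "ababab..."
    marg = [c / len(seq) for c in Counter(seq).values()]
    pairs = Counter(zip(seq, seq[1:]))     # (previous, next) pairs
    n = sum(pairs.values())
    joint = [c / n for c in pairs.values()]
    print(2 * uncertainty(marg) - uncertainty(joint))  # ~1 bit: one sample
                                                       # determines the next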
-------
Example 2: X is a sequence of symbols a, b, and c with the
property that a is followed equally probably by b or c, b by c
or a, and c by a or b. No symbol is ever repeated immediately,
but apart from that, each other symbol is equally likely at any
sample moment.
U(X0) = log(3) = 1.58 bits
U(X0|X1) = 1 bit, which gives
M(X0:X1) = 0.58 bits
It is irrelevant how X1 took on its value from X2. If x1 = a, x2
could have equally well been b or c. Therefore
M(X0:X2|X1) = 0
--------
Example 3. X is a sequence of five different symbols labelled 1,
2, 3, 4, 5. Considering 5 and 1 to be neighbours, each symbol
can be followed by itself with probability 0.5 or by either of
its two neighbours with probability 0.25 each. Over the long
term, each symbol appears with equal probability.
U(X0) = log(5) = 2.32 bits
U(X0|X1) = -0.5*log(0.5) - 2*(0.25*log(0.25)) (because the
probability is 0.5 that its predecessor was itself and 0.25 that
it was either of its two neighbours)
= 0.5 + 1 = 1.5 bits
M(X0:X1) = 2.32 - 1.5 = 0.82 bits
M(X0:X2|X1) depends on the relationship between X2 and X0,
unlike the situation in example 2. There are five possibilities
for x0 (the value of X0). The probability of these five depends
on the relationship between x0 and x2. We must consider each of
these five, and partition M(X0:X2|X1) according to the
probabilities for each value of the conditioning sample X1. In a
formula:
M(X0:X2|X1) = Sum_k(p(X1=xk)*M(X0:X2|X1=xk))
Since the probability p(X1=xk) is the same for each value of xk
(all symbols are equally likely over the long term), this sum
comes down to
M(X0:X2|X1) = (1/N)*Sum_k(M(X0:X2|X1=xk))
We consider all these possibilities individually. Since all the
relationships are symmetric (meaning that it doesn’t matter
which value we select as the possible next sample, x0), we need
deal only with the relationships themselves.
The sequence {x0, x1, x2} could be {x0, x0, x0}, {x0, x0,
neighbour}, {x0, neighbour, x0}, {x0, neighbour, neighbour}, or
{x0, neighbour, non-neighbour}. Examples of each might be
{3,3,3}, {3,3,4}, {3,4,3}, {3,4,4}, and {3,4,5}, which would
have probabilities 0.5*0.5 = 0.25, 0.5*2*0.25 = 0.25,
2*0.25*0.25 = 0.125, 2*0.25*0.5 = 0.25, and 2*0.25*0.25 = 0.125
respectively (summing to 1.0).
There is no need to do the actual calculations here. They are
easy but time-consuming. The point is that we can calculate
M(X0:X2), M(X0:X2|X1) and the rest directly from the conditional
probabilities.
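[For those who want to see it done, here is a sketch that
computes M(X0:X1), and the unconditional M(X0:X2), directly from
the transition probabilities:]

    import numpy as np

    def uncertainty(p):
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    N = 5
    P = np.zeros((N, N))            # P[i, j] = p(next = j | current = i)
    for i in range(N):
        P[i, i] = 0.5               # a symbol repeats with probability 0.5
        P[i, (i - 1) % N] = 0.25    # each neighbour with probability 0.25
        P[i, (i + 1) % N] = 0.25    # (the "% N" makes 5 and 1 neighbours)

    u_x0 = np.log2(N)                        # 2.32 bits: uniform long-term distribution
    print(u_x0 - uncertainty(P[0]))          # M(X0:X1) = 2.32 - 1.5 = 0.82 bits
    print(u_x0 - uncertainty((P @ P)[0]))    # M(X0:X2), via the two-step transition matrix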
---------
Example 4: X is built by summing a succession of samples of
another system Y, such that x_j = x_(j-1) + y_j (so x_j is the
sum of all the Y samples up to and including y_j), with x_0 =
y_0, where the y_j are a sequence of positive and negative
integer values, samples from a system Y, that average zero. In
this example, the numbering scheme goes in the direction of
time, later samples being labelled with larger numbers. At the
k'th sample the expression M(X:Xhistory) becomes
M(Xk:Yk|X_(k-1), X_(k-2), ... X_0)
Each sample of Y contributes equally to the value of X, so there
is no need to consider the sequential order of the y values.
Assuming that Y is a series of numerical values and the samples
of Y are independent, the variance of X increases linearly with
the number of samples, so its standard deviation increases
proportionally to the square root of that number.
Let us trace the contribution of the first few samples of Y to
the uncertainty of X.
After one sample, y0, of Y, x0 will equal y0, which means U(X0)
= U(Y0).
M(X0:Y0) = U(X0) + U(Y0) - U(X0,Y0) = U(X0) = U(Y0)
After the next sample, U(X1) is still given by the formula
-Sum(p log p), but the probabilities are no longer the same as they
were before the first sample. There is a wider range of
possibilities, from twice the minimum value of Y to twice the
maximum value of Y, and the probabilities of those possibilities
are not equal. Because the probability distribution over the
possible values of X is not uniform, U(X) < 2*U(Y).
As more and more samples are included in the Sum, the range of X
increases, but values near zero come to have higher and higher
relative probabilities. The distribution of probabilities over
the possible values of X approaches a Gaussian whose variance
increases linearly with k, so its standard deviation grows as
sqrt(k). Since the uncertainty of a Gaussian is H + log(s),
U(X_k) grows as log(sqrt(k)) plus a constant: only half a bit is
added each time k doubles. The ratio U(X_k)/(k*U(Y)) therefore
decreases toward zero with increasing k, the numerator growing
only logarithmically while the denominator grows linearly.
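[The slowing growth can be seen exactly by convolving
distributions. A sketch in which the illustrative Y is +1 or -1,
equiprobable, so U(Y) = 1 bit:]

    import numpy as np

    def uncertainty(p):
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    y = np.array([0.5, 0.5])      # Y: +1 or -1, U(Y) = 1 bit
    x = y.copy()                  # distribution of X_1
    u_prev = uncertainty(x)
    for k in range(2, 65):
        x = np.convolve(x, y)     # distribution of X_k = X_(k-1) + Y_k
        u = uncertainty(x)
        if k in (4, 8, 16, 32, 64):
            # U(X_k) gains about half a bit per doubling of k, so the per-sample
            # increment shrinks, yet U(X_k | X_(k-1)) remains exactly 1 bit
            print(k, round(u, 3), round(u - u_prev, 4))
        u_prev = u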
We want to consider the contribution of the k'th sample of Y.
How much does the kth sample of Y reduce the uncertainty of the
k+1th value of the Sum, as compared to knowing only the History
up to the previous sample? That is the mutual information
between the kth sample of Y and the k+1th value of X.
For large k, U(X_k) - U(X_(k-1)) approaches log(sqrt(k/(k-1))),
an increasingly small number. Yet it remains
true that U(X_k|History) - U(X_k|History, Yk) = U(Y). The
uncertainty of the value of X after the next sample of Y is
always the uncertainty of Y, because if you know the current
value of X, the next value is determined by the value of the Y
sample. How can these two statements be reconciled? The key is
the conditional of knowing the History. Without the prior
History (in this case the Sum), the contribution of each
successive Y sample to the uncertainty of X becomes less and
less, but if you do have the prior History, the value of the
next X sample will have an uncertainty that is the uncertainty
of the Y sample.
The situation is the same if the system X is normalized, meaning
that its variance is independent of the number of prior samples
after a sufficient number of Y values have occurred. The
magnitude of the contribution of the k'th Y sample is reduced in
proportion to sqrt(k), but its contribution to the various
uncertainties is the same, provided that the different possible
values of X are resolved.
--------
Example 5: The same as Example 4 except that the samples of Y
are not independent. If Y is a continuous waveform, the samples
are taken more closely than the Nyquist limit (samples separated
by 1/2W, where W is the bandwidth of the waveform). If Y is a
discrete set of possibilities, successive samples are related by
some kind of grammar. The two kinds of possibility converge when
the discrete set of possibilities is linearly arranged and the
grammar merely makes transitions among nearby values more likely
than across distant values.
The new issue here is that if the history of Y is known, U(Yk)
< U(Y). Therefore M(Xk:Yk|History(X)) < U(Y). Each
successive sample of Y reduces the uncertainty of X less than
would be the case if the samples of Y were independent of each
other. The difference is subsumed by the fact that
M(Xk:History(X)) is greater than it is in the case when the Y
samples are independent, by the same amount.
---------
Example 6: System X is a function of the current and past
samples of system Y. We want to know the uncertainty of sample
xk when the values of successive Y samples are known only up to
sample y_(k-h). For example, in the grammar of example 3, we may
have observed the sequence 1,2,1,5, (samples y1, y2, y3, y4) and
want to know the uncertainty of sample x6, which is a function
of y1, y2, y3, y4, y5, and y6. We do not yet know the values of
y5 or y6. As a real-world example, a trader in the days of
sailing ships might have known the market conditions in China
and England several months before his ship captain actually
bargains for tea in Canton, but the earlier conditions are the
conditions on which he must base his instructions to the
captain.
[Martin Taylor 2012.12.08.11.32]
.... We consider the "basic-standard" control system that has a
pure integrator as its output function.
First, let's treat the effect of one step change in the
disturbance value. We assume that the error is initially zero,
and at time t0 the disturbance suddenly goes from whatever value
it may have had previously to d0. We treat this single step, and
then consider the effects of a series of steps. We start by
showing how the integrator rate parameter controls how fast
information about the disturbance arrives at the output – how
fast the control system measures the disturbance value.
Let us trace the signal values at various places in the loop as
time passes after the step at time t0. At time t0+e (e means
epsilon, an indefinitely small amount), the error value is -d0
and the output value qo is zero. At this early time the output
value is whatever it was before the step, and has an unchanged
influence on the value of the input quantity qi. As time passes,
the output value changes exponentially until after infinite time
it reaches -d0. Assuming qo was initially zero, then at time t:
qo(t) = -d0*(1-e^(-G*(t-t0)))
At the same time, the input value qi approaches zero by the same
exponential:
qi(t) = d0*e^(-G*(t-t0))
Exponential changes in value are quite interesting in
information theory, as they often indicate a constant bit rate
of information transfer. To see this, you must understand that
although the absolute uncertainty about the value of a
continuous variable either before or after a measurement or
observation is always infinity, the information gained through
the measurement or observation is finite. If the range of
uncertainty about the value (such as standard deviation of
measurement) is r0 before the measure and r1 after the measure,
the information gained is log2(r0/r1) bits.
Measurements and observations are never infinitely precise. When
someone says their perception is subjectively noise-free and
precise, they nevertheless would be unable to tell by eye if an
object changed location by a nanometer or declined in brightness
by one part in a billion. There is always some limit to the
resolution of any observation, so the issue of there being an
infinite uncertainty about the value of a continuous variable is
never a problem. In any case in which a problem of infinite
uncertainty might seem to be presenting itself, the solution is
usually to look at the problem from a different angle, but if
that is not possible it is always possible to presume a
resolution limit, a limit that might be several orders of
magnitude more precise than any physically realizable limit, but
a limit nevertheless.
(If you want to get a more accurate statement, and see the
derivation of all this, I suggest you get Shannon's book from
Amazon at the link I provided earlier.)
In the above example, qo could have started at any value
whatever, and would still have approached -d0 exponentially with
the same time constant. The output value qo has the character of
a continuous measurement of the value of the disturbance, and
this measurement gains precision at a rate given by
e^(-G*(t-t0)). (The units of G are 1/seconds).
What is this rate of gain in precision, in bits/sec? One bit of
information has been gained when the uncertainty about the value
of d0 has been halved. In other words, the output has gained one
bit of information about the disturbance by the time tx such
that
e^(-G*(tx-t0)) = 0.5
Taking logarithms (base 2) of both sides, we have
log2(e)*(-G*(tx-t0)) = -1
tx-t0 = 1/(G*log2(e)) = 1/(G*1.443…) seconds
That is the time it takes for qo to halve its distance to its
final value -d0 no matter what its starting value might have
been, which means it has gained one bit of information about the
value of the disturbance. The bit rate is therefore G*1.443…
bits/sec. That is the rate at which the output quantity qo gains
information about the disturbance, and also the rate at which
the input quantity loses information about the disturbance. The
input quantity must lose information about the disturbance,
because it always approaches the reference value, no matter what
the value of the disturbance step.
I suppose you could look upon G*1.443 as the bit rate of the
loop considered as a communication channel, but it is probably
not helpful to do so, as the communication is from the input
quantity back to itself. Without the complete loop, qo does not
act as a measure of d0, but continues increasing (negatively)
linearly without limit at a rate G*d0. It is better just to
think of G*1.443 as the rate at which information from the
disturbance appears at the output quantity.
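[For anyone who wants to watch this happen, here is a minimal
simulation sketch. It assumes a zero reference, a unity
environmental feedback path, and arbitrary illustrative values
for G, d0 and the time step:]

    import numpy as np

    G, dt, d0 = 10.0, 1e-4, 3.7      # integration rate, time step, step disturbance
    qo, gaps = 0.0, []
    for k in range(5000):            # 0.5 s of simulated time after the step
        qi = d0 + qo                 # input quantity: disturbance plus output
        err = 0.0 - qi               # reference is zero
        qo += G * err * dt           # pure integrator output function
        gaps.append(abs(-d0 - qo))   # remaining "range of uncertainty" about d0

    bits = np.log2(gaps[0] / np.array(gaps))   # information gained: log2(r0/r1)
    print(bits[-1] / (len(gaps) * dt))         # ~ G*log2(e) bits/sec
    print(G * np.log2(np.e))                   # the theoretical rate, 14.43 for G = 10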
As a further matter of clarification, I suppose we ought to
mention the behavioural illusion here. If the environmental
feedback path is not a simple connector with a gain of unity,
the output quantity differs from the disturbance value. The
mismatch does not affect the information relationship between
the disturbance and the output quantity, since the environmental
feedback path provides a single-valued environmental feedback
function. It is always possible to compute the influence of the
output on the input quantity if you know the output value q0.
This means that q0 and the influence of the output on the input
quantity are informationally identical, whatever their
functional relationship, providing the environmental feedback
function does not change. If it does change, the effect is
similar to adding a noise source to a communication channel, a
situation that was Shannon’s primary concern and provides no
particular complication to the analysis. (Or in the case of
Marken’s demo that involves switching the sense of the
environmental feedback path, the communication analogy is the
band-switching secret communication procedure patented by the
movie star Hedy Lamarr; in both cases, the switching must be
separately detected or otherwise known).
As a specific example, if the environmental feedback path has a
gain of 2, the output quantity will converge to a value of -d/2.
The convergence to -d/2 is at the same exponential rate as would
have been the case for an environmental feedback path gain of
1.0, providing an information gain rate that is determined only
by the integration rate parameter G. Information from the
disturbance still arrives at the output quantity (and is lost
from the input quantity) at a rate G*1.443…
From now on, we will give the label "T" to the rate at which the
output gains information about the disturbance. T = G*1.443…
bits/sec.
The rate of information gain about the disturbance value at the
output quantity is independent of the disturbance waveform and
of the environmental feedback function. It is a property of the
internal circuitry of the control system. Accordingly, we can
make some estimates about the achievable precision of control
for disturbances that are not simple step functions. First,
however, we should look at the effect of a second and subsequent
step change in the disturbance value.
The disturbance value changed abruptly to d0 at time t0. Now at
time t1 the disturbance value jumps to d1. The input quantity
adds d1-d0 to whatever value it had arrived at by t1. The error
value jumps by the same amount, but the output value qo does not
change immediately, because of the integrator output function.
It starts changing exponentially toward (the negative of) the
new value of the disturbance.
At t1+e, the output quantity has a value that is unknown,
because ever since t0 it has been in the process of approaching
-d0 exponentially from some unknown starting point. The
uncertainty of its current value is T*(t1-t0) bits less than it
had been at time t0, because it was performing a process
functionally equivalent to measuring the value d0. Now it has to
start forgetting the value d0 and start measuring the value d1.
Less colloquially, the mutual information between the output
quantity and the disturbance value is instantly decreased by the
uncertainty of the change in disturbance value, while the mutual
information between the disturbance value and the input value is
increased by the same amount.
By how much are the mutual information values changed by the new
step? Since the absolute uncertainty of a continuous variable is
infinite, can that change be measured? Shannon shows again that
it can. The amount is log2(r_new/r_old) where r_new is the new
range of uncertainty (such as standard deviation) and r_old is
the old range of uncertainty. The new range of uncertainty
depends on the probability distribution of values that can be
taken by the disturbance at that time, which may be unknown. It
is not a new unknown, however, since the original disturbance
step came from the same distribution, and the original value of
the output quantity probably came from the same distribution
(modified, of course, by the inverse of the environmental
feedback function) if the system had been controlling prior to
the first step that we analyzed.
After the second step change in the disturbance value, the
output quantity is gaining information, still about the
disturbance value, but not about d0. Now it is gaining
information about d1 and losing information about d0. At the
same time, the input quantity is losing information about d1, as
well as about any remanent information about d0 that might still
have been there at time t1. All this is happening at a rate of T
= G*1.443… bits/sec.
The effects of third and subsequent steps in the value of the
disturbance are the same. The output continues to gain
information about the most recent value of the disturbance at T
bits/sec while losing information about prior values at the same
rate, and while the input also loses information about the new
and old values at T bits/sec. Making the steps smaller and
closer in time, we can approximate a smooth waveform with any
desired degree of precision. No matter what the disturbance
waveform, the output is still getting information about its
value at T bits/sec, and losing information about its history at
the same rate.
So long as the disturbance variation generates uncertainty at
less than T bits/sec, control is possible, but if it generates
more than T bits/sec, control is lost. The spike in the
input value at a step change in the disturbance is an example of
that, where control is completely lost at the moment of the
step, and gradually regained in the period when the disturbance
remains steady, not generating uncertainty.
This is getting rather too long for a single message, so I will
stop there. I hope it explains just how information gets from
the disturbance to the output, and how fast it does so in the
case of the ideal noiseless control system with no lag and a
pure integrator output function.
Martin
http://www.amazon.com/Mathematical-Theory-Communication-Claude-Shannon/dp/0252725484/ref=sr_1_1?s=books&ie=UTF8&qid=1354736909&sr=1-1&keywords=Shannon%2C+C.+E.
