
[Martin Taylor 2013.01.01.12.40]

As you can see by the date stamp, I started this on New Year's Day, expecting it to be quite short and to be finished that day. I was wrong on both counts.

This message is a follow-up to [Martin Taylor 2012.12.08.11.32] and the precursor of what I hope will be messages more suited to Richard Kennaway's wish for a mathematical discussion of information in control. It is intended as part 1 of 2. Part 1 deals with the concepts and algebra of uncertainty, information, and mutual information. Part 2 applies the concepts to control.

Since writing [Martin Taylor 2012.12.08.11.32], which showed the close link between control and measurement, and followed the information "flow" around the loop for a particular control system, I have been trying to write a tutorial message that would offer a mathematically accurate description of the general case while being intuitively clear about what is going on. Each of four rewrites from scratch has become long, full of equations, and hardly intuitive to anyone not well versed in information theory. So now I am trying to separate the objectives. This message attempts to make information-theoretic concepts more intuitive, with few equations beyond the basic ones that provide definitions. It is nevertheless very long, but I have avoided most equations in an attempt to make the principles intelligible. I hope it can be easily followed, and that it is worth reading carefully.

My hope is that introducing these concepts with no reference to control will make it easier to follow the arguments in later messages that will apply the concepts and the few equations to control.

This tutorial message has several sections, as follows:

Basic background
  concepts
  definitions
  uncertainty
  comments on "uncertainty"
  information
Information as change in uncertainty
  mutual information
Time and rate
  Uncertainty
  Information
Information and history
Some worked out example cases
  1. A sequence like "ababababa..."
  2. A sequence like "abacacabacababacab...."
  3. A sequence in which neighbour is likely to follow neighbour
  4. A perfect adder to which the successive input values are independent
  5. A perfect adder to which the successive input values are not independent
  6. Same as (5) except that there is a gap in the history – the most recent N input samples are unknown.

[Because I don't know whether subscripts or Greek will come out in everyone's mail reader, I use "_i" for subscript "i", and Sum_i for "Sum over i". "p(v)" means "the probability of v being the case", so p(x_i) means "the probability that X is in state i" and p_x_i(v) means "the probability that v is the case, given that X is in state i", which can also be written p(v|x_i).]

------------Basic equations and concepts: definitions-------

1. "Probability" and "Uncertainty"

In information theory analysis, "uncertainty" has a precise meaning, which overlaps the everyday meaning in much the same way as the meaning of "perception" in PCT overlaps the everyday meaning of that word. In everyday language, one is uncertain about something if one has some idea about its possible states -- for example, the weather might be sunny, cloudy, rainy, snowy, hot, ... -- but does not know in which of these states the weather is, was, or will be at the time of interest. One may be uncertain about which Pope Gregory instituted the Gregorian calendar, while feeling it more likely that it was Gregory VII or VIII than that it was Gregory I or II. One may be uncertain whether tomorrow's Dow will be up or down, while believing strongly that either of these is more probable than that the market will be closed because of some disaster.

If one knows nothing about the possibilities for the thing in question (for example, "how high is the gelmorphyry of the seventh frachillistator at the winter solstice?") one seldom thinks of oneself as "uncertain" about it. More probably, one thinks of oneself as "ignorant" about it. However, "ignorance" can be considered as one extreme of uncertainty, with "certainty" as the other extreme. "Ignorance" could be seen as "infinite uncertainty", while "certainty" is "zero uncertainty". Neither extreme condition is exactly true of anything we normally encounter, though sometimes we can come pretty close.

Usually, when one is uncertain about something, one has an idea which states are more likely to be the case and which are unlikely. For example, next July 1 in Berlin the weather is unlikely to be snowy, and likely to be hot, and it is fairly likely to be wet. It is rather unlikely that the Pope is secretly married. It is very likely that the Apollo program landed men on the moon and brought them home. One assigns relative probabilities to the different possible states. These probabilities may not be numerically exact, or consciously considered as numbers, but if you were asked if you would take a 5:1 bet on it snowing in Berlin next July 1 (you pay $5 if it does snow and receive $1 if it doesn't snow), you would probably take the bet. If you would take the bet, your probability for it to snow is below 0.2.

Whether "uncertainty" is subjective or objective depends on how you treat probability. If you think of probability as a limit of the proportion of times one thing happens out of the number of times it might have happened, then for you, uncertainty is likely to be an objective measure. I do not accept that approach to probability, for many reasons on which I will not waste space in this message. For me, "probability" is a subjective measure, and is always conditional on some other fact, so that there really is no absolute probability of X being in state "i".

Typically, for the kind of things of interest here, when the frequentist "objective" version is applicable, the two versions of probability converge to the same value or very close to the same value. Indeed, subjective probability may be quite strongly influenced by how often we experience one outcome or another in what we take to be "the same" conditions. In other cases, such as coin tossing, we may never have actually counted how many heads or tails have come up in our experience, but the shape of the coin and the mechanics of tossing make it unlikely that either side will predominate, so our subjective probability is likely to be close to 0.5 for both "heads" and "tails". As always, there are conditionals: the coin is fair, the coin tosser is not skilled in making it fall the way she wants, it does not land and stay on edge, and so forth.

In order to understand the mathematics of information and uncertainty, there is no need to go into the deep philosophical issues that underlie the concept of probability. All that is necessary is that however probabilities are determined, if you have a set of mutually exclusive states of something, and those states together cover all the possible states, their probabilities must sum to 1.0 exactly. Sometimes inclusivity is achieved by including a state labelled "something else" or "miscellaneous", but this can be accommodated. If the states form a continuum, as they would if you were to measure the height of a person, probabilities are replaced by probability density and the sum by an integral. The integral over all possibilities must still equal 1.0 exactly.

-------definitions-----

First, a couple of notations for the equations I do have to use: I use underbar to mean subscript, as in x_i for x subscript i. p(x_i) by itself means "Given what we know of X, this is the probability for us that X is (was, will be) in state i" or, if you are a frequentist, "Over the long term, the proportion of the time that X will be (was) in state i is p(x_i)".

I mostly use capital letters (X) to represent systems of possible states and the corresponding lower-case letters to represent individual states (x for a particular but unspecified state, x_i for the i'th state).

p(x_i|A) means "If A is true, then this is our probability that X is in state i" or "Over the long term, when A is true, the proportion of the time that X will be in state i is p(x_i|A)". In typeset text this is sometimes notated in a more awkward way, using A as a subscript, which I would have to write as p_A(x_i).

I will not make any significant distinction between the subjective and objective approaches to probability, but will often use one or the other form of wording interchangeably (such as "if we know the state of X, then ...", which you can equally read as "If X is in a specific state, then ..."). Depending on which approach you like, "probability", and hence "uncertainty", is either subjective or objective. It is the reader's choice.

I use three main capital-letter symbols apart from the Xs and Ys that represent different systems. U( ) means "Uncertainty", I( ) means "Information", and M( : ) means mutual information between whatever is symbolized on the two sides of the colon mark. The parentheses should make it clear when, say, "U" means a system U with states u_i and when it means "Uncertainty".

-----definition of "uncertainty"----

If there is a system X, which could be a variable, a collection of possibilities, or the like, that can be in distinguishable states x_i, with probabilities p(x_i), the uncertainty of X is *defined* by two formulae, one applicable to discrete collections or variables:

U(X) = - Sum_i (p(x_i) log(p(x_i)))

the other for the case in which "i" is a continuous variable. If "i" is continuous, p(x_i) becomes a probability density and the sum becomes an integral:

U(X) = - Integral_i (p(x_i) log(p(x_i)) di)

That's it. No more, no less. Those two equations define "uncertainty" in the discrete and the continuous case.
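(As an illustration that is not part of the original message: the discrete formula is short enough to check numerically. A minimal Python sketch follows; the helper name `uncertainty` is mine.)

```python
# A minimal sketch of the discrete definition,
# U(X) = -Sum_i p(x_i)*log2(p(x_i)), in bits (log base 2).
import math

def uncertainty(probs):
    """Uncertainty in bits of a set of mutually exclusive, exhaustive
    probabilities.  Terms with p == 0 contribute nothing to the sum."""
    assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1.0"
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(uncertainty([0.5, 0.5]))    # 1.0 bit: a fair coin
print(uncertainty([1.0]))         # 0.0 bits: certainty
print(uncertainty([0.25] * 4))    # 2.0 bits: log2 of 4 equal states
```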

(Aside: Von Neumann persuaded Shannon to stop calling U(X) "uncertainty" and to call it "entropy", because the equations are the same. This renaming was unfortunate, as it has caused confusion between the concepts of uncertainty and entropy ever since.)

If X is a continuous variable, the measure of uncertainty depends on the choice of unit for "i". If, for example, X has a Gaussian distribution with standard deviation s, and we write H for log(sqrt(2*PI*e)), the uncertainty of X is H + log(s). The standard deviation is measured in whatever units are convenient, so the absolute measure of uncertainty is undetermined. But whatever the unit of measure, if the standard deviation is doubled, the uncertainty is increased by 1 bit. Since we usually are concerned not with the absolute uncertainty of a system, but with changes in uncertainty as a consequence of some observation or event, the calculations give well-defined results regardless of the unit of measure. When the absolute uncertainty is of interest, one convenient unit of measure is the resolution limit of the measuring or observing instrument, since this determines how much information can be obtained by measurement or observation.
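(A quick numerical check of the "doubling s adds 1 bit" claim -- my own sketch, not part of the original. It integrates -p log2 p for a Gaussian and compares with H + log2(s).)

```python
# Numerically integrate the continuous definition for a Gaussian and
# confirm U = log2(sqrt(2*pi*e)) + log2(s), so doubling s adds 1 bit.
import math

def gaussian_uncertainty(s, width=12.0, steps=20001):
    """-Integral p(x) log2 p(x) dx over +-width standard deviations."""
    dx = 2 * width * s / (steps - 1)
    total = 0.0
    for k in range(steps):
        x = -width * s + k * dx
        p = math.exp(-x * x / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
        if p > 0:
            total -= p * math.log2(p) * dx
    return total

H = math.log2(math.sqrt(2 * math.pi * math.e))
print(gaussian_uncertainty(1.0), H)                            # both ~2.047
print(gaussian_uncertainty(2.0) - gaussian_uncertainty(1.0))   # ~1.0 bit
```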

There are a few other technical issues with the uncertainty of continuous variables, but calculations using uncertainty work out pretty much the same whether the variable is discrete or continuous. Unless it matters, in what follows I will not distinguish between discrete and continuous variables.

--------comments on "uncertainty"-----

Why is -Sum(p*log(p)) a natural measure of "uncertainty"? Shannon, in his 1948 monograph that introduced information theory to the wider world, proved that this is the only function (up to a constant factor) that satisfies three natural criteria:

(1) When the probabilities vary continuously, the uncertainty should also vary continuously.

(2) The uncertainty should be a maximum when all the possibilities are equally probable.

(3) If the set is partitioned into two subsets, the uncertainty of the whole should be the uncertainty about which subset contains the actual state, plus the weighted sum of the uncertainties within the two parts, the weights being the total probabilities of the individual parts. Saying this in the form of an equation, if X has possibilities x_1, x_2, ... x_n and they are collected into two sets X1 and X2, then

U(X) = U({X1, X2}) + p(X1)*U(X1) + p(X2)*U(X2)

where p(Xk) is the sum of the probabilities of the possibilities collected into set Xk, U(Xk) is computed from the probabilities renormalized within set Xk, and U({X1, X2}) is the uncertainty of the two-state choice between the subsets.
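(A numeric check of criterion (3) as just stated -- my own sketch; the four-state distribution is arbitrary.)

```python
# Verify the partition property: U(X) = U({X1, X2}) + p1*U(X1) + p2*U(X2),
# with the within-part uncertainties computed from renormalized probabilities.
import math

def U(probs):
    return sum(-p * math.log2(p) for p in probs if p > 0)

x = [0.1, 0.2, 0.3, 0.4]
p1, p2 = 0.1 + 0.2, 0.3 + 0.4          # total probabilities of the two parts
within1 = [0.1 / p1, 0.2 / p1]          # renormalized within part 1
within2 = [0.3 / p2, 0.4 / p2]          # renormalized within part 2

print(U(x))                                              # 1.8464...
print(U([p1, p2]) + p1 * U(within1) + p2 * U(within2))   # the same
```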

One important consequence of the defining equation U(X) = -Sum_i(p(x_i)*log(p(x_i))) is that the uncertainty U(X) is never negative; it ranges between zero (when one of the p(x_i) is 1.0 and the rest are zero) and log(N) (when there are N equally probable states of X). The logarithm can be to any base, but base 2 is conventionally used. When base 2 is used, the uncertainty is measured in "bits". The reason for using base 2 is convenience. If you can determine the actual state through a series of yes-no questions to which every answer is equally likely to be yes or no, the uncertainty in bits is given by the number of questions.

I will use the term "global uncertainty of X" for the case when nothing is assumed or known about X other than its long-term probability distribution over its possible states. This is what is meant by "U(X)" in what follows, unless some other conditional is specified in words.

In certain contexts, uncertainty is a conserved quantity. For example, if X and Y are independent, the joint uncertainty of X and Y, symbolised as U(X,Y), is

U(X,Y) = U(X) + U(Y)

where U(X,Y) is the uncertainty obtained from the defining equation if each possible combination of X state and Y state (x_i, y_j) is taken to be a state of the combined system. The concept can be extended to any number of independent variables:

U(X, Y, Z, W, ...) = U(X) + U(Y) + U(Z) + U(W) + ...
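(A small check of the additivity claim -- my own construction: build the joint distribution as the product of the marginals and compare.)

```python
# For independent systems, p(x_i, y_j) = p(x_i)*p(y_j), and the joint
# uncertainty computed from those combined states equals U(X) + U(Y).
import math

def U(probs):
    return sum(-p * math.log2(p) for p in probs if p > 0)

px = [0.5, 0.25, 0.25]
py = [0.9, 0.1]
joint = [a * b for a in px for b in py]   # the six combined states

print(U(px) + U(py))   # 1.5 + 0.469 = 1.969 bits
print(U(joint))        # the same, to rounding
```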
Using the subjective approach to probability, "independent" means "independent given what is so far known". It is not clear how the term can legitimately be used in a frequentist sense, because any finite sample will show some apparent relationships among the systems, even if an infinite set of hypothetical trials would theoretically reduce the apparent relationships to zero. A frequentist might say that there is no mechanism that connects the two systems, which means they must be independent. However, to take that into account is to concede the subjectivist position, since the frequentist is saying that *as far as he knows*, there is no mechanism. Even if there were a mechanism, to consider it would lead the frequentist into the subjectivist heresy.

Uncertainty is always "about" something. It involves the probabilities of the different states of some system X, and probabilities mean nothing from inside the system in question. They are about X as seen from some other place or time. To say this is equivalent to saying that all probabilities are conditional. They depend on some kind of prior assumption, and taking a viewpoint is one such assumption.

Consider this example, to which I will return later: Joe half expects that Bob will want a meeting, but does not know when or where that is to be, if it happens at all. What is Joe's uncertainty about the meeting?

Joe thinks it 50-50 that Bob will want a meeting: p(meeting) = 0.5.

If there is to be a meeting, Joe thinks it might be at Joe's office or at Bob's office, again with a 50-50 chance.

If there is to be a meeting, Joe expects it to be scheduled for some exact hour: 10, 11, 2, or 3.

For Joe, these considerations lead to several mutually exclusive possibilities:

No meeting: p = 0.5
Bob's office: p = p(meeting)*p(Bob's office | meeting) = 0.5*0.5 = 0.25
Joe's office: p = p(meeting)*p(Joe's office | meeting) = 0.5*0.5 = 0.25

Joe has 1 bit of uncertainty about whether there is to be a meeting, and with probability 0.5, another bit about where it is to be. Overall, his uncertainty about whether and where the meeting is to be is

U(if, where) = -0.5*log(0.5) - 2*(0.25*log(0.25)) = -0.5*(-1) - 0.5*(-2) = 1.5 bits

in which the first term is for "no meeting" and the second for "meeting, at either office".

Another way to get the same result is to use a pattern that will turn up quite often:

U(if, where) = U(if) + U(where|if)
             = U(if) + p(no meeting)*U(where|no meeting) + p(meeting)*U(where|meeting)
             = 1 + 0.5*0 + 0.5*1 = 1.5 bits

Joe is also uncertain about the time of the meeting, if there is one. It could be at 10, 11, 2, or 3, regardless of where the meeting is to be. That represents 2 bits of uncertainty, but those two bits apply only if there is to be a meeting.

U(time|meeting) = 2 bits
U(time) = p(meeting)*U(time|meeting) + p(no meeting)*U(time|no meeting) = 0.5*2 + 0.5*0 = 1 bit

U(if, where, time) = 1.5 + 1 = 2.5 bits
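(The whole example can be checked by listing the nine mutually exclusive end states and applying the defining equation. A sketch of mine, not part of the original.)

```python
# Joe's meeting, by brute force: one "no meeting" state, plus
# 2 offices x 4 times, each with p = 0.5 * 0.5 * 0.25 = 0.0625.
import math

def U(probs):
    return sum(-p * math.log2(p) for p in probs if p > 0)

states = [0.5] + [0.5 * 0.5 * 0.25] * 8

print(U(states))             # 2.5 bits = U(if, where, time)
print(U([0.5, 0.25, 0.25]))  # 1.5 bits = U(if, where)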
Compare this with Joe's uncertainty about the meeting if he knew there would be a meeting, but did not know whether it included Alice. Then we would have three mutually independent things Joe would be uncertain about:

U(Alice) = 1 bit
U(office) = 1 bit
U(time) = 2 bits

Joe's uncertainty about the meeting would be 4 bits.
Notice what happened here. We added a dimension of possible variation independent of the dimensions we had already, but eliminated the dimension of whether there would be a meeting. That dimension ("if") interacted with the other two, because if there was to be no meeting, there could be no uncertainty about its time and place. Adding an independent dimension of variation adds to the uncertainty; it does not multiply the uncertainty.

U(meeting_configuration) = U(Alice) + U(office) + U(time)

------
2. "Information" and related concepts

"Information" always means change in uncertainty. It is a differential quantity, whereas "uncertainty" is an absolute quantity: I(X) = delta(U(X)).

A change in the uncertainty about X may be because of the introduction of some fact such as the value of a related variable, or because of some event that changes the probability distribution over the states of X. If it is sunny today, it is more likely to be sunny tomorrow than if today had happened to be rainy.

To continue the "meeting" example, imagine that Joe received a call from Bob saying that the meeting was on and in his office, but not specifying the time. Joe's uncertainty would be 2 bits after the phone call, but was 2.5 bits before the call. He would have gained 0.5 bit of information about the meeting. If the call had said the meeting was off, Joe would have no further uncertainty about it, and would have gained 2.5 bits. His expected gain from a call about whether the meeting was off or on, and its location if it was on, would be 0.5*0.5 + 0.5*2.5 = 1.5 bits, which is the uncertainty we computed by a different route above for U(if, where).
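(In code form -- mine -- the expected gain is just the probability-weighted average of the gains from the two possible calls:)

```python
# Expected information from Bob's call.  "Off" resolves all 2.5 bits;
# "on, at my office" leaves 2 bits of time uncertainty, supplying 0.5 bit.
p_off, gain_off = 0.5, 2.5
p_on, gain_on = 0.5, 0.5
print(p_off * gain_off + p_on * gain_on)   # 1.5 bits, matching U(if, where)
```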

Often the "event" is an observation. If you observe a variable quantity whose value you did not know exactly, your uncertainty about it is likely (but not guaranteed) to be lower after the observation than before. Again, thinking of Joe's meeting, if the call had simply told Joe the meeting was on, without specifying the location, Joe's uncertainty about the meeting would have increased from 2.5 bits to 3 bits (1 bit for where the meeting is to be held and 2 bits for when). In that case his information gain would have been negative. However, averaged over all the possible messages and taking account of their probabilities, the information gain from a message or observation is always zero or positive, even though there may be individual occasions when it is negative.

The information gained about X from an observation is:

I_observation(X) = U_before-observation(X) - U_after-observation(X)

---------
Mutual Information

The mutual information between X and Y is the reduction in the joint uncertainty of two systems X and Y due to the fact that X and Y are related, which means that knowledge of the state of one reduces uncertainty about the state of the other. If the value of Y is observed, even imprecisely, more is known about X than before the observation of Y.

When two variables are correlated, fixing the value of one may reduce the variance of the other. Likewise, we define "mutual information" as the amount by which the uncertainty of one is reduced by knowledge of the state of the other. If U(X|Y) means the uncertainty of X when you know the value of Y, averaged over all the possible values of Y according to their probability distribution, then the mutual information between X and Y is

M(X:Y) = M(Y:X) = U(X) - U(X|Y) = U(Y) - U(Y|X)

Here is another generic equation for mutual information, using the reduction in the joint uncertainty of the two systems as compared to the sum of their uncertainties:

M(X:Y) = U(X) + U(Y) - U(X,Y)

from which it is obvious that M(X:Y) = M(Y:X) and that M(X:Y) cannot exceed the lesser of the two uncertainties U(X) and U(Y). If X is a single-valued function of Y (meaning that if you know the value of Y, you thereby know the value of X), then M(X:Y) = U(X).

M(X:Y) is never negative, though there may be specific values of one of the variables for which the uncertainty of the other is increased over its average value, as happened when Joe was told the meeting was on without being told where, when, or who was attending.

Going back again to Joe's office meeting, suppose Joe knows there is to be a meeting, but does not know where or when. It might be at Bob's office or at Bob's house. If it is at the office, it might be at 10 or at 2. If at the house, it might be at 6 (with a dinner invitation) or at 9 (after dinner).

U(where) = 1 bit
U(when) = 2 bits
U(when) + U(where) = 3 bits

But U(when, where) is not 3 bits, because there are actually only four equiprobable combinations of place and time: 10 and 2 at the office, 6 and 9 at the house. So U(where, when) = 2 bits, and therefore M(where:when) = 1 bit.
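(Computed from the joint distribution with the second generic equation -- a sketch; the dictionary layout is mine.)

```python
# M(where:when) = U(where) + U(when) - U(where, when) for the four
# equiprobable (where, when) combinations in the example.
import math

def U(probs):
    return sum(-p * math.log2(p) for p in probs if p > 0)

joint = {('office', 10): 0.25, ('office', 2): 0.25,
         ('house', 6): 0.25, ('house', 9): 0.25}

def marginal(index):
    out = {}
    for key, p in joint.items():
        out[key[index]] = out.get(key[index], 0.0) + p
    return list(out.values())

U_where = U(marginal(0))            # 1 bit
U_when = U(marginal(1))             # 2 bits
U_joint = U(list(joint.values()))   # 2 bits
print(U_where + U_when - U_joint)   # M(where:when) = 1 bit
```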

-----time and rates----

Since "information" is a difference quantity, the concept of "information rate" becomes useful. It is the rate at which uncertainty changes as a result of ongoing conditions or events.

Consider I_observation(X), the information gained about X as the result of an observation. Imagine that a series of observations is made at regular intervals. If X does not change state over time, then each successive observation is likely to further reduce U(X). Because of the definition of "information" as "change in uncertainty", the observer gains information at a rate that is the average reduction in uncertainty per observation divided by the time between observations.

If X is a discrete system, there is a limit to how much information can be gained about X by observing it. If the global uncertainty of X (what you believe about X before observing it) is U(X), no amount of observation can reduce that uncertainty below zero -- the value of U(X) when you know the state of X exactly. Accordingly, the information gained by continued observation of a static quantity plateaus at the global uncertainty of that variable -- the amount that the observer did not know about it before starting the series of observations.
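(The toy simulation below is entirely my own construction, with an assumed noisy-observation model: it shows the same plateau, with cumulative information gained by repeated observation of a static 8-state quantity climbing toward, and never exceeding, the global uncertainty log2(8) = 3 bits.)

```python
# Repeatedly observe a fixed 8-state quantity through a noisy channel
# (each observation is correct with probability 0.6, otherwise uniformly
# one of the 7 wrong states), updating beliefs by Bayes' rule.
import math, random

def U(probs):
    return sum(-p * math.log2(p) for p in probs if p > 0)

N, p_correct = 8, 0.6
true_state = 3
prior = [1.0 / N] * N                 # global uncertainty: log2(8) = 3 bits
random.seed(1)

for step in range(1, 21):
    if random.random() < p_correct:
        obs = true_state
    else:
        obs = random.choice([i for i in range(N) if i != true_state])
    likelihood = [p_correct if i == obs else (1 - p_correct) / (N - 1)
                  for i in range(N)]
    post = [l * p for l, p in zip(likelihood, prior)]
    z = sum(post)
    prior = [p / z for p in post]
    # information gained so far: U_before - U_after, bounded by 3 bits
    print(step, round(math.log2(N) - U(prior), 3))
```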

(Figure 1)

This is one place where the difference between continuous and discrete variables matters. If the system X has a finite number, N, of distinguishable states, its uncertainty cannot be greater than log(N), so log(N) is the maximum amount of information that could be obtained about it by continued or repeated observation. But if the states of X form a continuum, it has an infinite number of states, meaning that in theory there is no limit to the amount of information about it that could be gained by continued observation. In practice, of course, the resolution of the observing system imposes a limit. Precision is never infinite, even if the limit is imposed by quantum uncertainty.

What about the situation in which X is or may be changing? If a changing X is observed once, the uncertainty about X becomes a function of time since the observation. X may change state continuously or abruptly, slowly or fast. However it may change, the "information rate" of X is defined as dU(X)/dt, provided that the observation timing is independent of anything related to a particular state or state change of X.

-----------
[Aside: Here we have another example of the kind of complementarity between two measures that is at the heart of the Heisenberg Uncertainty Principle of quantum mechanics. If you know exactly where something is, you know nothing about its velocity, and vice versa. The rate at which you get information from observing something can tell you about both its value and the rate of change of value, but the more closely you define one of these, the less you can know about the other. The information rate determines the size of the "precision*velocity" cell.]

-----------
If an observation is made at time t0 and never thereafter, immediately after the observation at time t0+e ("e" represents "epsilon", a vanishingly small quantity) the uncertainty of X is U_(t0+e)(X). From that moment, U_t(X) increases over time. The rate at which U_t(X) increases depends, in the casual tautological sense, on how fast the value of X changes. If the variable is a waveform of bandwidth W, this can be made a bit more precise. The actual rate depends on the spectrum of the waveform, but for a white noise (the worst case) the rate is dU/dt = W*log(2*pi*e*N), where N is the noise power (which again depends on the choice of units).

However, just as continued observation cannot yield more information than the initial uncertainty of X, so continued external influences or other causes of variation cannot increase U_t(X) beyond its global value (its value determined from the long-term statistics of X).

(Figure 2)

----------
3. Information and history

In this section, we are interested in the mutual information between the preceding history and the following observation, which is symbolized M(X:Xhistory). In everyday language, we want to know how much we can and cannot know about what we will observe from what we have observed.

Mutual information need not be just between the states of system X and the states of a different system Y. It can be between the states of system X at two different times, which we can call t0 and t1. If t0 and t1 are well separated, the state of X at time t0 tells you nothing about its state at time t1. This would be the case at any time after Tc in Figure 2. But if t1 follows t0 very closely (i.e. near the left axis in Figure 2), it is very probable that the state of X has not changed very much, meaning that the mutual information between the two states is large. They are highly redundant, which is another way of saying that the mutual information between the two is close to its maximum possible value.

Suppose X is the weather. If it is sunny at 10 o'clock, it is likely to be sunny at one second past ten, or one minute past ten. The probability of it remaining sunny is still pretty high at 11, and still moderately high at ten tomorrow. But by this time next week, the observation that it is sunny now will not tell you very much more about whether it will be sunny then than will a book about the local climate at this time of year.

As the example suggests, M(X_t:X_(t-tau)) -- the complement of U_(t-tau)(X|observation at t0) shown in Figure 2 -- is a function of tau that usually decreases with increasing tau. I say "usually" because it is not true of periodic or nearly periodic systems. The temperature today is likely to be nearer the temperature this time last year than the temperature at some randomly chosen date through the year. For now we will ignore such cyclic variation, and treat M(X_t:X_(t-tau)) as if it declined with increasing tau until, at some value of tau, it reaches zero.

------footnote-------
For a signal of bandwidth W, M(X_t:X_(t-tau)) reaches zero at tau = 1/2W. No physical signal is absolutely restricted to a precisely defined bandwidth, but the approximation can be pretty close in some cases. Good audio sampling is much faster than 1/2W because the physical signals of musical instruments have appreciable power at frequencies well above the listener's hearing bandwidth.

-----end footnote-----

Suppose a signal is sampled at regular intervals separated by tau seconds, with tau small enough that knowledge of the signal value at one sample reduces the uncertainty of the state of the next and previous samples. The probability distribution of the next sample is constrained by the values of the previous samples. One might think of the value of the sample as being that of a perception, but that is not necessary.

(Figure 3)

Figure 3 suggests how knowledge of the past values may constrain the probability distribution of the next sample value when the inter-sample time, tau, is short compared to the rate of change of the variable.

The first panel suggests the uncertainty of the next sample if one knows nothing of the history of the variable. If the variable is from the set of possible states of X, its uncertainty is U(X). The second and third panels show how the uncertainty is reduced if the values of the previous one or two samples are known. In the middle panel, one possible value of the previous sample is shown, and in the third panel the values of the two previous samples are taken into account.

The figure is drawn to show the changes in a slowly changing waveform, but that is just one possibility. It could equally easily represent the probabilities in a sequence of letters in a text. For example, the left panel might represent the possibility of observing any letter of the alphabet, the middle panel the distribution of the next character after observing a "t", and the right panel the distribution of the following character after observing "th". Since we assume that the statistics of X are stationary (i.e. they do not depend on when you sample them, provided that the choice of the sampling moment is independent of anything happening in X), M(X0:X1) = M(X1:X2) = M(X2:X3) = ...

We want to know the uncertainty of X at a given moment t0, given its entire previous history. If X is a slowly varying waveform, its uncertainty knowing the previous history right up to t0 is zero, which is uninteresting. In that case we have to ask about its history only up to some time t0-tau. We will actually consider the history to consist of successive samples separated by tau seconds. In the case of a discrete system such as the letters of a text, no such restriction is required, but by making the restriction on the sampling of the continuous waveform we can treat both kinds of system similarly.

Formally, one can write

U(X0|History) = U(X0) - M(X0:History)

but that is not much use unless one can calculate the mutual information between the next sample value x0 and the history of system X. It is possible again to write a general formula, but its actual calculation depends on the sequential statistics of X.

The History of X can be written as a sequence of samples counting back from the present: X1, X2, X3, ... and M(X0:History) can be written M(X0:X1, X2, X3, ...)

M(X0:History) = M(X0:X1) + M(X0:X2|X1) + M(X0:X3|X1,X2) + M(X0:X4|X1,X2,X3) + ...

In words, the mutual information between the next sample and the history can be partitioned into the sum of an infinite series of mutual information elements, each one being the amount one sample contributes to the total mutual information, given the specific values of all the samples between it and the most recent sample.

To make this concrete, let's consider a few specific examples.

--------
Example 1: X is a sequence of symbols "a" and "b" that alternate: "...ababab..."

If one looks at an arbitrary moment, one has a 0.5 chance of the next sample being a or b: U(X0) = 1 bit. If one has seen one sample, which might have been either a or b, there is no uncertainty about the next sample: U(X0|X1) = 0, which gives M(X0:X1) = U(X0) = 1 bit.

What then of M(X0:X2|X1)? Remember that M(A:B) cannot exceed the lesser of U(A) and U(B). In this case, U(X2|X1) = 0 because if X1 was a, X2 must have been b, and vice versa. So M(X0:X2|X1) = 0, and all subsequent members of the series are also zero.

One can extend this example to any series or waveform that is known to be strictly periodic with a known period.

-------
Example 2: X is a sequence of symbols a, b, and c with the property that a is followed equally probably by b or c, b by c or a, and c by a or b. No symbol is ever repeated immediately, but apart from that, each other symbol is equally likely at any sample moment.

U(X0) = log(3) = 1.58 bits
U(X0|X1) = 1 bit, which gives
M(X0:X1) = 0.58 bits

It is irrelevant how X1 took on its value from X2. If x1 = a, x2 could equally well have been b or c. Therefore

M(X0:X2|X1) = 0
--------
Example 3: X is a sequence of five different symbols labelled 1, 2, 3, 4, 5. Considering 5 and 1 to be neighbours, each symbol can be followed by itself with probability 0.5 or by either of its neighbours with probability 0.25 each. Over the long term, each symbol appears with equal probability.

U(X0) = log(5) = 2.32 bits

U(X0|X1) = -0.5 log 0.5 - 2*(0.25 log 0.25) (because the probability is 0.5 that its predecessor was itself and 0.25 that it was either of its two neighbours)
         = 0.5 + 1 = 1.5 bits

M(X0:X1) = 2.32 - 1.5 = 0.82 bits

M(X0:X2|X1) depends on the relationship between X2 and X0,

unlike the situation in example 2. There are five possibilities for x0 (the value of X0), and the probability of each depends on the relationship between x0 and x2. We must consider each of these five, and partition M(X0:X2|X1) according to the probabilities for each value of X1. In a formula:

M(X0:X2|X1) = Sum_k(p(X1=xk)*M(X0:X2|X1=xk))

Since the probability p(X1=xk) is the same for each value of xk (all symbols are equally likely over the long term), this sum comes down to

M(X0:X2|X1) = (1/N)*Sum_k(M(X0:X2|X1=xk))

We consider all these possibilities individually. Since all the relationships are symmetric (meaning that it doesn't matter which value we select as the possible next sample, x0), we need deal only with the relationships themselves.

The sequence {x0, x1, x2} could be {x0, x0, x0}, {x0, x0, neighbour}, {x0, neighbour, x0}, {x0, neighbour, neighbour}, or {x0, neighbour, non-neighbour}. Examples of each might be {3,3,3}, {3,3,4}, {3,4,3}, {3,4,4}, and {3,4,5}, which would have probabilities 0.5*0.5 = 0.25, 0.5*2*0.25 = 0.25, 2*0.25*0.25 = 0.125, 2*0.25*0.5 = 0.25, and 2*0.25*0.25 = 0.125 respectively (the five cases together sum to 1.0).

There is no need to do the actual calculations here. They are easy but time-consuming. The point is that we can calculate M(X0:X2), M(X0:X2|X1), and the rest directly from the conditional probabilities.
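(For the record, here is the enumeration done in code -- a sketch of mine. It confirms U(X0) = 2.32 bits, U(X0|X1) = 1.5 bits, M(X0:X1) = 0.82 bits, and that the five three-sample patterns above exhaust the possibilities.)

```python
# Example 3 by direct enumeration.  Five symbols on a ring; each is
# followed by itself with p = 0.5 and by either neighbour with p = 0.25;
# the long-run distribution over symbols is uniform.
import math

N = 5
def trans(a, b):                 # p(next = b | current = a)
    if a == b:
        return 0.5
    return 0.25 if (a - b) % N in (1, N - 1) else 0.0

def U(probs):
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(U([1 / N] * N))                        # U(X0)    = 2.322 bits
print(U([trans(0, b) for b in range(N)]))    # U(X0|X1) = 1.5 bits
print(math.log2(N) - 1.5)                    # M(X0:X1) = 0.822 bits

# Relative probabilities of the three-sample patterns listed above:
patterns = {"{3,3,3}": 0.5 * 0.5,        "{3,3,4}": 0.5 * 2 * 0.25,
            "{3,4,3}": 2 * 0.25 * 0.25,  "{3,4,4}": 2 * 0.25 * 0.5,
            "{3,4,5}": 2 * 0.25 * 0.25}
print(sum(patterns.values()))                # 1.0: the cases are exhaustive
```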

---------
Example 4: X is built by summing a succession of samples of another system Y, so that each new value of X is the previous value plus the current sample of Y (starting from zero), where the samples of Y are a sequence of positive and negative integer values that average zero. In this example, the numbering scheme goes in the direction of time, later samples being labelled with larger numbers. At the k'th sample the expression M(X:Xhistory) becomes

M(Xk:Yk|X_(k-1), X_(k-2), ..., X_0)

Each sample of Y contributes equally to the value of X, so there is no need to consider the sequential order of the y values. Assuming that Y is a series of numerical values and the samples of Y are independent, the variance of X increases linearly with the number of samples, so its standard deviation increases proportionally to the square root of that number.

Let us trace the contribution of the first few samples of Y to the uncertainty of X.

After one sample, y0, of Y, x0 will equal y0, which means U(X0) = U(Y0).

M(X0:Y0) = U(X0) + U(Y0) - U(X0,Y0) = U(X0) = U(Y0)

After the next sample, U(X1) is still given by the formula -Sum(p log p), but the probabilities are no longer the same as they were before the first sample. There is a wider range of possibilities, from twice the minimum value of Y to twice the maximum value of Y, and the probabilities of those possibilities are not equal. Because the probability distribution over the possible values of X is not uniform, U(X1) < 2*U(Y).

As more and more samples are included in the Sum, the range of X increases, but values near zero come to have higher and higher relative probabilities. The distribution of probabilities over the possible values of X approaches a Gaussian whose variance increases linearly with k, so by the Gaussian formula given earlier U(X_k) approaches U(X_1) + 0.5*log(k). The ratio U(X_k)/(k*U(Y)), the uncertainty actually attained compared with the total uncertainty of k independent samples, therefore shrinks toward zero as k becomes large.
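(Numerically -- my sketch, using the Gaussian approximation just stated: the standard deviation of the sum of k samples is s*sqrt(k), so the uncertainty grows by 0.5*log2(k) and the per-sample increment shrinks.)

```python
# Growth of U(X_k) for a sum of k independent samples, each with
# standard deviation s, under the Gaussian approximation.
import math

H = math.log2(math.sqrt(2 * math.pi * math.e))
s = 1.0                                  # std deviation of one Y sample

def U_approx(k):
    return H + math.log2(s) + 0.5 * math.log2(k)

for k in (2, 10, 100, 1000):
    # total uncertainty, and the ever-smaller increment from sample k
    print(k, round(U_approx(k), 3), round(U_approx(k) - U_approx(k - 1), 5))
```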

We want to consider the contribution of the k'th sample of Y. How much does the k'th sample of Y reduce the uncertainty of the (k+1)'th value of the Sum, as compared to knowing only the History up to the previous sample? That is the mutual information between the k'th sample of Y and the (k+1)'th value of X.

For large k, U(X_k) - U(X_(k-1)) approaches 0.5*log(k/(k-1)), an increasingly small number. Yet it remains true that U(X_k|History) - U(X_k|History, Yk) = U(Y). The uncertainty of the value of X after the next sample of Y is always the uncertainty of Y, because if you know the current value of X, the next value is determined by the value of the Y sample. How can these two statements be reconciled? The key is the conditional of knowing the History. Without the prior History (in this case the Sum), the contribution of each successive Y sample to the uncertainty of X becomes less and less, but if you do have the prior History, the value of the next X sample will have an uncertainty that is the uncertainty of the Y sample.

The situation is the same if the system X is normalized, meaning that its variance is independent of the number of prior samples after a sufficient number of Y values have occurred. The magnitude of the contribution of the k'th Y sample is reduced in proportion to 1/sqrt(k), but its contribution to the various uncertainties is the same, provided that the different possible values of X are resolved.

--------
Example 5: The same as Example 4 except that the samples of Y are not independent. If Y is a continuous waveform, the samples are taken more closely than the Nyquist limit (samples separated by 1/2W, where W is the bandwidth of the waveform). If Y is a discrete set of possibilities, successive samples are related by some kind of grammar. The two kinds of possibility converge when the discrete set of possibilities is linearly arranged and the grammar merely makes transitions among nearby values more likely than across distant values.

The new issue here is that if the history of Y is known, U(Yk) < U(Y). Therefore M(Xk:Yk|History(X)) < U(Y). Each successive sample of Y reduces the uncertainty of X less than would be the case if the samples of Y were independent of each other. The difference is subsumed by the fact that M(Xk:History(X)) is greater than it is in the case when the Y samples are independent, by the same amount.

---------
Example 6: System X is a function of the current and past samples of system Y. We want to know the uncertainty of sample xk when the values of successive Y samples are known only up to sample y_(k-h). For example, in the grammar of example 3, we may have observed the sequence 1,2,1,5 (samples y1, y2, y3, y4) and want to know the uncertainty of sample x6, which is a function of y1, y2, y3, y4, y5, and y6. We do not yet know the values of y5 or y6. As a real-world example, a trader in the days of sailing ships might have known the market conditions in China and England several months before his ship captain actually bargains for tea in Canton, but the earlier conditions are the conditions on which he must base his instructions to the captain.

[Martin Taylor 2012.12.08.11.32]

.... We consider the "basic-standard" control system that has a pure integrator as its output function.

First, let's treat the effect of one step change in the disturbance value. We assume that the error is initially zero, and at time t0 the disturbance suddenly goes from whatever value it may have had previously to d0. We treat this single step, and then consider the effects of a series of steps. We start by showing how the integrator rate parameter controls how fast information about the disturbance arrives at the output -- how fast the control system measures the disturbance value.

Let us trace the signal values at various places in the loop as time passes after the step at time t0. At time t0+e (e means epsilon, an indefinitely small amount), the error value is -d0 and the output value qo is whatever it was before the step (assume zero), so it has an unchanged influence on the value of the input quantity qi. As time passes, the output value changes exponentially until after infinite time it reaches -d0. Assuming qo was initially zero, then at time t:

qo(t) = -d0*(1 - e^(-G*(t-t0)))

At the same time, the input value qi approaches zero by the same exponential:

qi(t) = d0*e^(-G*(t-t0))

Exponential changes in value are quite interesting in

information theory, as they often indicate a constant bit rate of information transfer. To see this, you must understand that although the absolute uncertainty about the value of a continuous variable either before or after a measurement or observation is always infinite, the information gained through the measurement or observation is finite. If the range of uncertainty about the value (such as the standard deviation of measurement) is r0 before the measure and r1 after the measure, the information gained is log2(r0/r1) bits.

Measurements and observations are never infinitely precise. When

someone says their perception is subjectively noise-free and precise, they nevertheless would be unable to tell by eye if an object changed location by a nanometer or declined in brightness by one part in a billion. There is always some limit to the resolution of any observation, so the issue of there being an infinite uncertainty about the value of a continuous variable is never a problem. In any case in which a problem of infinite uncertainty might seem to be presenting itself, the solution is usually to look at the problem from a different angle, but if that is not possible it is always possible to presume a resolution limit, a limit that might be several orders of magnitude more precise than any physically realizable limit, but a limit nevertheless.

(If you want to get a more accurate statement, and see the derivation of all this, I suggest you get Shannon's book from Amazon at the link I provided earlier.)

In the above example, qo could have started at any value whatever, and would still have approached -d0 exponentially with the same time constant. The output value qo has the character of a continuous measurement of the value of the disturbance, and this measurement gains precision at a rate given by e^(-G*(t-t0)). (The units of G are 1/seconds.)

What is this rate of gain in precision, in bits/sec? One bit of information has been gained when the uncertainty about the value of d0 has been halved. In other words, the output has gained one bit of information about the disturbance by the time tx such that

e^(-G*(tx-t0)) = 0.5

Taking logarithms (base 2) of both sides, we have

log2(e)*(-G*(tx-t0)) = -1

tx - t0 = 1/(G*log2(e)) = 1/(G*1.443...) seconds

That is the time it takes for qo to halve its distance to its final value -d0 no matter what its starting value might have been, which means it has gained one bit of information about the value of the disturbance. The bit rate is therefore G*1.443... bits/sec. That is the rate at which the output quantity qo gains information about the disturbance, and also the rate at which the input quantity loses information about the disturbance. The input quantity must lose information about the disturbance, because it always approaches the reference value, no matter what the value of the disturbance step. I suppose you could look upon G*1.443 as the bit rate of the loop considered as a communication channel, but it is probably not helpful to do so, as the communication is from the input quantity back to itself. Without the complete loop, qo does not act as a measure of d0, but continues increasing negatively without limit at a rate G*d0. It is better just to think of G*1.443 as the rate at which information from the disturbance appears at the output quantity.
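(The step-response argument is easy to simulate. The sketch below is my own construction -- unit environmental feedback, reference 0, Euler integration -- and just confirms that the distance of qo from -d0 halves every 1/(G*log2(e)) seconds, i.e. that information arrives at G*1.443 bits/sec.)

```python
# Pure-integrator control loop responding to a disturbance step d0.
import math

G, d0, dt = 10.0, 5.0, 1e-4            # integrator rate (1/s), step, time step
qo, t = 0.0, 0.0                       # output starts at zero for simplicity

halving_time = 1.0 / (G * math.log2(math.e))   # = ln(2)/G seconds
next_check = halving_time
while t < 5 * halving_time:
    qi = d0 + qo                       # input quantity: disturbance + output
    error = 0.0 - qi                   # reference is 0
    qo += G * error * dt               # pure integrator output function
    t += dt
    if t >= next_check:
        # distance from the final value -d0; it halves each halving_time
        print(round(t, 4), abs(qo - (-d0)))
        next_check += halving_time
```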

As a further matter of clarification, I suppose we ought to mention the behavioural illusion here. If the environmental feedback path is not a simple connector with a gain of unity, the output quantity differs from the disturbance value. The mismatch does not affect the information relationship between the disturbance and the output quantity, since the environmental feedback path provides a single-valued environmental feedback function. It is always possible to compute the influence of the output on the input quantity if you know the output value qo. This means that qo and the influence of the output on the input quantity are informationally identical, whatever their functional relationship, provided the environmental feedback function does not change. If it does change, the effect is similar to adding a noise source to a communication channel, a situation that was Shannon's primary concern and provides no particular complication to the analysis. (Or in the case of Marken's demo that involves switching the sense of the environmental feedback path, the communication analogy is the band-switching secret communication procedure patented by the movie star Hedy Lamarr; in both cases, the switching must be separately detected or otherwise known.)

As a specific example, if the environmental feedback path has a gain of 2, the output quantity will converge to a value of -d0/2. The convergence to -d0/2 is at the same exponential rate as would have been the case for an environmental feedback path gain of 1.0, providing an information gain rate that is determined only by the integration rate parameter G. Information from the disturbance still arrives at the output quantity (and is lost from the input quantity) at a rate G*1.443... bits/sec.

From now on, we will give the label "T" to the rate at which the output gains information about the disturbance: T = G*1.443... bits/sec.

The rate of information gain about the disturbance value at the output quantity is independent of the disturbance waveform and of the environmental feedback function. It is a property of the internal circuitry of the control system. Accordingly, we can make some estimates about the achievable precision of control for disturbances that are not simple step functions. First, however, we should look at the effect of a second and subsequent step change in the disturbance value.

The disturbance value changed abruptly to d0 at time t0. Now at time t1 the disturbance value jumps to d1. The input quantity adds d1-d0 to whatever value it had arrived at by t1. The error value jumps by the same amount, but the output value qo does not change immediately, because of the integrator output function. It starts changing exponentially toward (the negative of) the new value of the disturbance.

At t1+e, the output quantity has a value that is unknown, because ever since t0 it has been in the process of approaching -d0 exponentially from some unknown starting point. The uncertainty of its current value is T*(t1-t0) bits less than it had been at time t0, because it was performing a process functionally equivalent to measuring the value d0. Now it has to start forgetting the value d0 and start measuring the value d1. Less colloquially, the mutual information between the output quantity and the disturbance value is instantly decreased by the uncertainty of the change in disturbance value, while the mutual information between the disturbance value and the input value is increased by the same amount.

By how much are the mutual information values changed by the new step? Since the absolute uncertainty of a continuous variable is infinite, can that change be measured? Shannon shows again that it can. The amount is log2(r_new/r_old), where r_new is the new range of uncertainty (such as standard deviation) and r_old is the old range of uncertainty. The new range of uncertainty depends on the probability distribution of values that can be taken by the disturbance at that time, which may be unknown. It is not a new unknown, however, since the original disturbance step came from the same distribution, and the original value of the output quantity probably came from the same distribution (modified, of course, by the inverse of the environmental feedback function) if the system had been controlling prior to the first step that we analyzed.

After the second step change in the disturbance value, the output quantity is gaining information, still about the disturbance value, but not about d0. Now it is gaining information about d1 and losing information about d0. At the same time, the input quantity is losing information about d1, as well as about any remanent information about d0 that might still have been there at time t1. All this is happening at a rate of T = G*1.443... bits/sec.

The effects of third and subsequent steps in the value of the disturbance are the same. The output continues to gain information about the most recent value of the disturbance at T bits/sec while losing information about prior values at the same rate, and while the input also loses information about the new and old values at T bits/sec. Making the steps smaller and closer in time, we can approximate a smooth waveform with any desired degree of precision. No matter what the disturbance waveform, the output is still getting information about its value at T bits/sec, and losing information about its history at the same rate. So long as the disturbance variation produces less than T bits/sec, control is possible, but if the disturbance variation is greater than T bits/sec, control is lost. The spike in the input value at a step change in the disturbance is an example of that, where control is completely lost at the moment of the step, and gradually regained in the period when the disturbance remains steady, not generating uncertainty.

This is getting rather too long for a single message, so I will stop there. I hope it explains just how information gets from the disturbance to the output, and how fast it does so in the case of the ideal noiseless control system with no lag and a pure integrator output function.

Martin

http://www.amazon.com/Mathematical-Theory-Communication-Claude-Shannon/dp/0252725484/ref=sr_1_1?s=books&ie=UTF8&qid=1354736909&sr=1-1&keywords=Shannon%2C+C.+E.