# Some words on "information"

[Martin Taylor 960703 10:30]

Since I'm off for a couple of weeks in a few minutes, and won't get back
to this for some time after that, I thought I might post some of the
text I've drafted (but not edited) in preparation for making a Web
page on information in control. What follows may contain any number
of typos and even egregious errors, but I hope the main ideas will
not be too distorted. In a way, part 2.1 is an answer to Bill Powers'
questions on regular and irregular waveforms and how they relate to
uncertainty.

I hope there aren't too many specialized characters left in it that will
display as some kind of garbage. Anyway, here it is, for what it's worth.
I hope it is more help than hindrance to an understanding of what I am
talking about when I say "information."

1. Probability

1.1 Discrete events

Probability is a very tricky concept. Intuitively, it is quite simple, but when
one looks at it more deeply, it tends to get slippery, and to lead to massive
confusion.

Intuitively, one thinks of highly probable events as happening more frequently
than improbable events. But this cannot be right, because any event happens
only once over the whole life of the universe. Each event is unique. So what
do we mean by "frequent" when we think of events? We really are talking about
a set of unique events that seem to belong to a class of events. The identity
of the class is in our own heads. We say that one toss of a given coin is like
another of the same coin, and that a result of heads is as probable as a
result of tails. We say this because we have no reason to believe that any of
the factors that matter to whether the coin lands on one side or the other
have changed between tosses. Sure, the wind may have shifted, a car may have
driven down the street, but these are seen as irrelevant to the coin. We say
that the coin tosses are repeated events.

When we assert that the probabilities that a coin lands heads or tails are equal, we
are not talking about a fact of the world. We are talking about a perception
we have of the world, a perception based, perhaps, on observing that coin's
past behaviour, perhaps on observing many coins that were previously tossed,
perhaps on an intuition about the nature of the forces on nearly symmetrical
bodies. Whatever our perception of the probability is based on, it is a
personal perception, no more veridical with respect to a property of the
factual world than is any other perception.

To assess a probability, the range of possible results of an event must be
determined. Is the result of a coin toss a "head" or "tail", or is the
location at which the coin comes to rest part of the result? After all, if the
coin comes to rest in a crack in the floor, it may be standing near vertical.
Should that be part of the set of reasonable results, or should a coin toss be
discounted if that happens? What should one do if the coin bounces out of the
window, or is otherwise lost?

Probability estimates are conditional. The probability that the coin will land
"heads" is conditional on the toss being "fair", fair being defined at least
as not losing the coin or having it land on edge. Only if the coin lies
reasonably flat will the event be counted. (There are other elements of a fair
toss, of course, and a legal-minded person would probably want to make them
all explicit before tossing the coin).

Another basic statement about probability is a mathematical one. The
subjective feeling of probability ranges from being essentially certain that
the event will happen to being essentially certain that it will not.
Mathematics can't deal very well with "essentially certain" but it deals well
with numbers. So we assign numbers with conventionalized meanings. Zero
probability means certainty that the event will not have the designated
result, and unity means certainty that it will. If we know that the coin is
double-headed, then we can say that (conditional on the toss being fair) the
probability of heads is unity, and of tails is zero.

If the coin is a normal one, with a proper head and tail face, one can talk
about the probability it will land head or tail, without differentiating them.
That probability is 1.0, conditional on the toss being fair. We normally make
that kind of combination when we say that a head with the coin landing "here"
is the same result as a head with the coin landing "there." If we combine all
the results compatible with the condition (that the toss is fair), the
probability that one of them will occur is unity. But remember that p = 1.0 is
only a way of saying "I am absolutely certain."

Another mathematical requirement for working with probability is that if the
belief that one result will happen is stronger than that another will happen,
the numeric value assigned to the first is higher than that assigned to the
second. And if they are mutually exclusive, the sum of their probabilities is
no greater than unity. The sum is unity if and only if they are the only
possible results (as are "head" and "tail" in a fair coin toss).
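These requirements (probabilities between 0 and 1, mutually exclusive and exhaustive results summing to unity) can be illustrated with a minimal Python sketch. The simulated coin and the value 0.5 are modelling assumptions, in keeping with the point above that the probability lives in the observer's head, not in the coin:

```python
import random

# Estimate P(heads) by counting results over many simulated tosses.
# The coin model (p = 0.5) is an assumption, not a fact of the world.
def estimate_heads(n_tosses, seed=0):
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

p_heads = estimate_heads(100_000)
p_tails = 1.0 - p_heads  # "heads" and "tails" exhaust the fair results

# Mutually exclusive, exhaustive results: probabilities sum to unity.
assert abs(p_heads + p_tails - 1.0) < 1e-12
print(f"estimated P(heads) = {p_heads:.4f}")
```

Note that the estimate is itself conditional: tosses that would be "unfair" (lost coins, coins on edge) simply never occur in this model.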

1.2 Continuous ranges of possibility

If the set of possible results is continuous, such as that the measured height
of a child will be 1.3456234...m, then the probability of any particular value
occurring is vanishingly small, and the concept of "probability density" has
to be introduced. Even if, by some magic, you knew absolutely that the correct
height was exactly 1.3456234408, how strongly would you believe that the
measurement would show that value rather than 1.34562344080001? But on the
other hand, how strongly would you believe that the measurement would show a
value between 1 and 1.5m? If you specify a range of values, it is reasonable
for you to believe, strongly or weakly, that the measurement will fall within
that range. It is more likely that the child will be found to be between
1.3 and 1.4m than between 1 and 1.1m.

If the probability that the measurement of x lies between a and b is pab(x),
the average probability density of x over the range a to b is pab(x)/(b-a).
As (b-a) approaches zero, this value approaches the "probability density at
x."
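This limiting process can be made concrete with a minimal Python sketch, assuming for illustration that the measured height follows a normal distribution with mean 1.35m and standard deviation 0.05m (invented numbers, not from the text). As the interval shrinks, the average density pab(x)/(b-a) approaches the density at x:

```python
import math

# Assumed-for-illustration distribution of measured heights: N(1.35, 0.05^2).
MU, SIGMA = 1.35, 0.05

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf((x - MU) / (SIGMA * math.sqrt(2.0))))

def normal_pdf(x):
    z = (x - MU) / SIGMA
    return math.exp(-0.5 * z * z) / (SIGMA * math.sqrt(2.0 * math.pi))

def avg_density(a, b):
    # pab(x)/(b-a): probability of the interval [a, b], divided by its width
    return (normal_cdf(b) - normal_cdf(a)) / (b - a)

x0 = 1.34
for width in (0.1, 0.01, 0.001):
    print(f"width {width}: avg density {avg_density(x0 - width/2, x0 + width/2):.4f}")
print(f"density at {x0}: {normal_pdf(x0):.4f}")
```

The probability of any exact value is zero, but the probability of a range (say, between 1 and 1.5m) is large, just as the text describes.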

Returning to the example of the coin toss, the location at which the coin
falls has a continuous range of possibility. We should perhaps be talking
about the probability density of a "head at x0, y0." But we can integrate the
probability density over all x, y, leaving only the discrete probabilities of
"head" and "tail."

Sometimes we can't find a natural set of discretely different results, and we
have to get probabilities simply by selecting regions of the result space and
integrating the probability density over them. This is what we do when we make
a measurement. The measuring instrument gives a reading on a continuous scale,
which we arbitrarily reduce to membership in a bin such as "the child is 1m 34
tall." The probability of getting that measurement is the integral of the
probability distribution over the range of measurements we put into that bin.

A probability distribution is simply the way the probability varies over all
the discrete alternatives, or the way a probability density varies over the
range of possibilities. The sum over all the alternatives, or the integral
over all the possibilities, must sum to unity. And nowhere is the probability
or the probability density negative.
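The binning procedure can be sketched in Python. Simulated measurements (again an assumed normal distribution, with invented parameters) are reduced to membership in 1cm bins, and the resulting discrete distribution is checked against the two requirements just stated:

```python
import random

# Simulated measurements, assuming (for illustration) heights ~ N(1.35, 0.05^2).
rng = random.Random(1)
measurements = [rng.gauss(1.35, 0.05) for _ in range(10_000)]

# Reduce each continuous reading to membership in a 1 cm bin,
# e.g. "the child is 1m 34" corresponds to bin number 134.
bin_width = 0.01
counts = {}
for m in measurements:
    b = int(m / bin_width)
    counts[b] = counts.get(b, 0) + 1

probs = {b: c / len(measurements) for b, c in counts.items()}

# A probability distribution: nowhere negative, sums to unity.
assert all(p >= 0.0 for p in probs.values())
assert abs(sum(probs.values()) - 1.0) < 1e-9
```

Each bin probability is, in effect, the integral of the probability density over the range of measurements assigned to that bin.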

2. Information and Uncertainty

2.1 Single variables

Uncertainty is a property of a probability distribution, which is to say any
distribution that is always non-negative and sums to unity over the full range
of the variable over which the distribution is taken. The variable may be
single-dimensional, multi-dimensional, or even infinite-dimensional. The value
of the uncertainty U(x) of a probability distribution p(x) is defined as
U(x) = -(sum over all x of p(x) log(p(x)))
(or as the negative of the integral over all x of pd(x) log(pd(x)),
where pd(x) is the probability density over x).
The rationale for using this logarithmic function was presented by Shannon in
Shannon and Weaver (1949; The Mathematical Theory of Communication). It is
unique in several respects that relate to the everyday concept of
"uncertainty."
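The definition is easy to compute directly. A minimal Python sketch, using base-2 logarithms so that uncertainty comes out in bits, applied to the coin examples from section 1:

```python
import math

def uncertainty(probs):
    """U = -sum p log2(p), in bits; terms with p == 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

fair = uncertainty([0.5, 0.5])           # maximum for two alternatives: 1 bit
double_headed = uncertainty([1.0, 0.0])  # certainty: no uncertainty at all
biased = uncertainty([0.9, 0.1])         # somewhere in between
print(fair, double_headed, biased)
```

The double-headed coin, whose result is certain, has zero uncertainty; the fair coin has the maximum possible for two alternatives.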

Since uncertainty is a measure characterizing a probability distribution, it
may be conditional on whatever might affect the probability distribution. In
particular, if the probability distribution over x changes after an
observation of y, U(x) may differ from U(x|y) (read: the uncertainty of x
given knowledge of y).

For any single variable x(t), there are two uncertainties of interest about
the value of x(t0), the value of x(t) at time t0: the raw uncertainty U(x(t0))
and the uncertainty given knowledge of the waveform up to time t1<t0
U(x(t0)|x(from t= -infinity to t1)).
(For notational convenience, this latter will be written Ut1(x(t0)).)

Ut1(x(t0)) ranges from zero for t1=t0 to U(x(t0)) for t1 = minus infinity for
all physically realizable signals. As a function of t1, Ut1(x(t0)) is an
informational analogue of the autocorrelation function of x(t). Unlike the
autocorrelation function, it is never negative. For t1>t0, it is always
zero, since the value of x(t0) would then be included in the part of the

Information is defined as reduction in uncertainty. The Information about
x(t0) provided by x(from t= -infinity to t1) is
U(x(t0)) - Ut1(x(t0)) = It1(x(t0)).
It is an even closer analogue to the autocorrelation function of x(t), since
for t1 = t0 it takes on its maximum value, and declines monotonically to zero
for t1 approaching minus infinity.

The Information Rate R(x(t)) of x(t) can be taken as the slope of It1(x(t0))
as t1 approaches t0. R(t)=dI(t)/dt. It indicates how much new information is
gained per second about the value at t0, by observation of the preceding
portion of the waveform. The integral of the information rate is obviously the
total information available up to time t1 about what value the waveform will
take on at time t0.
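These quantities have closed forms in the Gaussian case, which makes for a compact illustration. The following Python sketch (an invented example, not from the text) uses a first-order autoregressive signal: the raw uncertainty uses the marginal variance, the uncertainty given the immediately preceding waveform uses the innovation variance, and the difference is the information the past provides. (Differential entropies are used, so only the difference is meaningful.)

```python
import math

# Hypothetical Gaussian AR(1) signal: x[n] = a*x[n-1] + e[n], e ~ N(0, s_e2).
# Differential entropy of a Gaussian with variance v is 0.5*log2(2*pi*e*v).
def gaussian_entropy(variance):
    return 0.5 * math.log2(2.0 * math.pi * math.e * variance)

a, s_e2 = 0.95, 1.0
marginal_var = s_e2 / (1.0 - a * a)       # variance when the past is unknown

raw_u = gaussian_entropy(marginal_var)    # U(x(t0)): no knowledge of the past
cond_u = gaussian_entropy(s_e2)           # Ut1(x(t0)) with t1 just before t0
info = raw_u - cond_u                     # It1(x(t0)) = -0.5*log2(1 - a^2)

print(f"information from the past: {info:.3f} bits")
```

The more predictable the signal (a closer to 1), the more information the prior waveform supplies, and the less the observation at t0 itself adds.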

If one knows a waveform to be periodic with a certain period, then knowledge
of the waveform over any complete period serves to define it for eternity.
There is no remaining uncertainty as to the value x(t) for any specific t,
because the periodic waveform need only be replicated until the desired value
of t is reached. There is no uncertainty about the value at t, and the
information rate of the signal is zero after one cycle has been observed.

In most cases of so-called "periodic" waveforms, the replication is not exact
from cycle to cycle, if only because of noise. In such cases, information is
gained about x(t), if only slowly, by continuing to observe the waveform until
time t, and the waveform does have some information rate, perhaps very low,
but not zero. A waveform that is nearly periodic because it is a modulated
carrier also allows x(t) to be approximated by extrapolation from prior cycles
of the carrier, but information is still gained by observing x up to time t.
In this case, the information rate (apart from noise effects) is that of the
modulation signal.

Always, the amount of information gained about the value of x(t) from
observing earlier parts of the waveform depends on the model the observer has
of the waveform. If the observer believes that the waveform is really random,
and that the periodicity so far observed is only happenstance, then no
information about x(t) will be gained from prior observation. The observer's
uncertainty as to the value of x(t) will remain unchanged until time t, when
an actual observation will reduce the uncertainty to zero.

The information gained by that one observation will be equal to the original
uncertainty. In contrast, if the observer believes the waveform to change only
smoothly, then the uncertainty of x(t) will be continuously reduced as time t
approaches, and the actual observation of x(t) will provide an infinitesimal
amount of information. Observation of the prior signal provided it all.

"The observer believes" can be mapped directly into "the filter has the form
that...". The observer believing that the waveform is periodic is mappable
into a filter with zero bandwidth. The observer believing that the
waveform is purely random is mappable onto a filter with infinite bandwidth,
and the observer believing that the waveform is a modulated carrier is
mappable onto a filter that has a finite bandwidth centred on some non-zero
frequency representing the carrier. The "observer" need not be a person; it
depends only on the way in which the signal as observed or filtered will
change over time. More on this in the section on filters and functions.

2.2 Pairs of variables

The situation is more complex when there are two variables x(t), y(t). It may
be that there is some commonality in the sources of the two waveforms, such
that U(x(t0)) > U(x(t0)|y(from t= -infinity to t1)). In this case knowledge of
the waveform of y before t1 reduces the uncertainty of x(t0). One can talk
about the information I(x(t0)|y(from t= -infinity to t1)).
A non-zero value of I does not imply any physical connection between y and x,
and even if there is a connection, it could be in either direction. The
relationship is analogous to the correlation between two variables. A non-zero
correlation does not imply a causal relationship between the variables, and
nor does a non-zero amount of information provided by one about the other.

Even if y affects x directly, such that x(t) is a function of y(tau), the
effect might be delayed or spread out over time, as it would be if
x = integral(ydt). If the effect of y on x is delayed by deltat, then the
slope of I(x(t0)|y(from t= -infinity to t1)) as a function of t1
will be zero for t1 > t0-deltat. One cannot, therefore, use dI(yx)/dt at t1=t0
as the information rate of y about x in the same way one can use dI(xx)/dt at
t1=t0 as the information rate of x(t). Instead, one can use the maximum of
dI(yx)/dt as a function of t0-t1 as the "channel rate" between y and x.

It is important to remember two things: (1) that the assumed knowledge of y
extends over its whole waveform prior to t1, and (2) that the existence of a
non-zero channel rate between y and x does not imply that there is a physical
connection between them. There is a non-zero channel rate between my activity
level in North America and the activity level of a person in Australia, even
though we have never communicated with each other, because we both are likely
to be more active in the daytime than at night.
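The activity-level example can be mimicked numerically. In this Python sketch (all parameters invented), x and y share a common "daily activity" component but neither affects the other; for jointly Gaussian variables, the information one provides about the other is -0.5*log2(1 - rho^2), where rho is their correlation:

```python
import math
import random

# Two signals with a shared source s but no connection between them:
# x = s + noise1, y = s + noise2.
rng = random.Random(2)
n = 50_000
s = [rng.gauss(0.0, 1.0) for _ in range(n)]
x = [si + rng.gauss(0.0, 1.0) for si in s]
y = [si + rng.gauss(0.0, 1.0) for si in s]

def corr(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    du = math.sqrt(sum((a - mu) ** 2 for a in u))
    dv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return num / (du * dv)

rho = corr(x, y)  # about 0.5: shared source, no causal link either way
info = -0.5 * math.log2(1.0 - rho ** 2)
print(f"rho = {rho:.3f}, information of y about x = {info:.3f} bits")
```

The information is non-zero even though no signal passes between x and y, just as the correlation is non-zero: neither implies a physical connection.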

3. Filters and Functions

If x(t) is the output of a filter or function whose input is y(t), then, apart
from noise, bandwidth, or resolution limits, x(t) is fully determined by y(t),
in the sense that knowing the filter function and y(t) from minus infinity to
t0 would be sufficient to specify x(t0).
U(x(t0)|y(from t= -infinity to t1)) -> 0 as t0-t1 approaches zero.
But physically realizable filters do have limits on all these parameters, so
U(x(t0)|y(from t= -infinity to t1)) has a lower bound greater than zero. The
input to the filter cannot completely specify the output.

The input may have frequency components outside the acceptance bandwidth of
the filter, in which case an infinite number of different possible input
signals would provide the same output signal. The filter may be able to
resolve only down to some finite limit, either because of internal
quantization or because of physically unavoidable noise, in which case again
several different signals might provide the same output signal. Either way,
given the filter output, the uncertainty of the input may be much reduced, but
it is not reduced to zero.

···

-------------------------

See you in 2 or 3 weeks.

Martin

[Hans Blom, 960705b]

Nice work, Martin. Just a few critical remarks, as always ;-).

For any single variable x(t), there are two uncertainties of interest
about the value of x(t0), the value of x(t) at time t0: the raw
uncertainty U(x(t0)) and the uncertainty given knowledge of the
waveform up to time t1<t0, U(x(t0)|x(from t= -infinity to t1)).

This needs to be written in capitals and repeated frequently. It is
the basis of all human knowledge and of all modelling! Understanding
that the second may be much smaller than the first will eliminate a
lot of confusion. One of the things it says is that, although we may
have a measurement inaccuracy of +/- 100, I can still know the value
of an observed variable with an accuracy of, say, +/- 1 or +/- 0.0001
or even much better, as in some physics constants. This is due to a
(predictive) "model" of the signal that is based on the signal's past
and that greatly reduces the "raw" uncertainty. This assumes, of
course, that some of the signal's properties do not vary over time.

In most cases of so-called "periodic" waveforms, the replication is
not exact from cycle to cycle, if only because of noise.

Here you need to discriminate between two possible causes/noises. The
first is, that the waveform itself is not 100% periodic. The second
is that, although the waveform is repeating exactly, our observations
of it have an irreplicable component. In control engineering, the
first is called "system noise", the second "observation noise". The
difference is crucial. Observation noise can be easily gotten rid of,
but system noise keeps changing the signal and makes previously
acquired knowledge inaccurate, because outdated. If the system noise
is large, the waveform is hardly periodic anymore, and its previous
values tell little about its current value.
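This distinction can be illustrated with a minimal Python sketch (parameters invented). Averaging past observations beats down observation noise, but when there is system noise, the same average describes an outdated signal and the error stays large:

```python
import random
import statistics

rng = random.Random(0)

def final_error(sys_sd, obs_sd, n=400):
    """Estimate the signal's current value by averaging all past
    observations; return the absolute error of that estimate."""
    x = 0.0
    observations = []
    for _ in range(n):
        x += rng.gauss(0.0, sys_sd)  # system noise moves the signal itself
        observations.append(x + rng.gauss(0.0, obs_sd))  # observation noise does not
    return abs(statistics.mean(observations) - x)

trials = 200
obs_only = statistics.mean(final_error(0.0, 1.0) for _ in range(trials))
sys_only = statistics.mean(final_error(0.1, 0.0) for _ in range(trials))
print(f"mean error, observation noise only: {obs_only:.3f}")
print(f"mean error, system noise only:      {sys_only:.3f}")
```

With observation noise only, the error shrinks roughly as 1/sqrt(n); with system noise, previously acquired knowledge is outdated and averaging does not recover the current value.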

That's all for now.

Greetings,

Hans