[Martin Taylor 960703 10:30]

Since I'm off for a couple of weeks in a few minutes, and won't get back

to this for some time after that, I thought I might post some of the

text I've drafted (but not edited) in preparation for making a Web

page on information in control. What follows may contain any number

of typos and even egregious errors, but I hope the main ideas will

not be too distorted. In a way, part 2.1 is an answer to Bill Powers'

questions on regular and irregular waveforms and how they relate to

uncertainty.

I hope there aren't too many specialized characters left in that will

display as some kind of garbage. Anyway, here it is, for what it's worth.

I hope it is more help than hindrance to an understanding of what I am

talking about when I say "information."

1. Probability

1.1 Discrete events

Probability is a very tricky concept. Intuitively, it is quite simple, but when

one looks at it more deeply, it tends to get slippery, and to lead to massive

confusion.

Intuitively, one thinks of highly probable events as happening more frequently

than improbable events. But this cannot be right, because any event happens

only once over the whole life of the universe. Each event is unique. So what

do we mean by "frequent" when we think of events? We really are talking about

a set of unique events that seem to belong to a class of events. The identity

of the class is in our own heads. We say that one toss of a given coin is like

another of the same coin, and that a result of heads is as probable as a

result of tails. We say this because we have no reason to believe that any of

the factors that matter to whether the coin lands on one side or the other

have changed between tosses. Sure, the wind may have shifted, a car may have

driven down the street, but these are seen as irrelevant to the coin. We say

that the coin tosses are repeated events.

When we assert that the probabilities that a coin lands head or tail are equal, we

are not talking about a fact of the world. We are talking about a perception

we have of the world, a perception based, perhaps, on observing that coin's

past behaviour, perhaps on observing many coins that were previously tossed,

perhaps on an intuition about the nature of the forces on nearly symmetrical

bodies. Whatever our perception of the probability is based on, it is a

personal perception, no more veridical with respect to a property of the

factual world than is any other perception.

To assess a probability, the range of possible results of an event must be

determined. Is the result of a coin toss a "head" or "tail", or is the

location at which the coin comes to rest part of the result? After all, if the

coin comes to rest in a crack in the floor, it may be standing near vertical.

Should that be part of the set of reasonable results, or should a coin toss be

discounted if that happens? What should one do if the coin bounces out of the

window, or is otherwise lost?

Answering these questions leads to a basic statement about probability: All

probability estimates are conditional. The probability that the coin will land

"heads" is conditional on the toss being "fair", fair being defined at least

as not losing the coin or having it land on edge. Only if the coin lies

reasonably flat will the event be counted. (There are other elements of a fair

toss, of course, and a legal-minded person would probably want to make them

all explicit before tossing the coin).

Another basic statement about probability is a mathematical one. The

subjective feeling of probability ranges from being essentially certain that

the event will happen to being essentially certain that it will not.

Mathematics can't deal very well with "essentially certain" but it deals well

with numbers. So we assign numbers with conventionalized meanings. Zero

probability means certainty that the event will not have the designated

result, and unity means certainty that it will. If we know that the coin is

double-headed, then we can say that (conditional on the toss being fair) the

probability of heads is unity, and of tails is zero.

If the coin is a normal one, with a proper head and tail face, one can talk

about the probability it will land head or tail, without differentiating them.

That probability is 1.0, conditional on the toss being fair. We normally make

that kind of combination when we say that a head with the coin landing "here"

is the same result as a head with the coin landing "there." If we combine all

the results compatible with the condition (that the toss is fair), the

probability that one of them will occur is unity. But remember that p = 1.0 is

only a way of saying "I am absolutely certain."

Another mathematical requirement for working with probability is that if the

belief that one result will happen is stronger than that another will happen,

the numeric value assigned to the first is higher than that assigned to the

second. And if they are mutually exclusive, the sum of their probabilities is

no greater than unity. The sum is unity if and only if they are the only

possible results (as are "head" and "tail" in a fair coin toss).
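These requirements can be illustrated with a short simulation (a sketch in Python, not part of the original discussion; the seed and toss count are arbitrary choices):

```python
import random

random.seed(1)  # arbitrary seed, so the run is repeatable

# Each toss is a unique event, but we treat all tosses of the coin
# as members of a single class of "repeated events".
tosses = [random.choice(["head", "tail"]) for _ in range(10_000)]

# Empirical probabilities: relative frequencies within the class.
p = {r: tosses.count(r) / len(tosses) for r in ("head", "tail")}

# "Head" and "tail" are mutually exclusive and (given a fair toss)
# exhaust the possible results, so their probabilities sum to unity.
assert abs(p["head"] + p["tail"] - 1.0) < 1e-12
print(p)
```

The observed frequencies hover near 0.5 for each face, but note that the equality of the two probabilities remains a perception about the class of tosses, not a fact we have measured exactly.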

1.2 Continuous ranges of possibility

If the set of possible results is continuous, such as that the measured height

of a child will be 1.3456324...m, then the probability of any particular value

occurring is vanishingly small, and the concept of "probability density" has

to be introduced. Even if, by some magic, you knew absolutely that the correct

height was exactly 1.3456324408, how strongly would you believe that the

measurement would show that value rather than 1.34563244080001? But on the

other hand, how strongly would you believe that the measurement would show a

value between 1 and 1.5m? If you specify a range of values, it is reasonable

for you to believe, strongly or weakly, that the measurement will fall within

that range. It is more likely that the child will be found to be between

1.3 and 1.4m than between 1 and 1.1m.

If the probability that the measurement of x lies between a and b is pab(x),

the average probability density of x over the range a to b is pab(x)/(b-a).

As (b-a) approaches zero, this value approaches the "probability density at

x."
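The limiting process just described can be sketched numerically (in Python; the choice of a standard normal distribution is purely illustrative):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def avg_density(a, b):
    """Average probability density over [a, b]: p_ab / (b - a)."""
    return (normal_cdf(b) - normal_cdf(a)) / (b - a)

x = 0.0
true_density = 1.0 / math.sqrt(2.0 * math.pi)  # density at x = 0

# As the interval (b - a) shrinks around x, the average density
# approaches the probability density at x.
for h in (0.5, 0.05, 0.005):
    print(h, avg_density(x - h, x + h))
```

Each successive value lands closer to the density at x, about 0.3989, as the interval shrinks.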

Returning to the example of the coin toss, the location at which the coin

falls has a continuous range of possibility. We should perhaps be talking

about the probability density of a "head at x0, y0." But we can integrate the

probability density over all x, y, leaving only the discrete probabilities of

the results "head" and "tail."

Sometimes we can't find a natural set of discretely different results, and we

have to get probabilities simply by selecting regions of the result space and

integrating the probability density over them. This is what we do when we make

a measurement. The measuring instrument gives a reading on a continuous scale,

which we arbitrarily reduce to membership in a bin such as "the child is 1m 34

tall." The probability of getting that measurement is the integral of the

probability distribution over the range of measurements we put into that bin.
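Binning a continuous measurement can be sketched the same way (in Python; the mean and spread of the height distribution are made-up numbers for illustration only):

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Hypothetical distribution of a child's measured height, in metres
# (mu and sigma are illustrative assumptions, not data).
mu, sigma = 1.34, 0.05

# Reduce the continuous scale to 1 cm bins: the probability of the
# reading "1.34 m" is the integral of the density over that bin.
edges = [1.20 + 0.01 * i for i in range(31)]  # 1.20 m .. 1.50 m
bins = {round(a, 2): normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)
        for a, b in zip(edges, edges[1:])}

print(bins[1.34])  # probability that the measurement falls in this bin
```

The bin probabilities are ordinary discrete probabilities: each is non-negative, and together they sum to (nearly) unity over the range covered by the bins.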

A probability distribution is simply the way the probability varies over all

the discrete alternatives, or the way a probability density varies over the

range of possibilities. The sum over all the alternatives, or the integral

over all the possibilities, must equal unity. And nowhere is the probability

or the probability density negative.

2. Information and Uncertainty

2.1 Single variables

Uncertainty is a property of a probability distribution, which is to say any

distribution that is always non-negative and sums to unity over the full range

of the variable over which the distribution is taken. The variable may be

single-dimensional, multi-dimensional, or even infinite-dimensional. The value

of the uncertainty U(x) of a probability distribution p(x) is defined as

U(x) = -sum over all x of p(x) log(p(x))

(or as minus the integral over all x of pd(x) log(pd(x)) dx,

where pd(x) is the probability density over x).
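A minimal sketch of this computation in Python, using base-2 logarithms so that uncertainty comes out in bits (note the leading minus sign, which makes the value non-negative, since each log p(x) term is itself negative or zero):

```python
import math

def uncertainty(p):
    """U = -(sum of p(x) log2 p(x)), in bits; zero-probability terms add 0."""
    return -sum(px * math.log2(px) for px in p if px > 0)

print(uncertainty([0.5, 0.5]))   # fair coin: 1 bit
print(uncertainty([1.0, 0.0]))   # double-headed coin: no uncertainty
print(uncertainty([0.25] * 4))   # four equally likely results: 2 bits
```

Uncertainty is greatest when all the alternatives are equally probable, and zero when one result is certain, matching the everyday sense of the word.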

The rationale for using this logarithmic function was presented by Shannon in

Shannon and Weaver (1949; The Mathematical Theory of Communication). It is

unique in several respects that relate to the everyday concept of

"uncertainty."

Since uncertainty is a measure characterizing a probability distribution, it

may be conditional on whatever might affect the probability distribution. In

particular, if the probability distribution over x changes after an

observation of y, U(x) may differ from U(x|y) (read: the uncertainty of x

given knowledge of y).
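The distinction between U(x) and U(x|y) can be sketched from a joint distribution (in Python; the joint probabilities are illustrative numbers chosen so that y carries some information about x):

```python
import math

def U(p):
    """Uncertainty of a distribution, in bits."""
    return -sum(v * math.log2(v) for v in p if v > 0)

# Joint distribution p(x, y) over two binary variables (assumed values).
joint = {("a", 0): 0.4, ("a", 1): 0.1,
         ("b", 0): 0.1, ("b", 1): 0.4}

# Marginal distributions of x and of y.
px = {x: sum(v for (xx, y), v in joint.items() if xx == x) for x in "ab"}
py = {y: sum(v for (x, yy), v in joint.items() if yy == y) for y in (0, 1)}

# U(x|y): the uncertainty of x given y's value, averaged over y.
Ux_given_y = sum(
    py[y] * U([joint[(x, y)] / py[y] for x in "ab"]) for y in (0, 1))

print(U(list(px.values())), Ux_given_y)  # U(x) exceeds U(x|y)
```

Here U(x) is 1 bit, but once y has been observed the remaining uncertainty about x drops to roughly 0.72 bits: observing y changed the probability distribution over x.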

For any single variable x(t), there are two uncertainties of interest about

the value of x(t0), the value of x(t) at time t0: the raw uncertainty U(x(t0))

and the uncertainty given knowledge of the waveform up to time t1<t0

U(x(t0)|x(from t= -infinity to t1)).

(For notational convenience, this latter will be written Ut1(x(t0)).)

Ut1(x(t0)) ranges from zero for t1=t0 to U(x(t0)) for t1 = minus infinity for

all physically realizable signals. As a function of t1, Ut1(x(t0)) is an

informational analogue of the autocorrelation function of x(t). Unlike the

autocorrelation function, it is never negative. For t1>t0, it is always

zero, since the value of x(t0) would then be included in the part of the

waveform that is already known.

Information is defined as reduction in uncertainty. The Information about

x(t0) provided by x(from t= -infinity to t1) is

U(x(t0)) - Ut1(x(t0)) = It1(x(t0)).

It is an even closer analogue to the autocorrelation function of x(t), since

for t1 = t0 it takes on its maximum value, and declines monotonically to zero

for t1 approaching minus infinity.

The Information Rate R(x(t)) of x(t) can be taken as the slope of It1(x(t0))
as t1 approaches t0: R = dIt1(x(t0))/dt1. It indicates how much new information is

gained per second about the value at t0, by observation of the preceding

portion of the waveform. The integral of the information rate is obviously the

total information available up to time t1 about what value the waveform will

take on at time t0.

If one knows a waveform to be periodic with a certain period, then knowledge

of the waveform over any complete period serves to define it for eternity.

There is no remaining uncertainty as to the value x(t) for any specific t,

because the periodic waveform need only be replicated until the desired value

of t is reached. There is no uncertainty about the value at t, and the

information rate of the signal is zero after one cycle has been observed.
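The case of an exactly periodic waveform can be made concrete (a sketch in Python; the period length and sample values are arbitrary):

```python
# One complete period of a discretely sampled waveform (arbitrary
# illustrative values); the period T is 4 samples.
period = [0.0, 1.0, 0.0, -1.0]
T = len(period)

def x(t):
    """Value at any integer time t, by replicating the known period."""
    return period[t % T]

# Once one period is known, there is no remaining uncertainty about
# the value at any time, however distant.
print(x(1_000_003))  # fully determined by the observed period
```

Knowledge of one cycle determines every future sample, so further observation provides no new information about any x(t): the information rate is zero.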

In most cases of so-called "periodic" waveforms, the replication is not exact

from cycle to cycle, if only because of noise. In such cases, information is

gained about x(t), if only slowly, by continuing to observe the waveform until

time t, and the waveform does have some information rate, perhaps very low,

but not zero. A waveform that is nearly periodic because it is a modulated

carrier also allows x(t) to be approximated by extrapolation from prior cycles

of the carrier, but information is still gained by observing x up to time t.

In this case, the information rate (apart from noise effects) is that of the

modulation signal.

Always, the amount of information gained about the value of x(t) from

observing earlier parts of the waveform depends on the model the observer has

of the waveform. If the observer believes that the waveform is really random,

and that the periodicity so far observed is only happenstance, then no

information about x(t) will be gained from prior observation. The observer's

uncertainty as to the value of x(t) will remain unchanged until time t, when

an actual observation will reduce the uncertainty to zero.

The information gained by that one observation will be equal to the original

uncertainty. In contrast, if the observer believes the waveform to change only

smoothly, then the uncertainty of x(t) will be continuously reduced as time t

approaches, and the actual observation at time t will provide an infinitesimal

amount of information. Observation of the prior signal provided it all.

"The observer believes" can be mapped directly into "the filter has the form

that...". The observer believing that the waveform is periodic is mappable

into a filter with zero bandwidth. The observer believing that the

waveform is purely random is mappable onto a filter with infinite bandwidth,

and the observer believing that the waveform is a modulated carrier is

mappable onto a filter that has a finite bandwidth centred on some non-zero

frequency representing the carrier. The "observer" need not be a person; it

depends only on the way in which the signal as observed or filtered will

change over time. More on this in the section on filters and functions.

2.2 Pairs of variables

The situation is more complex when there are two variables x(t), y(t). It may

be that there is some commonality in the sources of the two waveforms, such

that U(x(t0)) > U(x(t0)|y(from t= -infinity to t1)). In this case knowledge of

the waveform of y before t1 reduces the uncertainty of x(t0). One can talk

about the information provided by y(t) about x(t),

I(x(t0)|y(from t= -infinity to t1)).

A non-zero value of I does not imply any physical connection between y and x,

and even if there is a connection, it could be in either direction. The

relationship is analogous to the correlation between two variables. A non-zero

correlation does not imply a causal relationship between the variables,

nor does a non-zero amount of information provided by one about the other.

Even if y affects x directly, such that x(t) is a function of y(tau), the

effect might be delayed or spread out over time, as it would be if

x = integral(y dt). If the effect of y on x is delayed by deltat, then the

slope of I(x(t0)|y(from t= -infinity to t1)) as a function of t1

will be zero for t1 > t0-deltat. One cannot, therefore, use dI(yx)/dt at t1=t0

as the information rate of y about x in the same way one can use dI(xx)/dt at

t1=t0 as the information rate of x(t). Instead, one can use the maximum of

dI(yx)/dt as a function of t0-t1 as the "channel rate" between y and x.

It is important to remember two things: (1) that the assumed knowledge of y

extends over its whole waveform prior to t1, and (2) that the existence of a

non-zero channel rate between y and x does not imply that there is a physical

connection between them. There is a non-zero channel rate between my activity

level in North America and the activity level of a person in Australia, even

though we have never communicated with each other, because we both are likely

to be more active in the daytime than at night.
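The day/night example can be sketched numerically (in Python; the activity probabilities are illustrative assumptions): two activity levels, never in physical contact, both driven by time of day.

```python
import math

def U(p):
    """Uncertainty of a distribution, in bits."""
    return -sum(v * math.log2(v) for v in p if v > 0)

# Both people are more active by day than by night (assumed numbers).
p_active = {"day": 0.9, "night": 0.1}
p_time = {"day": 0.5, "night": 0.5}

# Joint distribution over (my activity, their activity): given the time
# of day the two activities are independent; time is the common driver.
joint = {(m, t): sum(p_time[c]
                     * (p_active[c] if m else 1 - p_active[c])
                     * (p_active[c] if t else 1 - p_active[c])
                     for c in ("day", "night"))
         for m in (0, 1) for t in (0, 1)}

p_mine = {m: joint[(m, 0)] + joint[(m, 1)] for m in (0, 1)}
p_theirs = {t: joint[(0, t)] + joint[(1, t)] for t in (0, 1)}

# Information the other person's activity provides about mine:
# I = U(x) - U(x|y), with U(x|y) averaged over their observed activity.
U_cond = sum(p_theirs[t]
             * U([joint[(m, t)] / p_theirs[t] for m in (0, 1)])
             for t in (0, 1))
info = U(list(p_mine.values())) - U_cond

print(info)  # positive, despite no physical connection between them
```

The information is non-zero (about 0.32 bits here) purely because of the shared dependence on time of day, not because either activity affects the other.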

3. Filters and Functions

If x(t) is the output of a filter or function whose input is y(t), then, apart

from noise, bandwidth, or resolution limits, x(t) is fully determined by y(t),

in the sense that knowing the filter function and y(t) from minus infinity to

t0 would be sufficient to specify x(t0).

U(x(t0)|y(from t= -infinity to t1)) -> 0 as t0-t1 approaches zero.

But physically realizable filters do have limits on all these parameters, so

U(x(t0)|y(from t= -infinity to t1)) has a lower bound greater than zero. The

input to the filter cannot completely specify the output.

The input may have frequency components outside the acceptance bandwidth of

the filter, in which case an infinite number of different possible input

signals would provide the same output signal. The filter may be able to

resolve only down to some finite limit, either because of internal

quantization or because of physically unavoidable noise, in which case again

several different signals might provide the same output signal. Either way,

given the filter output, the uncertainty of the input may be much reduced, but

it is not reduced to zero.
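The bandwidth limit can be demonstrated with a simple moving-average filter (a Python sketch; the signals are arbitrary illustrative sequences): a component outside the filter's passband leaves the output unchanged, so different inputs become indistinguishable.

```python
def moving_average(signal, n=4):
    """Average over a sliding window of n samples: a crude low-pass filter."""
    return [sum(signal[i:i + n]) / n for i in range(len(signal) - n + 1)]

base = [float(i % 8) for i in range(64)]         # slowly varying pattern
alternating = [(-1.0) ** i for i in range(64)]   # fastest possible component

# Two different inputs: the second adds an alternating component, which
# an even-length averaging window removes entirely.
out1 = moving_average(base)
out2 = moving_average([b + a for b, a in zip(base, alternating)])

# Identical outputs from different inputs: given the output, the
# uncertainty of the input is reduced, but not to zero.
assert all(abs(u - v) < 1e-9 for u, v in zip(out1, out2))
```

An infinite family of inputs differing only in such out-of-band components would all yield this same output, which is why the filter output cannot completely specify its input.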


-------------------------

See you in 2 or 3 weeks.

Martin