Uncertainty, information, and mutual information (was Ashby's Law of Requisite Variety)

[Martin Taylor 2012.12.11.11.18]

Yes.

No. The theorem is correct, but the conclusion is not, because your
two equations show that D is not simply minus the integral of O.
O = k(R-P) = k(R - D - integral(O))
O + k*integral(O) = kR - kD
If R = 0,
D = -(integral(O) + O/k)
The correlation between D and O is not identically zero, though the
correlation between O/k and integral(O) is. The correlation between
D and either O or its integral is indeterminate, apart from the fact
that the sum of the squares of the two correlations cannot exceed
1.0.
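To make these relationships concrete, here is a minimal numerical
sketch (mine, not part of the original exchange): a discrete-time
Euler simulation of O = k(R-P), P = D + integral(O) with R = 0 and a
slowly varying, made-up disturbance, checking the three correlations
just mentioned. The step size, gain, and disturbance statistics are
all my assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
dt, k, n = 0.01, 10.0, 200_000

# Slowly varying disturbance: white noise through a leaky integrator (my choice).
noise = rng.standard_normal(n)
D = np.empty(n)
D[0] = 0.0
for t in range(1, n):
    D[t] = 0.999 * D[t - 1] + 0.05 * noise[t]

O = np.empty(n)      # O = k * error, i.e. k*(R - P) with R = 0
intO = np.empty(n)   # running integral of O
intO[0] = 0.0
for t in range(n):
    P = D[t] + intO[t]
    O[t] = k * (0.0 - P)
    if t + 1 < n:
        intO[t + 1] = intO[t] + O[t] * dt

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

print("corr(D, O)    =", corr(D, O))       # not identically zero, but small with good control
print("corr(D, intO) =", corr(D, intO))    # near -1: integral(O) tracks -D
print("corr(O, intO) =", corr(O, intO))    # near 0: a stationary variable vs. its derivative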
The notation here is a little misleading. Typically on CSGnet, we
use “O” to represent the output from the output function. Here you
use it to represent a constant times the error variable, which would
be the input to the integrator output function except for the fact
that the multiplier “k” is usually included as the gain rate of the
integration rather than as an external multiplier. I do note that
you put the integral(O) in the "Environment" as though you mean it to
be in the environmental feedback function, but that makes no
difference to the maths, so I will ignore that and assume that you
intended a bog-standard control system with an integrator output
rather than a proportional controller with an integrator
environmental feedback function. The output of the standard control
system itself is integral(O), which tracks -D pretty well if control
is good.
I’m not sure why you are asking about the relationship between the
disturbance and the error (or using a proportional control system),
because this has not been at issue, but since you ask …
I’ll answer these three questions together at the end of this
message, since they seem to be all the same question using different
boundary conditions. But first, most of this message consists of
some discussion about the nature of “information” and “mutual
information” to be sure we are talking about the same thing.
The key concept is “Uncertainty”, or to use the word Shannon came to
use, “Entropy”. I prefer “Uncertainty”, symbolized “U”, because it
connotes that the probabilities involved are always conditional on
some prior understanding, such as that a variable has a random
Gaussian distribution with specified variance, or that the values
come from a known symbol set. Without that prior understanding, U
would always be infinite, because anything at all could be true.
U is computed using Shannon’s formula U = - Sum over i (pi log(pi)),
where the i represent the possible cases to which
probabilities may be assigned. If “i” is the value of a continuous
variable, the probability is replaced by a probability density
(probability per unit region of the variable value) and the Sum by
Integral. As Shannon noted, the value of the integral depends on the
choice of unit of measure, and can be made as large as one wishes by
making the unit interval indefinitely small.
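As a small illustration of the discrete formula (my example, using
base-2 logs so U comes out in bits):

import numpy as np

def uncertainty(p):
    """Shannon uncertainty U = -sum(p_i * log(p_i)), in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log(0) is taken as 0
    return -np.sum(p * np.log2(p))

print(uncertainty([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 cases: 2 bits (Umax)
print(uncertainty([0.7, 0.1, 0.1, 0.1]))      # non-uniform: about 1.36 bits, less than Umax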
In the case of the communication channels of interest to Shannon,
the receiver has a certain prior understanding, such as that
whatever is received will represent one or more alphabetic letters
and nothing else, or that what is received will be a voltage that
does not exceed some value that would blow out the receiver, or that
what is received will be the sounds of speech in a known language.
The level of prior understanding may be no more than this, or it may
include some expectations that some of the possible receptions are
more likely than others. In other words, the probability
distribution over the possible receivable events may be non-uniform.
The receiver’s uncertainty is maximum if the probability
distribution is uniform, and is less if the probability distribution
is non-uniform: U ≤ Umax.
When the transmitter actually transmits something, be it a voltage, a
symbol, or a dinner invitation, the receiver’s distribution of
probability about what was to be transmitted has changed. Here we
change the wording slightly, replacing “transmitter” and “receiver”
by “observed” and “observer”. The receiver has made an “observation”
of the transmitter. Any observation may change the observer’s probability distribution
about the thing observed. The change in probability distribution
usually means that the uncertainty U has also changed, usually by
decreasing since the observation usually increases the probability
that the observed is in one particular state rather than other
possible states. It is possible that before an observation, the
observer was fairly sure of what would be observed (had a strong
belief), but the observation did not conform to expectations,
leaving the observer’s probability distribution more uniform than
before. In such a case, U is increased by the observation.
Shannon defined “information” as the “change in uncertainty” (though
he used “entropy”):
I = U(before) - U(after)
On average, an observation decreases uncertainty about the observed,
so the information obtained by an observation is usually positive.
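Here is a small sketch of I = U(before) - U(after). The four-state
prior and the likelihoods of the observation are invented for
illustration, and the "after" distribution comes from an ordinary
Bayes update.

import numpy as np

def uncertainty(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Prior probabilities the observer assigns to four possible states of the observed.
before = np.array([0.25, 0.25, 0.25, 0.25])

# Hypothetical likelihood of the actual observation under each state (my assumption).
likelihood = np.array([0.8, 0.1, 0.05, 0.05])

after = before * likelihood
after /= after.sum()                       # Bayes update

print("U(before) =", uncertainty(before))                       # 2 bits
print("U(after)  =", uncertainty(after))                        # about 1.02 bits
print("I =", uncertainty(before) - uncertainty(after))          # positive: uncertainty reduced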
Now we come to “Mutual Information”. Suppose our observer wants to
know something about X, but can observe only Y. Before observing Y,
he has uncertainty Ux(beforeY) about the state of X. After observing
Y sufficiently that the uncertainty about Y is zero, the uncertainty
about X is Ux(afterY). The information such an observer would get
about X from observing Y is Ux(beforeY) - Ux(afterY). This value is
the mutual information between X and Y, which we can label M(X:Y).
As can be shown by working through the probability algebra, M(X:Y)
is symmetric: M(X:Y) = M(Y:X). If X and Y are single-valued
functions of each other (i.e. if you have the value of one, you can
compute the value of the other), then U(X) = U(Y) = M(X:Y).
Of course, there is no need for an actual observer in order to make
these calculations. All you need is the relevant probability
distributions. It often is useful to imagine one place in a network
as “observing” another place in the network, even though there
actually is no “observing act”, just an influence of what happens at
one place on the statistics of the state at another place in the
network. As a control system is a kind of network, we might
occasionally use the metaphor, but it must be remembered “observing”
in this context is just a metaphor.
U(X) = - Sum over i (piX log (piX)) where piX is the probability of the ith
possibility for the state of X
U(Y) = - Sum over i (piY log (piY))
[It’s hard to write all this to be readable in plain text without
Greek or subscripts, but I will try to make the formulae
intelligible by using small letters often as subscripts and capitals
for observables that have a range of possible states. I hope the
context will make clear when they are intended this way. I think it
would be pretty well impossible to read the derivations from
probability distributions and integrals or sums under these
conditions, but they can be found in Shannon and I think lots of
other places].
Uyj(X) is the uncertainty of X if Y has the value yj
Uyj(X) = - Sum over i (pyj,i(X) log (pyj,i(X))), where pyj,i(X) is the
probability that X is in its ith state given Y = yj
The average value over all yj of the uncertainty of X given a value
of y = yj is
Uy(X) = Sum over j (pyj * (Uyj(X)))
The uncertainty Uy(X) represents how much is NOT known about X if
it is known that Y has a particular value, without specifying what
that value of Y might be. If X is a single valued function of Y,
then if the value of Y is known (pyj = 1 for some value of j and 0
for all other values), the value of X is also known, and Uy(X) = 0.
Uy(X) is the complement of the mutual information, in the sense that
U(X) = Uy(X) + M(Y:X)
U(Y) = Ux(Y) + M(X:Y)
It is often useful conceptually to write these equations with M as
the dependent variable.
M(X:Y) = U(X) - Uy(X) = U(Y) - Ux(Y)
If one considers X and Y to be a single composite variable (X,Y)
with states denoted (xi, yj), the uncertainty of this joint state is
the usual
U(X,Y) = - Sum over i, j (p(i,j) log p(i,j))
Working through the probability algebra gives
U(X,Y) = U(X) + U(Y) - M(X:Y)
From this equation we see that if X and Y are single-valued
functions of each other, U(X,Y) = U(X) = U(Y) = M(X:Y).
If Y is a single-valued function of X but X is not a single-valued
function of Y (as would be the case if X were a function of Y and
some other variables), then U(X) > U(Y) = M(X:Y). More typically,
when two variables seem related, both variables have other
influences, or both are separately influenced by partially
overlapping sets of influences. In that case, U(X) and U(Y) have no
necessary relation to each other, and M(X:Y) is no greater than the
lesser of the two.
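A short sketch that works through these identities numerically,
starting from a made-up joint probability table p(xi, yj):

import numpy as np

def U(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Made-up joint distribution p(xi, yj): rows are values of X, columns values of Y.
pxy = np.array([[0.30, 0.10],
                [0.05, 0.25],
                [0.05, 0.25]])

px = pxy.sum(axis=1)          # marginal of X
py = pxy.sum(axis=0)          # marginal of Y

# Conditional uncertainty Uy(X): average over j of U(X | Y = yj), and likewise Ux(Y).
Uy_X = sum(py[j] * U(pxy[:, j] / py[j]) for j in range(len(py)))
Ux_Y = sum(px[i] * U(pxy[i, :] / px[i]) for i in range(len(px)))

M1 = U(px) - Uy_X             # M(X:Y) = U(X) - Uy(X)
M2 = U(py) - Ux_Y             # M(X:Y) = U(Y) - Ux(Y)
print(M1, M2)                 # the two forms agree (symmetry of mutual information)

# Joint uncertainty identity: U(X,Y) = U(X) + U(Y) - M(X:Y)
print(U(pxy), U(px) + U(py) - M1)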
If X influences Y directly, the existence of non-zero M(X:Y) can
easily be seen as the transfer of information from X to Y, but if a
non-zero M(X:Y) occurs without direct influence between them, there
must be a common influence on both. If one wants to see mutual
information as the consequence of an information flow – and it is
often helpful to do so – then the flow of information would be from
the common influence to both X and Y independently.
I hope this answers the first part of Richard’s Question1, which
asks what measure of mutual information is being used. The second
part of Question 1 has to wait until we look at the uncertainty
relationships among three variables, since in the equation D = -
(integral(O) + O/k), D is a function of two linearly uncorrelated
variables.
Richard’s questions 2 and 3 also involve three variables, so for
both reasons we have to extend the analysis to cover X, Y, and Z.
The total uncertainty of the three variables is
U(X, Y, Z) = - Sum over i, j, k (p(xi, yj, zk) log p(xi, yj, zk)).
First, we can take the joint variable (X,Y) to be a single variable
W whose values are the combinations of x and y values; the pair
(x1, y2) is treated as the single value w1,2. In that case the analysis for W and Z is as
above for X and Y. Usually, however, we want to keep separate the
relationships of X and Z, and Y and Z, and then we must consider
also any relationship between X and Y.
First, we keep the condition that X and Y are independent. That
doesn’t mean that their effects on Z are independent. Suppose, for
example, that Z = XY. The effect of X changing from 1 to 2 is
vastly different if Y=1 or if Y=100. When we treated each
combination of x and y as a unique value w, this didn’t matter, but
when we are teasing out the uncertainty relations among the three
variables, it does matter.
Most of the next section is taken straight from the Garner and
McGill paper I attached to a previous message, so I will not go into
detail. I will just cite a few equations, some of which might help
answer Richard’s questions.
M(Z:X,Y) = U(Z) - Ux,y(Z) (The mutual information between Z and
individual combinations of X and Y values is the uncertainty of Z
before observing X or Y, less its uncertainty if the values of X and Y
are known).
M(Z:X) = U(Z) - Ux(Z) (the Uncertainty of Z less the uncertainty of
Z if you know the value of X)
My(Z:X) = Uy(Z) - Ux,y(Z) (the same if y is known, averaged over
the different possible values of y)
It’s worth pausing a moment here, to think again about the example
Z = XY. If you just know the value of x, you have hardly affected how
much you know about Z. If x=1, Z is distributed as Y. If x = 100, Z
is distributed as 100Y. Your probabilities for small values of Z
are greater if x = 1 than they are if x = 100, but the uncertainty
of Z is in both cases equal to U(Y). You haven’t learned very much
about Z by observing X, though you have learned something; M(Z:X) ~
0.
The situation is very different for the second equation above. If y
is known, whatever its value, as soon as you also observe X to have
the value x, you know the value of Z = XY exactly. This means that
Ux,y(Z) = 0 and My(Z:X) = Uy(Z). Knowing Y, X contains all the
information about Z that there is to know; My(Z:X) ~ min(U(X),
U(Z)).
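Here is a small enumeration of the Z = XY example. The value sets
for X and Y are my choice, picked so that different (x, y) pairs can
give the same product; the exact numbers would differ for other
choices (and for continuous variables), but the pattern matches the
argument above: M(Z:X) is small, while My(Z:X) equals Uy(Z).

from collections import Counter
from itertools import product
from math import log2

def U(probs):
    """Shannon uncertainty in bits over an iterable of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

# X and Y independent and uniform over small made-up value sets.
xs, ys = [1, 2], [1, 2, 4, 8]
px = {x: 1 / len(xs) for x in xs}
py = {y: 1 / len(ys) for y in ys}

pz = Counter()                                  # marginal p(z)
pz_given_x = {x: Counter() for x in xs}         # p(z | x)
pz_given_y = {y: Counter() for y in ys}         # p(z | y)
for x, y in product(xs, ys):
    z = x * y
    pz[z] += px[x] * py[y]
    pz_given_x[x][z] += py[y]
    pz_given_y[y][z] += px[x]

Ux_Z = sum(px[x] * U(pz_given_x[x].values()) for x in xs)
Uy_Z = sum(py[y] * U(pz_given_y[y].values()) for y in ys)

print("M(Z:X)  = U(Z) - Ux(Z) =", U(pz.values()) - Ux_Z)   # 0.25 bit: X alone says little about Z
print("My(Z:X) = Uy(Z) - 0    =", Uy_Z)                    # 1 bit = U(X): given Y, X determines Z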
In a control system, there are always two independent input
variables, the reference R and the disturbance D. M(R:D) = 0, so if
there are no other external influences and no system noise, the last
pair of equations above are likely to be the ones of interest when
analyzing the information relationships of the circuit. At this
point, therefore, we should be in a position to answer all three of
Richard’s questions, repeated here to save you going back to look
them up.
[“This way” refers to the control system Richard defined, and the
“two variables” are the disturbance waveform and the error waveform
multiplied by a constant k.]
I think that the above discussion answers the direct questions about
what the measures are (they are the same in all three questions,
with the addition of R in the third), but the implied questions are
about the values of the measures, so I will try to answer those
implied questions.

  1. M(O:D) = U(O) - Ud(O) = U(D) - Uo(D) (Remember “O” is k*error,
    not the usual output from an integrating output function)
    If R = 0, we have D = -(integral(O) + O/k)
    The two components on the right-hand side are linearly
    uncorrelated, and it is impossible to reduce the uncertainty of
    either by knowing the current value of the other, so their current
    values are informationally independent as well. There is no way,
    from this statement, to get an answer to the value of M(O:D), in
    general, but we can limit it to the lesser of U(O) and U(D).
    Typically, what one wants is not an absolute measure, but some
    kind of rate function if the system is statistically stationary
    (i.e. it doesn’t matter when you sample, the measures will be taken
    from the same distribution) or some kind of a time function if it
    isn’t (as is the case when the disturbance waveform is a step
    function). Richard’s question seems to suppose a statistically
    stationary system, in which the disturbance waveform has a low
    bit-rate so that integral(O) ~ -D. In that case, M(D:integral(O)) is
    very near U(D) or U(integral(O)), which means M(D:O) is very near
    zero.
  2. In case 2, M(D:integral(O)) is near zero, so the upper bound on
    M(D:O) is near U(D). Since the question asserts that D changes
    quickly enough that it is highly correlated with O, M(D:O) will
    actually be near U(D). As the rate of change of D decreases, M(D:O)
    decreases while M(D:integral(O)) increases toward the situation in
    question 1.
  3. Let’s examine the assertion that O correlates with R when
    everything moves slowly. The equations for the system are given as:
    Controller: O = k(R-P)
    Environment: P = D + integral(O)
    O = kR - kP = kR - kD - k*integral(O)
    O + k*integral(O) = kR - kD
    O/k + integral(O) = R - D
    Unless I have done something really silly with the equations, it
    looks as though R and D have exactly the same relationships with O
    and integral O apart from the minus sign. They will have the same
    correlations and the same mutual information. Intuitively, if the
    controller is controlling well, the error (O/k) will be very small,
    and since it is influenced by variations in both the disturbance and
    the reference equally, there seems to be no intuitive reason why it
    should be more correlated with one than the other.
    To answer the question as posed, the measures of two-way mutual
    information are:
    M(R:D) = 0
    M(R:O) = M(D:O)
    The implied question asks about the values of these last two mutual
    information measures under conditions of good control. We know from
    question 1 that if R remains static, M(D:O) is near zero. The
    addition of variability in R means that M(D:O) is even smaller.
    Since the relationships of R and D to O are the same, M(R:O) will
    also be near zero.
    I know there is lots more to be said about multiple mutual
    information and the relation between changing mutual information
    relationships and apparent information flow, but I think this is
    long enough and dense enough to be going on with. I hope Richard
    feels that I have answered his questions. Since the message to which
    this is a response asked “How would the participants in this
    discussion handle some questions about the following system?” I look
    forward to seeing some of the other answers.
    Martin
···

I think this thread has gone a long way from Ashby, so I changed
the subject line.

  [From Richard Kennaway (2012.12.10 16:54 GMT)]




  How would the participants in this discussion handle some

questions about the following system?

      Controller:   O = k(R-P)


      Environment:  P = D + integral(O)




  When the controller controls well and the reference is constant,

then the perception will also be nearly constant. D and the
integral of O will be equal and opposite.

  However, this implies zero correlation between D and O.  (Because

of this theorem: a bounded variable and its first derivative have
zero correlation.)

  (1)  What definition of mutual information are you using, and what

answer does it give when applied to two variables related in this
way?

  (2)  If D varies much faster than the controller can handle (i.e.

on a timescale much smaller than 1/k), then P will closely track
D. O always tracks P (when R is constant) and there will
therefore be a high correlation between D and O. What is the
measure of mutual information in this case?

  (3)  When R and D vary randomly and independently, both on

timescales much longer than 1/k (so the controller controls well),
it will be found that O correlates well with R and not at all with
D, while the integral of O correlates well with D and not at all
with R. What are the measures of mutual information among O, R,
and D?
