A wasp-bitten elephant

[Martin Taylor 960707 18:20]

I'm beginning to think that this talk of model-based control, models and
non-models, neural networks and models, real-time versus model-based
control..., is just another case of the blind men and the elephant. Maybe
it isn't, but bear with me for a (long) moment.

Some days ago, I mentioned to Shannon the notion of the "wasp-waisted"
multi-layer perceptron (MLP). In that message I used it just as an
example of a neural net that learns without a teacher. But now I think
it can be used to make a deeper contribution, so first I should explain
what it is and does.

First a sketch of a "classic" MLP that needs a teacher. In this drawing,
each "X" represents one node that has inputs from many, perhaps all nodes
at the level below (or if it is a bottom level node, inputs from many,
perhaps all sensors). The top level outputs to the environment, and is
the only place where the actions of the MLP are observable. Continuously
changing input patterns (or a succession of discrete ones, but I'll deal
with continuous ones) arrive at the bottom, and continuously changing
output patterns are output to the world at the top. I've selected the
numbers of nodes in each layer more or less randomly.

Figure 1         XXXXX
               XXXXXXXXX
              XXXXXXXXXXX
            XXXXXXXXXXXXXX

In this MLP there are 4 layers, or two "hidden layers". The MLP is "taught"
by comparing the output at any given moment with the "desired" output--
desired, that is, by a teacher. The differences are propagated backward
through the network in the form of changes of weighting of the connections
between the layers. The changes are small, and of the kind that would
reduce the deviation between desired and obtained output. Eventually,
if the desired input-to-output mapping is reasonable and the input patterns
are not too complex for the size of network, the outputs come to match the
desired outputs for almost all input patterns.
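The teaching procedure can be sketched in a few lines of code. This is just a toy illustration of mine (the layer sizes, learning rate, and the random input-to-output mapping are all arbitrary choices), not a serious implementation:

```python
import numpy as np

# Toy teacher-trained MLP: output errors are propagated backward as small
# weight changes that reduce the deviation from the desired output.
rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Four layers (two hidden); layer sizes chosen more or less randomly.
sizes = [6, 5, 4, 6]
W = [rng.normal(0, 0.5, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]

def forward(x):
    acts = [x]
    for w in W:
        acts.append(sigmoid(w @ acts[-1]))
    return acts

def train_step(x, target, lr=0.2):
    acts = forward(x)
    # Error at the top: (obtained - desired), scaled by the sigmoid slope.
    delta = (acts[-1] - target) * acts[-1] * (1 - acts[-1])
    for i in reversed(range(len(W))):
        grad = np.outer(delta, acts[i])
        delta = (W[i].T @ delta) * acts[i] * (1 - acts[i])
        W[i] -= lr * grad

# An arbitrary input-to-output mapping desired by the "teacher".
X = rng.random((8, 6))
T = rng.random((8, 6)) * 0.8 + 0.1      # keep targets inside sigmoid range

def total_error():
    return sum(np.sum((forward(x)[-1] - t) ** 2) for x, t in zip(X, T))

before = total_error()
for _ in range(2000):
    for x, t in zip(X, T):
        train_step(x, t)
after = total_error()
```

After enough passes over the patterns, the total deviation between obtained and desired outputs has shrunk, which is all "teaching" means here.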

So far, so good. Such networks have been heavily studied and their abilities
and limitations are reasonably well known. I'm not going to go into that
here, since the same arguments apply no matter what form of neural
network is used as the example. MLPs are simple.

There are two things to note: (1) the number of nodes in the hidden
layers matters. Too few and the "problem" cannot be solved, too
many and it takes forever and a day for the net to settle; (2) the
number of hidden layers matters. An MLP with more than three hidden layers
can "solve" any partitioning of the input data space, in the sense that
if the teacher desires a "classification" of the input, it doesn't matter
whether the regions called "A" are convex, interlaced with "non-A" patterns
or even scattered all over the input space. Of course, remembering (1),
there must be enough nodes in each hidden layer to deal with the complexity
of the space, but given that, any partitioning problem is soluble with
three hidden layers. (I think it's three. It might be two or four, and
I can't find my copy of Lippmann's paper with the proof).
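As a toy illustration of why hidden layers matter at all, take the classic XOR labelling: the two "A" points are diagonally interlaced with the two "non-A" points, so no network without a hidden layer can produce that partitioning, but a single hidden layer with hand-chosen weights (my choices, for illustration) already can:

```python
import numpy as np

# Hand-chosen weights solving XOR with one hidden layer: the hidden units
# compute OR-like and AND-like thresholds, and the output fires for
# "OR but not AND" -- the interlaced partitioning a single layer cannot do.

def step_fn(z):
    return (z > 0).astype(float)

W_hidden = np.array([[1.0, 1.0],     # OR-like unit (threshold 0.5)
                     [1.0, 1.0]])    # AND-like unit (threshold 1.5)
b_hidden = np.array([-0.5, -1.5])
w_out = np.array([1.0, -2.0])        # fire on OR, veto on AND
b_out = -0.5

def xor_net(x):
    h = step_fn(W_hidden @ np.asarray(x, dtype=float) + b_hidden)
    return float(step_fn(w_out @ h + b_out))

outputs = [xor_net(p) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]]
# outputs == [0.0, 1.0, 1.0, 0.0]
```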

Now let's introduce the wasp-waisted network. It is a "classic" MLP of
a particular configuration:

        Figure 2      XXXXXXXXXXXX
                        XXXXXXXX
                           XXXX
                        XXXXXXXX
                      XXXXXXXXXXXX

There are exactly as many nodes in the top layer as in the bottom, and
fewer in the intervening layers. The limitation on what it can learn is
obviously imposed by the narrow wasp-waist, where the pattern of weights
must be sufficient to permit the desired mapping of output to input.

Now comes the trick. Instead of a teacher who provides the "correct"
output for each input, the network is provided with its own input as
the teacher. For each input, it is asked to produce the identical pattern
as its output. If the wasp-waist were not there, the problem would be trivial,
since each input node could be connected directly to an output node, perhaps
through a series of intermediary nodes.

But the wasp-waist prohibits this trivial solution. The input patterns
have to be _represented_ by patterns of signals output by the nodes at
the narrowest level. This representation must be efficient, the more so
the narrower the waist. The only way it can do this is to take account
of regularities in the input patterns, and that is what actually happens in
such networks. If there's a regular stripe pattern, for example, it might
be that one of the nodes outputs something an analyst might call "stripe
spatial frequency", another "stripe orientation," and a third "stripe
phase." Together, these three values could permit any input pattern of
regular stripes to be reproduced at the output, no matter how many pixels
of input and output were involved in the pattern.
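In more recent terminology this wasp-waisted arrangement is an auto-associative network (an "autoencoder"). A linear toy version of my own, with data that secretly vary in only two dimensions so that a two-node waist can represent them, shows the self-teaching at work (all sizes and rates are arbitrary choices):

```python
import numpy as np

# Toy wasp-waisted net, taught with its own input as the teacher. The
# 200 input patterns secretly vary in only 2 dimensions, so a 2-node
# waist can learn to represent them; a linear net keeps it simple.
rng = np.random.default_rng(1)

n, k = 6, 2                                  # input width, waist width
basis = rng.normal(size=(n, k))              # hidden regularity in the data
X = (basis @ rng.normal(size=(k, 200))).T    # 200 patterns, each n-wide

W_enc = rng.normal(0, 0.1, (k, n))           # bottom half: input -> waist
W_dec = rng.normal(0, 0.1, (n, k))           # top half: waist -> output

def recon_error():
    return np.mean((X @ W_enc.T @ W_dec.T - X) ** 2)

lr = 0.01
before = recon_error()
for _ in range(500):
    code = X @ W_enc.T                # waist signals
    err = code @ W_dec.T - X          # output minus "teacher" (the input)
    W_dec -= lr * err.T @ code / len(X)
    W_enc -= lr * (err @ W_dec).T @ X / len(X)
after = recon_error()
```

The net can only drive the reconstruction error down by discovering the two-dimensional regularity and routing it through the waist.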

The wasp waist "encodes" the input regularities, and any variations in the
input that are not encoded at the wasp waist cannot be reproduced in the
output.

So now we have a situation in which an input function I(x1,...,xn) is
represented in a wasp-waist output function W(w1,...,wk) where k<<n. The
wasp-waist function is a non-linear projection from the high-dimensional
input space to a much lower-dimensional representation (or model) space.
The pattern of stripes is "modelled" explicitly as "3 stripes per inch,
at 30 degrees left of vertical, with the first black edge at 0.03 inches
from centre". It would be conceivable to tap in to the signals output
from the wasp-waist nodes, and use them to generate just such linguistic
representations of the input, if that turned out to be the representation
that was learned.

If we split apart the wasp-waisted MLP at the waist, and remove the bottom
half, and then we inject signals representing "3 stripes per inch,
at 30 degrees left of vertical, with the first black edge at 0.03 inches
from centre" the output at the top will be just such a set of stripes.
The "model" representation is complete, so far as it goes. "So far
as it goes" is to ignore things such as possible irregularities in the
stripes, or curvatures, and stuff like that. To represent that kind of
thing would have required a different coding to have been learned at the
wasp-waist, perhaps at the cost of losing something in the representation
of regular stripes.
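To make the split-network idea concrete, here is a hand-written stand-in for the learned top half: a decoder that turns the three waist values (stripe frequency, orientation, phase) back into a full stripe image, at any pixel count. The function and its parameter names are my own invention, for illustration only:

```python
import numpy as np

# A hand-built "top half": three waist values suffice to regenerate a
# regular stripe pattern, no matter how many output pixels there are.

def stripes(freq, angle_deg, phase, size=32):
    """Render a regular stripe pattern from its 3-value 'model'."""
    theta = np.radians(angle_deg)
    y, x = np.mgrid[0:size, 0:size] / size
    # Distance along the stripe normal, then a square wave over it.
    d = x * np.cos(theta) + y * np.sin(theta)
    return (np.sin(2 * np.pi * freq * d + phase) > 0).astype(float)

img = stripes(freq=3, angle_deg=30, phase=0.1)
# The same 3 numbers reproduce the pattern at any resolution:
big = stripes(freq=3, angle_deg=30, phase=0.1, size=128)
```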

The split wasp-waisted MLP can be reconnected and redrawn like this, with
the upper half turned upside down, so that both input and output of the
whole net are at the bottom:

Figure 3       ---->-----
              |          |
            XXXX        XXXX
           XXXXXX      XXXXXX
          XXXXXXXX    XXXXXXXX
          ^^^^^^^^    ||||||||
          ^^^^^^^^    VVVVVVVV

It's the same network as before, but perhaps now its implications may be
a little different when drawn this way. The environment is now at the
bottom of both pyramids, and the input side and the output side are
kept separate.

Let's change the training a little, and instead of asking the network to
produce at its output a reproduction of its input, we ask it to produce the
negative of its input. That's just as easy, so why not?

And as a final step, let's connect the outputs directly back to the
corresponding inputs.

Figure 4       ---->-----
              |          |
            XXXX        XXXX
           XXXXXX      XXXXXX
          XXXXXXXX    XXXXXXXX
          ^^^^^^^^    ||||||||
          ^^^^^^^^    VVVVVVVV
              |          |
               ----<-----

Now, whatever is applied at the input is annulled by the reconnected
output. If there is some other place from which input is derived, its
effects are "immediately" countered. We have a control system that
nulls out the influence of disturbances. (But remember that word
"immediately" because we will return to it later.)

What kind of a control system is it? It has multiple scalar inputs
and multiple corresponding scalar outputs. Internally, it has scalar
signals running around. But nowhere is there anything corresponding to an
"Elementary Control Unit". A simple MLP has produced a model representation
of the world as its "perception" and used that "model" to generate the
output that opposes the external input (disturbance) _pattern_, using the
regularities (redundancies) in the input patterns to which it was exposed
while it was learning. But it won't effectively oppose patterns of different
kinds that it did not see while learning.
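A toy version of this, in which a projector onto the learned low-dimensional subspace stands in for the trained network, shows both sides at once: disturbances of the learned kind are annulled, disturbances of a novel kind are not. The construction is mine, not part of the original setup:

```python
import numpy as np

# Stand in for the trained wasp-waist net with a projector P onto the
# low-dimensional subspace it learned. The net emits the negative of its
# reconstruction, and that output is fed straight back to the input.
rng = np.random.default_rng(2)

n, k = 8, 2
B, _ = np.linalg.qr(rng.normal(size=(n, k)))   # learned pattern family
P = B @ B.T                                    # encode-then-decode at the waist

def residual(disturbance):
    output = -P @ disturbance      # negated reconstruction, fed back
    return disturbance + output    # what survives at the input

familiar = B @ rng.normal(size=k)  # a disturbance of the learned kind
novel = rng.normal(size=n)         # a kind never seen while learning

r_familiar = np.linalg.norm(residual(familiar))
r_novel = np.linalg.norm(residual(novel))
```

The familiar disturbance is nulled essentially to zero; most of the novel one passes straight through.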

Now let's talk about "immediately." There is always some delay through
every node, whether the network is simulated in a computer or exists in
a real analogue world. So the output cannot match or oppose an input that
is changing irregularly faster than the time it takes signals to pass
through from input to output. However, if there are regularities over
time in the input, it is possible that the network could learn to produce
the appropriate output, more or less, despite the throughput delay. It
would seem to predict, but it is a strange kind of "prediction," which
does not occur explicitly anywhere in the network.
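A small numerical sketch of this point (all parameters are arbitrary choices of mine): the same negative-feedback loop, with a few steps of transport delay added, nulls a slowly varying disturbance far better than one that changes on a time scale comparable to the delay:

```python
import numpy as np

# A negative-feedback loop with a transport delay. The output is a slow
# integration of the negated input (a crude stand-in for the learned
# negation), which keeps the loop stable despite the delay.

def squared_error(freq, steps=4000, delay=10, gain=0.05):
    out = np.zeros(steps)
    o, err_sq = 0.0, 0.0
    for t in range(steps):
        d = np.sin(2 * np.pi * freq * t / steps)     # the disturbance
        e = d + out[t - delay] if t >= delay else d  # input + delayed output
        o += gain * (-e)
        out[t] = o
        if t > steps // 2:                           # measure after settling
            err_sq += e * e
    return err_sq

slow = squared_error(freq=2)     # period much longer than the delay
fast = squared_error(freq=200)   # period comparable to the delay
```

The residual error for the fast disturbance is far larger: the loop cannot oppose what changes faster than its own throughput time, unless it has learned some temporal regularity to exploit.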

Having seen that the wasp-waist network can turn into a strange kind of
control system, let's vary it some more. Now we will fix the upper half
of the original network in an arbitrary configuration that does not
participate in the learning procedure. It is what it is, and there's an
end to it.

Figure 5     YYYYYYYYYYYY
               YYYYYYYY        (fixed and "mysterious")
                 YYYY
                 XXXX
               XXXXXXXX        (adaptable and able to learn)
             XXXXXXXXXXXX

The MLP that is left is a pyramid with a wide base and a narrow
top, but the learning job is still to make the output (at the wide top of
an inverted pyramid of "Y" symbols) match the inverse of the input. This
is obviously possible if the "Y" connections are appropriate, meaning they
permit the existence of an encoding that can be decoded back into
the input. The "X" weights can learn through a standard back-propagation
algorithm, even if the "Y" weights don't change, and what they must learn
is hardly any more difficult than before, provided that the "Y" connections
are such that a "model" _can_ exist at the "cut-mark" waist.
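A linear toy of this variant (my own construction): the "Y" weights are frozen at random values, the inputs are built to lie in a subspace the fixed Y half can actually reach, and only the "X" weights learn, by plain gradient descent, to make the overall output equal the negative of the input:

```python
import numpy as np

# The "Y" half is frozen and never updated; only the "X" (encoder)
# weights learn, so the whole output comes to match the negative of the
# input. The data lie in a subspace the fixed Y half can express.
rng = np.random.default_rng(3)

n, k = 6, 2
Y = rng.normal(size=(n, k))               # fixed and "mysterious"
X = (Y @ rng.normal(size=(k, 100))).T     # inputs the Y half can express

W = rng.normal(0, 0.1, (k, n))            # the only learnable weights

def loss():
    # Overall output is Y @ W @ x; the target is -x.
    return np.mean((X @ W.T @ Y.T + X) ** 2)

lr = 0.001
before = loss()
for _ in range(2000):
    E = X @ W.T @ Y.T + X                 # output minus target (-input)
    W -= lr * (E @ Y).T @ X / len(X)
after = loss()
```

The error shrinks even though the Y weights never move: all the adaptation happens below the waist.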

Now we can re-invert the top part, and even go further than in Figure 4,
as in the next picture. Again the X part learns to compensate for the
relationship between its output and the effects of the Y part on its
own inputs.

Figure 6        ---->------
               |           |
             XXXX          |
           XXXXXXXX        |   (adaptable and able to learn)
         XXXXXXXXXXXX      |
         ^^^^^^^^^^^^      V
         YYYYYYYYYYYY      |
           YYYYYYYY        |   (Unknowable "world")
             YYYY          |
               |           |
                ----<------

These "fixed" Y connections might be in the outer world completely or in part,
so far as the X (learning) part of the network is concerned. The way the
wasp-waist "model" is constructed depends on the way that the X outputs
affect the Y outputs that affect the X outputs...

Why should the Y pyramid not _be_ the outer world, and the top XXXX be the
muscular effectors that act, in a low-dimensional way, on this high-dimensional
world? Or perhaps some of the Y part is output musculature, and some of it
is in the outer world.

Is this "model-based" control, "neural network" control, or what?

This picture lacks several things. It is just a waspish view of the control
elephant. There is no "reference" input pattern, and no "disturbance,"
though we have mentioned them. And, perhaps most importantly, we haven't
even considered hierarchy. Because what we will arrive at in the end is
the normal HPCT hierarchy, or something very like it, even though at
this point we seem to have gone far from it.

This message is long enough to be going on with, and I should be going
home. I hope that it is enough to point up a direction through the fog,
and to at least indicate how a scalar control hierarchy can look sometimes
like a perceptually model-based complex one-level control system, sometimes
like an output-modelled (Artificial Cerebellum) system, sometimes like
Shannon's neural-net-based system, and that these ways of looking are
not mutually inconsistent. They are views of the same elephant.