[Martin Taylor 960707 18:20]

I'm beginning to think that this talk of model-based control, models and
non-models, neural networks and models, real-time versus model-based
control..., is just another case of the blind men and the elephant. Maybe
it isn't, but bear with me for a (long) moment.

Some days ago, I mentioned to Shannon the notion of the "wasp-waisted"
multi-layer perceptron (MLP). In that message I used it just as an
example of a neural net that learns without a teacher. But now I think
it can be used to make a deeper contribution, so first I should explain
what it is and does.

First, a sketch of a "classic" MLP that needs a teacher. In this drawing,
each "X" represents one node that has inputs from many, perhaps all, nodes
at the level below (or, if it is a bottom-level node, inputs from many,
perhaps all, sensors). The top level outputs to the environment, and is
the only place where the actions of the MLP are observable. Continuously
changing input patterns (or a succession of discrete ones, but I'll deal
with continuous ones) arrive at the bottom, and continuously changing
output patterns are output to the world at the top. I've selected the
numbers of nodes in each layer more or less randomly.

Figure 1       XXXXXXXXXXXXXX
             XXXXXXXXXXXXXXXXXX
               XXXXXXXXXXXXXX
            XXXXXXXXXXXXXXXXXXXX

In this MLP there are 4 layers, or two "hidden layers". The MLP is "taught"
by comparing the output at any given moment with the "desired" output--
desired, that is, by a teacher. The differences are propagated backward
through the network in the form of changes of weighting of the connections
between the layers. The changes are small, and of the kind that would
reduce the deviation between desired and obtained output. Eventually,
if the desired input-to-output mapping is reasonable and the input patterns
are not too complex for the size of network, the outputs come to match the
desired outputs for almost all input patterns.
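That teacher-driven procedure can be sketched in a few lines of code. This is
my own toy illustration, not anything from the post or the literature; the
layer sizes, learning rate, and the use of NumPy are all arbitrary choices.

```python
# Toy sketch of teacher-driven MLP learning: compare the output with the
# teacher's desired output and propagate the differences backward as
# small weight changes.  All sizes and constants here are arbitrary.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# 8-dimensional input patterns; the teacher supplies 3-dimensional targets.
X = rng.uniform(-1, 1, size=(32, 8))
T = sigmoid(X @ rng.uniform(-1, 1, size=(8, 3)))

W1 = rng.normal(0, 0.5, size=(8, 5))   # input layer -> one hidden layer
W2 = rng.normal(0, 0.5, size=(5, 3))   # hidden layer -> output layer

lr = 0.5
losses = []
for _ in range(500):
    H = sigmoid(X @ W1)
    Y = sigmoid(H @ W2)
    err = Y - T                        # deviation from the desired output
    losses.append(float(np.mean(err ** 2)))
    # Back-propagation: small weight changes that reduce the deviation.
    dY = err * Y * (1 - Y)
    dH = (dY @ W2.T) * H * (1 - H)
    W2 -= lr * H.T @ dY / len(X)
    W1 -= lr * X.T @ dH / len(X)

print(f"mean squared deviation: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The deviation shrinks step by step; whether it ends up "almost matching" for
all patterns depends, as the text says, on the mapping and the network size.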

So far, so good. Such networks have been heavily studied and their abilities
and limitations are reasonably well known. I'm not going to go into that
here, since the same arguments can be used no matter what form of neural
network is used as an example. MLPs are simple.

There are two things to note: (1) the number of nodes in the hidden
layers matters. Too few and the "problem" cannot be solved; too many and
it takes forever and a day for the net to settle. (2) The number of hidden
layers matters. An MLP with three hidden layers can "solve" any
partitioning of the input data space, in the sense that if the teacher
desires a "classification" of the input, it doesn't matter whether the
regions called "A" are convex, interlaced with "non-A" patterns, or even
scattered all over the input space. Of course, remembering (1), there
must be enough nodes in each hidden layer to deal with the complexity
of the space, but given that, any partitioning problem is soluble with
three hidden layers. (I think it's three. It might be two or four, and
I can't find my copy of Lippmann's paper with the proof.)

Now let's introduce the wasp-waisted network. It is a "classic" MLP of
a particular configuration:

Figure 2    XXXXXXXXXXXX
              XXXXXXXX
                XXXX
                XXXX
              XXXXXXXX
            XXXXXXXXXXXX

There are exactly as many nodes in the top layer as in the bottom, and
fewer in the intervening layers. The limitation on what it can learn is
obviously imposed by the narrow wasp-waist, where the pattern of weights
must be sufficient to permit the desired mapping of output to input.

Now comes the trick. Instead of a teacher who provides the "correct"
output for each input, the network is provided with its own input as
the teacher. For each input, it is asked to produce the identical pattern
as its output. If the wasp-waist were not there, the problem would be
trivial, since each input node could be connected directly to an output
node, perhaps through a series of intermediary nodes.

But the wasp-waist prohibits this trivial solution. The input patterns
have to be _represented_ by patterns of signals output by the nodes at
the narrowest level. This representation must be efficient, the more so
the narrower the waist. The only way the network can do this is to take
account of regularities in the input patterns, and that is what actually
happens in such networks. If there's a regular stripe pattern, for
example, it might be that one of the nodes outputs something an analyst
might call "stripe spatial frequency", another "stripe orientation," and
a third "stripe phase." Together, these three values could permit any
input pattern of regular stripes to be reproduced at the output, no
matter how many pixels of input and output were involved in the pattern.

The wasp waist "encodes" the input regularities, and any variations in the
input that are not encoded at the wasp waist cannot be reproduced in the
output.
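The self-teaching wasp-waist can be sketched directly. This is my own
construction, not code from the post; I use linear layers (the post's MLP is
nonlinear) so that the training is simple and reliable, and I build inputs
whose twelve values are secretly driven by only three underlying quantities,
the analogue of frequency/orientation/phase.

```python
# Sketch of the wasp-waisted idea: the network is given its own input as
# the teacher, and the narrow waist forces it to encode the input's
# regularities.  Linear layers are my simplification of the MLP.
import numpy as np

rng = np.random.default_rng(1)

n, k = 12, 3                  # wide input/output layers, narrow waist
# Inputs with hidden regularity: each 12-value pattern is generated from
# only 3 latent values.
latent = rng.normal(size=(200, k))
mix = rng.normal(size=(k, n))
X = latent @ mix

W_enc = rng.normal(0, 0.1, size=(n, k))   # bottom half: input -> waist
W_dec = rng.normal(0, 0.1, size=(k, n))   # top half: waist -> output

lr = 0.01
for _ in range(3000):
    waist = X @ W_enc                     # k signals at the wasp-waist
    Y = waist @ W_dec                     # attempted reproduction of X
    err = Y - X                           # the input itself is the teacher
    g_dec = waist.T @ err / len(X)
    g_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

recon_err = float(np.mean((X @ W_enc @ W_dec - X) ** 2))
print(f"reconstruction MSE through a {k}-node waist: {recon_err:.6f}")
```

Because the patterns really have only three degrees of freedom, the
three-node waist is enough, and the reconstruction error falls toward zero.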

So now we have a situation in which an input function I(x1,...,xn) is
represented in a wasp-waist output function W(w1,...,wk), where k<<n. The
wasp-waist function is a non-linear projection from the high-dimensional
input space to a much lower-dimensional representation (or model) space.
The pattern of stripes is "modelled" explicitly as "3 stripes per inch,
at 30 degrees left of vertical, with the first black edge at 0.03 inches
from centre". It would be conceivable to tap into the signals output
from the wasp-waist nodes, and use them to generate just such linguistic
representations of the input, if that turned out to be the representation
that was learned.

If we split apart the wasp-waisted MLP at the waist, remove the bottom
half, and then inject signals representing "3 stripes per inch, at 30
degrees left of vertical, with the first black edge at 0.03 inches from
centre", the output at the top will be just such a set of stripes. The
"model" representation is complete, so far as it goes. "So far as it
goes" is to ignore things such as possible irregularities in the
stripes, or curvatures, and stuff like that. To represent that kind of
thing would have required a different coding to have been learned at the
wasp-waist, perhaps at the cost of losing something in the representation
of regular stripes.
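Both points, injecting a code into the top half alone and the loss of
unencoded variation, can be demonstrated in a small sketch. This is my own
construction: I use the best linear k-dimensional code, obtained by SVD,
as a stand-in for the weights a linear wasp-waisted net would learn.

```python
# Splitting at the waist: decode an injected code with the top half only,
# and show that variation not encoded at the waist cannot be reproduced.
import numpy as np

rng = np.random.default_rng(2)
n, k = 12, 3
mix = rng.normal(size=(k, n))
X = rng.normal(size=(500, k)) @ mix       # patterns with 3 regularities

# The top k right singular vectors play the role of the trained waist.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
waist = Vt[:k]

def encode(x):          # bottom half: pattern -> k waist signals
    return x @ waist.T

def decode(c):          # top half alone: injected code -> pattern
    return c @ waist

regular = X[0]
code = encode(regular)                    # the waist "model" of the pattern
err_regular = float(np.linalg.norm(decode(code) - regular))

# An "irregularity" of a kind the waist never learned to encode:
irregular = regular + rng.normal(size=n)
err_irregular = float(np.linalg.norm(decode(encode(irregular)) - irregular))

print(f"regular pattern: {err_regular:.2e}, with irregularity: {err_irregular:.2f}")
```

The regular pattern is regenerated essentially exactly from its code; the
irregular component is simply dropped at the waist and never reappears.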

The split wasp-waisted MLP can be reconnected and redrawn like this, with
the upper half turned upside down, so that both input and output of the
whole net are at the bottom:

Figure 3        ---->-----
               |          |
               >>         >>
              XXXX       XXXX
             XXXXXX     XXXXXX
            XXXXXXXX   XXXXXXXX
            ^^^^^^^^   ||||||||
            >>>>>>>>   VVVVVVVV

It's the same network as before, but perhaps now its implications may be
a little different when drawn this way. The environment is now at the
bottom of both pyramids, and the input side and the output side are
kept separate.

Let's change the training a little, and instead of asking the network to
produce at its output a reproduction of its input, we ask it to produce the
negative of its input. That's just as easy, so why not?

And as a final step, let's connect the outputs directly back to the
corresponding inputs.

Figure 4        ---->-----
               |          |
               >>         >>
              XXXX       XXXX
             XXXXXX     XXXXXX
            XXXXXXXX   XXXXXXXX
            ^^^^^^^^   ||||||||
            >>>>>>>>   VVVVVVVV
            >>>>>>> --- |||||||
            >>>> ---<---- ||||
            >> -----<------ ||

Now, whatever is applied at the input is annulled by the reconnected
output. If there is some other place from which input is derived, its
effects are "immediately" countered. We have a control system that
nulls out the influence of disturbances. (But remember that word
"immediately", because we will return to it later.)
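The reconnected loop can be simulated. This is my own sketch, not the
post's: I represent the trained net by its ideal linear form (negate the
waist-representable part of the input), and I add a small integrating gain,
my own addition, to keep the discrete-time loop stable.

```python
# Figure 4 as a simulation: the net's output, trained to be the negative
# of its input, is fed back and summed with an external disturbance.
import numpy as np

rng = np.random.default_rng(3)
n, k = 12, 3
mix = rng.normal(size=(k, n))
X = rng.normal(size=(500, k)) @ mix
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt[:k].T @ Vt[:k]            # encode-then-decode through the waist

def net(x):                      # trained job: output the negative of the input
    return -(x @ P)

def run_loop(d, gain=0.5, steps=200):
    o = np.zeros(n)
    for _ in range(steps):
        i = d + o                # sensed input = disturbance + fed-back output
        o = o + gain * net(i)    # output adjusts to annul the input
    return float(np.linalg.norm(d + o))  # residual, uncancelled input

d_seen = rng.normal(size=k) @ mix   # a disturbance like the training patterns
d_novel = rng.normal(size=n)        # a disturbance of a kind never seen

print(f"residual, seen kind:  {run_loop(d_seen):.2e}")
print(f"residual, novel kind: {run_loop(d_novel):.2f}")
```

A disturbance of the kind seen during learning is nulled essentially
completely; a disturbance off the learned regularities is only partly
opposed, exactly the limitation described above.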

What kind of a control system is it? It has multiple scalar inputs
and multiple corresponding scalar outputs. Internally, it has scalar
signals running around. But nowhere is there anything corresponding to an
"Elementary Control Unit". A simple MLP has produced a model representation
of the world as its "perception" and used that "model" to generate the
output that opposes the external input (disturbance) _pattern_, using the
regularities (redundancies) in the input patterns to which it was exposed
while it was learning. But it won't effectively oppose patterns of different
kinds that it did not see while learning.

Now let's talk about "immediately." There is always some delay through
every node, whether the network is simulated in a computer or exists in
a real analogue world. So the output cannot match or oppose an input that
is changing irregularly faster than the time it takes signals to pass
through from input to output. However, if there are regularities over
time in the input, it is possible that the network could learn to produce
the appropriate output, more or less, despite the throughput delay. It
would seem to predict, but it is a strange kind of "prediction," which
does not occur explicitly anywhere in the network.
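This implicit prediction is easy to exhibit in a toy case of my own
devising: the input is a regular sinusoid, and the output stage only sees
it some samples late. A two-tap linear unit, fitted by least squares,
stands in for what the network could learn.

```python
# Implicit prediction through delay: from delayed samples of a temporally
# regular input, produce the output that opposes the *current* input.
import numpy as np

t = np.arange(400)
x = np.sin(2 * np.pi * t / 50)   # temporally regular input
delay = 7                        # throughput delay, in samples

# At time t the unit sees only x[t - delay] and x[t - delay - 1],
# and must produce -x[t], the output that would annul the current input.
T = np.arange(delay + 1, len(x))
A = np.stack([x[T - delay], x[T - delay - 1]], axis=1)
target = -x[T]

w, *_ = np.linalg.lstsq(A, target, rcond=None)
residual = float(np.linalg.norm(A @ w - target))

# Using the delayed sample directly, with no "prediction", does worse.
naive_residual = float(np.linalg.norm(-x[T - delay] - target))

print(f"learned: {residual:.2e}, naive: {naive_residual:.2f}")
```

Nothing in the fitted unit is labelled "prediction"; it is just a pair of
weights, yet the throughput delay is fully compensated for this regular
input.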

Having seen that the wasp-waist network can turn into a strange kind of
control system, let's vary it some more. Now we will fix the upper half
of the original network in an arbitrary configuration that does not
participate in the learning procedure. It is what it is, and there's an
end to it.

Figure 5    YYYYYYYYYYYY
              YYYYYYYY       (fixed and "mysterious")
                YYYY
              -- ···· --
                XXXX
              XXXXXXXX       (adaptable and able to learn)
            XXXXXXXXXXXX

The MLP that is left is a pyramid with a wide base and a narrow
top, but the learning job is still to make the output (at the wide top of
an inverted pyramid of "Y" symbols) match the inverse of the input. This
is obviously possible if the "Y" connections are appropriate, meaning that
they permit the existence of an encoding that can be decoded back into
the input. The "X" weights can learn through a standard back-propagation
algorithm, even if the "Y" weights don't change, and what they must learn
is hardly any more difficult than before, provided that the "Y" connections
are such that a "model" _can_ exist at the "cut-mark" waist.
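Learning through a frozen top half can also be sketched. This is my own
construction, linear again for reliability: the Y weights are fixed and
random, only the X weights are adjusted, with the gradient passed through
the unchanging Y connections; and, per the proviso above, the inputs are
built so that a waist code decodable by this particular Y exists.

```python
# Figure 5 as code: only the bottom-half X weights learn, by gradient
# descent through the fixed, "mysterious" top-half Y weights.
import numpy as np

rng = np.random.default_rng(4)
n, k = 12, 3
W_y = rng.normal(size=(k, n))            # fixed top half, never updated
X = rng.normal(size=(100, k)) @ W_y      # patterns this Y *can* decode
W_x = rng.normal(0, 0.1, size=(n, k))    # learnable bottom half

lr = 5e-4
errs = []
for _ in range(20000):
    out = (X @ W_x) @ W_y                # forward through the frozen Y
    err = out + X                        # learning job: output = -input
    errs.append(float(np.mean(err ** 2)))
    # The gradient reaches W_x through the unchanging Y weights.
    W_x -= lr * X.T @ (err @ W_y.T) / len(X)

print(f"error: {errs[0]:.3f} -> {errs[-1]:.6f}")
```

Even though the Y half never changes, the X half finds a waist code that
the fixed Y decodes into the inverse of the input.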

Now we can re-invert the top part, and even go further than in Figure 4,
as in the next picture. Again the X part learns to compensate for the
relationship between its output and the effects of the Y part on its
own inputs.

Figure 6        ----------
               |          |
               >>         |
              XXXX        |
            XXXXXXXX      |   (adaptable and able to learn)
          XXXXXXXXXXXX    |
          ^^^^^^^^^^^^    |
          >>>>>>>>>>>>    VV
          YYYYYYYYYYYY    |
            YYYYYYYY      |   (Unknowable "world")
              YYYY        |
               >>         |
                ----------

These "fixed" Y connections might be in the outer world completely or in
part, so far as the X (learning) part of the network is concerned. The
way the wasp-waist "model" is constructed depends on the way that the X
outputs affect the Y outputs that in turn affect the X inputs...

Why should the Y pyramid not _be_ the outer world, and the top XXXX be the
muscular effectors that act, in a low-dimensional way, on this
high-dimensional world? Or perhaps some of the Y part is output
musculature, and some of it is in the outer world.

Is this "model-based" control, "neural network" control, or what?

This picture lacks several things. It is just a waspish view of the control
elephant. There is no "reference" input pattern, and no "disturbance,"
though we have mentioned them. And, perhaps most importantly, we haven't
even considered hierarchy, because what we will arrive at in the end is
the normal HPCT hierarchy, or something very like it, even though at
this point we seem to have gone far from it.

This message is long enough to be going on with, and I should be going
home. I hope that it is enough to point up a direction through the fog,
and to at least indicate how a scalar control hierarchy can look sometimes
like a perceptually model-based complex one-level control system, sometimes
like an output-modelled (Artificial Cerebellum) system, sometimes like
Shannon's neural-net-based system, and that these ways of looking are
not mutually inconsistent. They are views of the same elephant.

Martin