An argument for hierarchy

[Martin Taylor 960304 17:20]

I've been having a sporadic private conversation with Hans Blom, in the
course of which he pointed out that the generic Kalman Filter model with
N inputs would presumably start with full connectivity (N^2 connections)
even though after learning it would be quite likely that many of the
connections would be found to have weights very near zero. However, it
would be nearly impossible to specify in advance which connections should
be left out.

Ignoring the control aspects of the situation, I treated the problem as
one of determining the form of a function F(x1,...,xn) from incoming data
x1,...,xn and the value of that function. The result seems to be
a description of the perceptual side of an HPCT hierarchy, along with an
explanation of why the hierarchy should not be expected to be composed
simply of linear perceptual functions. As follows (slightly edited):

----------------
In many cases, a good first approximation to F(x1, ..., xn) is
F = f1(x1) + f2(x2) + ... + fn(xn). To find an optimum solution of that kind
is easier than to deal with F itself. There are n connections rather than
(n^2)/2 + n, as there would be if quadratic interconnections were to be
permitted, or n + (n^2)/2 + ... + (n^n)/n!, which is of order exp(n), in a
full representation of F. Of course, the fk have an undefined number of
degrees of freedom themselves, but again, a linear first approximation is
often reasonable.
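
To put numbers on those counts, here is a small sketch. The function names
are mine, not anything from the discussion with Hans, and the last function
is an added comparison: it counts distinct subsets of variables exactly
(binomial coefficients), whereas the estimates above use n^k/k!.

```python
from math import comb, exp, factorial

def additive_terms(n):
    """Terms in F ~= f1(x1) + ... + fn(xn): one per input."""
    return n

def quadratic_terms_estimate(n):
    """The estimate n + (n^2)/2 for single plus pairwise terms."""
    return n + n * n / 2

def full_terms_estimate(n):
    """The series n + (n^2)/2! + ... + (n^n)/n! for all orders up to n."""
    return sum(n ** k / factorial(k) for k in range(1, n + 1))

def exact_subset_count(n, max_order):
    """Exact number of distinct variable subsets of size 1..max_order."""
    return sum(comb(n, k) for k in range(1, max_order + 1))

if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, additive_terms(n), quadratic_terms_estimate(n),
              round(full_terms_estimate(n)), round(exp(n)),
              exact_subset_count(n, n))
```

Either way of counting grows exponentially with n (the exact subset count
is 2^n - 1), which is the point: only the additive form stays cheap.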

In what follows, I'm treating the perception of the world, which includes
the output-to-input relations of control systems, but to do things properly
one ought to analyze the output function separately as well as together
with the input function. I'm ignoring that complication. What follows is
complex enough for one posting;-)

If F is treated as a "model" of the universe in the input of a control
system, then the fk can be treated as the perceptual input functions of
a one-level array of scalar elementary control units. But if they are,
then the on-line perceptual control action of the ECUs compensates pretty
well for any nonlinearities that do not affect the monotonicity of the
function. fk(x) ~= c*x works almost as well as fk(x) itself. So would
fk(x) ~= c*log(x), which is physiologically often reasonable. So, for the
purposes of moderately good (but non-optimal) control, the learning process
should be able to ignore the form of fk, and should concentrate on the
errors inherent in the modularization of F into sum(fk), ignoring the
interactions among the fk.
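
A toy simulation may make the claim about monotonic nonlinearities
concrete. This is only a sketch under assumptions of my own choosing: a
single ECU with an integrating output function, an environment in which
x = output + disturbance, and arbitrary values for gain, time step and
disturbance. The reference is set to the perceived value of the intended x,
so the two perceptual functions can be compared on the same footing.

```python
import math

def run_loop(perceive, target, gain=5.0, dt=0.01, steps=2000):
    """Simulate one integrating control unit; return the final value of x.

    perceive: monotonic perceptual function fk (assumed form)
    target:   value of x whose perception is used as the reference
    """
    reference = perceive(target)
    output = 1.0
    x = output
    for step in range(steps):
        # A step disturbance switches sign halfway through the run.
        disturbance = 0.5 if step < steps // 2 else -1.0
        x = output + disturbance
        error = reference - perceive(x)
        output += gain * error * dt   # integrating output function
    return x

x_linear = run_loop(lambda x: 2.0 * x, target=4.0)   # fk(x) = c*x
x_log = run_loop(math.log, target=4.0)               # fk(x) = log(x)

if __name__ == "__main__":
    print(x_linear, x_log)
```

In this setup both loops bring x back to the same value despite the
disturbance. The logarithmic unit is just slower to do so, because its
effective loop gain depends on where x is.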

There are two main sources of possible error. Firstly, each xk is a projection
from a space of unknown dimensionality and complexity onto a single axis,
and secondly there are unknown possibilities for interactions among the
dimensions. It may be, for example, that f10(x10) = 3*x10 when f35(x35) has
the value 21, but f10(x10) = -200*x10 when f35(x35) has the value 20. (Example:
putting something "on" a table when your hand is just above the table top
versus when your hand is just below the table top. The behaviour of the
world when you let go of the object is quite different in the two cases.)

The learning solutions to the two sources of error are intermixed. If the
fk are linear, which by the assumption above is not a bad first step, then
the second approximation to F is to include terms in fi*fj. But the analytic
solution to this problem is to do a principal components rotation of the
space, eliminating the quadratic terms by redefining the xk within the
original space. In PCT terms, this means altering the perceptual input
functions of the single-level array of ECUs so that they tend toward
mutual orthogonality.
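
For two variables the principal components rotation can be written out by
hand. The sketch below takes the data-covariance reading of the rotation,
which is one interpretation of the paragraph above rather than the only
one; the synthetic data, the seed and the function names are assumptions
made for illustration.

```python
import math
import random

def covariance(a, b):
    """Population covariance of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / len(a)

def principal_rotation(x1, x2):
    """Rotate (x1, x2) so that the rotated variables are uncorrelated."""
    theta = 0.5 * math.atan2(2.0 * covariance(x1, x2),
                             covariance(x1, x1) - covariance(x2, x2))
    c, s = math.cos(theta), math.sin(theta)
    y1 = [c * u + s * v for u, v in zip(x1, x2)]
    y2 = [-s * u + c * v for u, v in zip(x1, x2)]
    return theta, y1, y2

# Synthetic, strongly correlated inputs (assumed for the demonstration).
rng = random.Random(0)
x1 = [rng.gauss(0.0, 1.0) for _ in range(500)]
x2 = [0.8 * u + 0.3 * rng.gauss(0.0, 1.0) for u in x1]

theta, y1, y2 = principal_rotation(x1, x2)

if __name__ == "__main__":
    print("before:", covariance(x1, x2))
    print("after: ", covariance(y1, y2))
```

The rotated y1 and y2 span the same plane as x1 and x2, but their
cross-covariance vanishes up to rounding error.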

In fact, I argued as long ago as 1973 that developing data-based orthogonality
is exactly what peripheral sensory systems do, and I think other people have
made similar points, though I don't know what the present thinking is among
sensory physiologists. Bill Powers, similarly, has argued that the
reorganization process will inevitably lead to an orthogonalization of the
perceptual variables within a control level, because that is the situation
in which there is minimum conflict among the ECUs within the level, and
therefore it is the situation in which reorganization is slowest.

The process of orthogonalizing the first-level perceptual variables will
not eliminate all quadratic contributions to F(x1,...,xk,...,xn), but it will
eliminate the ones related to the linear components of fk(xk). Those that
remain will have no greater a sum than in the initial arbitrary partitioning
of the universe into the xk, and probably will be a great deal smaller. But
they will exist. Let us call them "intrinsic" quadratic interactions that are
inherent in the environment.

The set of {xk} spans the perceptible universe. Rotating that set into a
principal components configuration does not increase or decrease the
dimensionality of the space being observed, but it reduces the interactions
among the terms, reducing the interference of one scalar control on another
if the fk are taken to be the perceptual functions of individual ECUs. The
"intrinsic" interactions represent interference between individual ECUs
in the one-level array. Since the linear components have been orthogonalized
in this one-level array, such individual interferences can only be handled
by controlling fk*fj directly -- or more properly, by controlling the two-way
interactions, quadratic or not, fkj=f(fk, fj).

In a monolithic model, the interactions can be introduced by modifying
F ~= f1(x1) + ... + fn(xn) to add terms in fij(fi(xi),fj(xj)) for
those i, j that have appreciable intrinsic interactions. Using scalar
control systems, the interaction term fij _could_ be introduced by
adding a new scalar control system in parallel with the initial array,
but such an added control system would conflict directly with the fi
and fj control units already existing. To avoid such a conflict, the new
control system would have to supplant fi and fj, and be vector rather than
scalar valued. It would have as inputs both xi and xj, with two outputs
and a control function that incorporated fi, fj, and fij in one monolithic
vector function. And even that approach would not work if fi had an intrinsic
interaction with fk as well as with fj. The fik control unit and the fij
control unit could not both comfortably supplant fi without conflict.

In a system of scalar control units, the intrinsic interactions are
more naturally handled by controlling fij directly, treating fij as
the function of fi and fj that it is. In other words, the fij control
is a scalar control unit whose inputs are the results of functions fi
and fj, and whose output uses the outputs of the fi and fj control units
to affect xi and xj, rather than conflicting with the fi and fj controls.
Using this approach, it does not matter if fi has an additional intrinsic
interaction with fk. The fij and the fik control units act independently,
and neither "sees" or acts upon the outer world directly. Their "seeing"
and acting are through the existing maximally orthogonalized first-level
controls on xi, xj, and xk.
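
A minimal two-level simulation of this arrangement follows. It is a sketch
under several assumptions that go beyond the text: the second-level
perception is taken to be the product of the two first-level perceptions,
the second-level output is fed equally to both first-level references (the
usual HPCT wiring), the first-level perceptual functions are the identity,
and the gains and disturbances are arbitrary.

```python
def run_hierarchy(target_product=6.0, steps=3000, dt=0.01):
    """Two first-level units plus one second-level unit controlling fij."""
    gain1, gain2 = 10.0, 0.5      # assumed gains: level 2 slower than level 1
    d_i, d_j = 0.3, -0.2          # assumed constant disturbances
    o_i = o_j = 0.0               # first-level outputs
    o_ij = 1.0                    # second-level output
    x_i = x_j = 0.0
    for _ in range(steps):
        # Environment: each x is its unit's output plus a disturbance.
        x_i, x_j = o_i + d_i, o_j + d_j
        # First-level perceptions (identity perceptual functions).
        p_i, p_j = x_i, x_j
        # Second level sees only first-level perceptions, never x directly.
        p_ij = p_i * p_j
        o_ij += gain2 * (target_product - p_ij) * dt
        # Second level acts only by setting the first-level references.
        r_i = r_j = o_ij
        o_i += gain1 * (r_i - p_i) * dt
        o_j += gain1 * (r_j - p_j) * dt
    return x_i, x_j, o_ij

x_i, x_j, o_ij = run_hierarchy()

if __name__ == "__main__":
    print(x_i, x_j, x_i * x_j)
```

The second-level unit gets its perception where it wants it without ever
opposing the first-level units; they simply do what their references ask.
A further unit for fik could share the fi unit in the same way, although in
this simple wiring the references it supplies would have to be summed with
those from the fij unit.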

Used in conjunction with the first level array, this second-level array
provides a second approximation to the original and true (unknowable)
function F.

Of course, this second-level array of scalar control units may itself
prove to be non-orthogonal, and therefore may experience conflict within
the layer. But if we treat each fij as a new function g(n) of a new set of
variables yn, where yn is the output of an arbitrary one of the original
first-level functions fk, we can see the second level by itself as
representing an approximation to a quite new function G(x1,..,xn), where
G represents the failure of the first approximation F ~= (f1,...,fn).
We have created a linear approximation G ~= (g1,...,gn), where the gk
represent the quadratic intrinsic interactions that could not be handled
by principal components rotation of the original xk space.

It is natural to continue this process. At level G, the functions gk can
be "naturally" rotated into a principal components configuration that
minimizes their conflict, but this rotation will probably still leave some
error in the approximation to the original F, which is a true (unknowable)
representation of the environment spanned by the sensory variables xk.

It seems to me that this process, carried to n levels, would produce an
exact representation of the unknowable F, and would do it in a learnable
fashion. How long it would take, though, would be a strong function of
n, in fact an exponential function of n. The development of level k would
depend largely on the stability and accuracy of level k-1. There would be no
point in developing complex structures that control for interactions among
elements of a level that has not been at least approximately orthogonalized,
and the initial orthogonalization takes some time. The time to develop
each level will be longer than the time to develop the previous one, in
what is presumably a self-similar process level by level. I assume that
much of the work at the lower levels has been done by evolution, and is
pretty well fixed in an individual organism, but it doesn't matter in
principle whether this is so.

The difference between this approach and the Kalman Filter approach is
that the K-F starts with all exp(n) (or at least n^2) connections and then
learns which ones to ignore, whereas the development of a set of scalar
layers starts with n units and expands when there are interconnections
that should not be ignored. Increments on each layer can be done locally,
and conflicts between individual functions within a layer can be reduced
by orthogonalization, either analytically (if you are an omnipotent
designer) or through perceptual reorganization (if you are an ordinary
organism growing toward maturity).

Martin