Kalman Filtering vs. ANOVA methods

[Hans Blom, 950531]

Short abstract for the psychologists amongst us:
Think of the Extended Kalman Filter as a real-time ANOVA method that, at
every point in time, gives you the best information (within the
constraints -- number of degrees of freedom -- of the model) about how the
free variable u influences the dependent variable xt.
Think of Extended Kalman Filter based control as using that information,
again at every point in time, to compute precisely that value of the free
variable u that drives the value of the dependent variable xt toward the
reference level value xopt at the next sample time.

(Bill Powers (950530.1315 MDT))

I need a little help here. I increased the value of the disturbance of y
by modifying a program step as follows:

             >>>
             vvv
   y := xt + 2.0*vt * (sin (run / 50) + sin (run / 155))

Don't. Instead, set vt to twice its value at the place where it gets its
value, and set pvv to vt*vt (in terms of the original vt, that is
4*vt*vt). Theory assumes pvv to be the square of the RMS value (the
maximum is fine, too) that the observation noise can have. These two, vt
and pvv, ought to go together always! If not, funny results appear. In my
last demo I made this easier: you can initialize both vto and pvvo at
the top of the program. The term pvv tells the filter how much
observation noise variance to expect. Setting pvv to vt*vt is correct
information. Setting pvv far too low or far too high gives interesting
(but incorrect) results: set pvv too low and you get instability; set
pvv too high and you get very slow convergence.

Also, vt is expected by the filter to be zero on average. If you supply a
vt which does not have this characteristic, interesting (but incorrect)
results appear again.

Experiments that violate these assumptions will generally lead to
results that are very difficult to interpret ;-).
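
In code: if your vt is, say, 0.5, the consistent way to double the
disturbance of y is something like this (names vt and pvv as in the
demo; adjust to your copy of the program)

   vt  := 1.0;          { twice the old 0.5, set where vt gets its value }
   pvv := vt * vt;      { and tell the filter to expect that much        }

   y := xt + vt * (sin (run / 50) + sin (run / 155))   { no extra 2.0 here }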

I was wondering why it is necessary to specify in advance the amplitude
of the disturbances. Does the program need to know them in advance?

Yes, the program needs to know the expected values of the variances of
both noise terms, system noise and observation noise, in order to be able
to discriminate between the two, so that it can reject observation noise
but use system noise (not yet modelled dynamics) in order to tune the
parameters.

It is, however, possible to design methods to estimate the proper values
of pvv and pff if these cannot be pre-specified. In practice it is
usually possible to pre-specify pvv, because it represents a sensor
property. A method to estimate pff "on the fly" was given in my first
demo.
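
One way to do such an estimation (not necessarily the way of that demo)
is to watch the prediction errors y - x: their mean square should, on
average, equal pxx + pvv, so whatever is left after subtracting the
predicted pxx is a rough estimate of pvv. In code, with innov, msq and
pvvest as names of my own choosing:

   innov  := y - x;                      { x is the predicted value here }
   msq    := 0.99 * msq + 0.01 * innov * innov;   { running mean square  }
   pvvest := msq - pxx;                  { E [(y-x)*(y-x)] = pxx + pvv   }
   if pvvest < 0.0 then pvvest := 0.0;

With pvv known instead, the same bookkeeping can be turned around to
give an impression of how much pff contributes to pxx.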

I'm going to have to go back to the negative feedback control system and
see what would actually happen with a perceptual disturbance. I thought
I had it figured out, but I don't understand the results your program
gives. ... Can you explain how the Kalman filter works in words of no
more than three syllables?

It is difficult to come to understand Extended Kalman Filter theory just
by playing with a demonstration, especially when you are convinced that
you can explain everything in terms of a different theory (PCT versus SR,
for instance :-). Basically, it is an on-line multiple regression
estimation method, very much like the ANOVA that psychologists are
familiar with. The difference is that with the Extended Kalman Filter it
is not necessary to collect a whole batch of data before correlations can
be computed; rather, the correlations are recomputed as soon as a new
sample y (or, generally, a new set of observations) comes in. The
advantage is, of course, that the results can be USED from the very
start, in our case to control. So basically we have an on-line multiple
regression estimation method, and only as an afterthought are its results
used for the purpose of control. If you understand ANOVA, you are close
to understanding Extended Kalman Filters.
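
To make the analogy concrete, here is the simplest on-line 'regression'
I can think of (my sketch, not the demo): estimate a single coefficient
b in y = b*u + noise, updating it after every sample instead of after a
whole batch; p plays the role of pbb, the uncertainty of b, and pvv that
of the noise variance:

   k := p * u / (p * u * u + pvv);   { how much to believe this sample  }
   b := b + k * (y - b * u);         { move b toward what y suggests    }
   p := (1.0 - k * u) * p;           { and record the gain in certainty }

Given a sufficiently large initial p, running this over a whole batch
ends up with essentially the same b that a batch regression over the
same data would give; the difference is that every intermediate estimate
is available, and usable, along the way.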

But let me try a short explanation again, although I cannot guarantee the
three syllables :-). Basically, the algorithm is:

1. assume (i.e. model) the form of the equations that govern the world;
   my demo models a very, very simple world that is described by these
   equations:

    xt := ct + at * xt + bt * u + ft (the dynamics)
    at := at (meaning: at is constant)
    bt := bt
    ct := ct

   ft stands for everything that we cannot (or do not want to) model and
   is, lacking more information, assumed to be best represented as
   normally distributed zero average white noise (usually this works; if
   more is known about ft, use a more advanced model in combination with
   a more advanced estimation method)

   initialize the model's parameters x, a, b and c to the values of xt,
   at, bt, and ct; because the latter are generally not accurately known,
   assign sufficiently large uncertainties (pxx, paa, etc.) to the model
   parameters in such a way that e.g. paa := worst_case [(at-a) * (at-a)]

2. use the model of the system to predict, using the equations of step 1,
   what the values of the observable variables of the system will be
   (predicted x, a, b, and c); this prediction takes the form of a
   probability distribution, of which we take into account only terms of
   the first order (x, a, b, c) and second order (pxx, pax ... pcc)

   example: the system has one observable dimension xt, represented by
   the model variable x; the model predicts that at the next observation
   x is going to have the value 10 +/- 3

3. do an observation of the observable variables of the system (xt,
   observable through y); this observation may be contaminated by noise
   (vt); in that case, the observation results in a probability
   distribution as well; the equation that governs the observation is

    y := xt + vt

   where vt stands for observation noise/inaccuracy and is, lacking more
   information, assumed to be best represented as normally distributed
   zero average white noise (usually this works; if more is known about
   vt, use a more advanced model in combination with a more advanced
   estimation method)

   example: xt can be observed with an accuracy of +/- 2; the observation
   y has the value 11; this means that x = 11 +/- 2

   note that in the demo an observation is ALWAYS performed; a missing
   observation is modelled by a y of any (reasonable) value combined with
   a very large vt and, correspondingly, a very large pvv, so that the
   filter ignores that y; the advantage is that e.g. sensor degradation
   can be modelled by an increase of pvv

4. check whether the observation y was accurately predicted by the
   model's x, i.e. whether the two probability distributions are
   consistent; if not, increase the modelled uncertainties so much that
   enough overlap ensues

   in the above example x = 10 +/- 3 corresponds nicely with x = 11 +/-
   2; but if the observation y had been 16 +/- 2, we would adjust the
   model's prediction to x = 10 +/- 4 at this point: 10 + 4 'touches' 16
   - 2; we call this adjustment 'forgetting': we sacrifice some of our
   accumulated certainty in order to keep our internal model in line with
   our observations

   this step is required only if the parameters of the system (a, b, and
   c) can change; if they do not, the correspondence never gets lost if
   the model's parameters are initialized correctly, and forgetting is
   neither required nor wanted

5. process the observation y, i.e. combine the two probability
   distributions into one

   example: given both x = 10 +/- 3 (from the prediction) and x = 11 +/-
   2 (from the observation), the result might be something like x = 10.7
   +/- 1.5

   note that the inaccuracies (of x, a, b, and c) after processing at
   this step are both smaller than the inaccuracy resulting from the
   prediction AND smaller than the inaccuracy due to the observation;
   this increase in accuracy is called convergence

6. use the model to (somehow) compute a control that will bring the most
   probable values of the system's observable dimensions (xt, observable
   through y) to their reference levels at the next sample moment; this
   computation is based on the prescribed reference levels (xopt) and on
   (known) model parameters x, a, b, and c ONLY, NOT on the (unknown)
   true parameters xt, at, bt, and ct of the 'world'

7. goto step 2
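
If a concrete piece of code says it better, here is a stripped-down,
one-dimensional sketch of this loop (my sketch, not the demo itself):
only x is estimated, a, b and c are taken as known and correct, the
forgetting of step 4 is left out, and gauss is just a stand-in noise
source; the demo does the same thing, but estimates a, b and c as well,
with all the cross-terms pax .. pcc.

  program ekf_loop_sketch;
  const
    at = 0.8;  bt = 0.5;  ct = 1.0;   { the 'world', hidden from the model }
    a  = 0.8;  b  = 0.5;  c  = 1.0;   { the model (here simply correct)    }
    vt = 0.5;  pvv = 0.25;            { RMS observation noise, its variance }
    ft = 0.1;  pff = 0.01;            { RMS system noise, its variance      }
    xopt = 5.0;                       { reference level                     }
  var
    xt, x, pxx, u, y, k: real;
    run: integer;

  function gauss: real;               { zero-average, unit-variance noise   }
  var i: integer; s: real;
  begin
    s := 0.0;
    for i := 1 to 12 do s := s + random;
    gauss := s - 6.0
  end;

  begin
    xt := 0.0;  x := 0.0;  u := 0.0;
    pxx := 100.0;                     { large initial uncertainty in x      }
    for run := 1 to 100 do
    begin
      xt := ct + at * xt + bt * u + ft * gauss;   { the world moves on      }
      x   := c + a * x + b * u;       { step 2: predict x ...               }
      pxx := a * a * pxx + pff;       { ... and its uncertainty             }
      y := xt + vt * gauss;           { step 3: observe, with noise         }
      k   := pxx / (pxx + pvv);       { step 5: combine prediction and      }
      x   := x + k * (y - x);         { observation; k says how much of     }
      pxx := (1.0 - k) * pxx;         { y to believe                        }
      u := (xopt - c - a * x) / b;    { step 6: make the predicted x equal  }
                                      { xopt at the next sample             }
      writeln (run:4, xt:10:3, x:10:3, y:10:3, u:10:3)
    end
  end.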

You may need to brush up on the terms information, representation,
distribution, accuracy, uncertainty, and parameter, all having more than
three syllables ;-).

(Bill Powers (950530.1130 MDT))

I have finally realized why the disturbance of the variable y has so
little effect. Cancel all my previous guesses. It is because the
difference x - xt has a very small effect on the world-model.

You have discovered convergence! See step 5 above. Let's change the
example:

   example: given both x = 10.0000 +/- 0.0010 (from the prediction, which
   is by now very accurate) and x = 11.1 +/- 2.0 (from the observation,
   which is quite noisy), the result might be x = 10.0001 +/- 0.0009
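
   to put a number on it: the correction applied to x is proportional to
   the gain pxx / (pxx + pvv), and here

      pxx / (pxx + pvv) = 0.0010 * 0.0010 / (0.0010 * 0.0010 + 2.0 * 2.0)
                        = 0.000001 / 4.000001, about 0.00000025

   so the noisy observation of 11.1 hardly moves x at all, and pxx
   shrinks by only that same negligible fraction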

A noisy observation helps very little in further tuning the model
parameters if the latter are already very accurate. This explains

... But soon the values of the coefficients become very small, and then
both y and disturbances of y have only very tiny effects except over
thousands of iterations.

And I would change your

... the oscillating disturbance of the perception is simply not
perceived by the model-based control system. It is averaged out.

into

... the oscillating disturbance of the perception, although perceived by
the model-based control system [in y], is averaged out [as it carries no
information about how u influences xt].

When a _constant_ disturbance of +1.0 is added, rather than an
oscillating disturbance with an average value of zero ...

This contradicts the assumptions on which the model is based. Substitute
a zero-average square wave and I'm willing to talk again.

... If you set the disturbance of y so that
y = xt + 2.0*vt*(sin(run/1900))

Have you thought of setting pvv to 4*vt*vt in this case? Better yet, set
vt to twice its value and pvv to vt*vt.

This makes it clear that "unmodeled dynamics" whether at the input or
the output are unopposed by this version of the adaptive control system.

Have you taken into account that a slow phenomenon can only be learned
slowly as well?

The best way to think of the Kalman-filter adaptive system is as a
pattern generator which can be gradually altered to drive an external
system in a desired way, open-loop.

If that helps you, fine. To a psychologist I would say: think of it as an
ANOVA method that at every point in time gives you the best information
(within the constraints -- number of degrees of freedom -- of the model)
about how the free variable u influences the dependent variable xt, and
use that information to compute a value of the free variable u such that
the value of the dependent variable xt is driven toward the value xopt at
the next sample time.
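
For the simple model of the demo, that computation amounts to no more
than inverting the model equation one step ahead; in the notation of
step 1 above (and assuming b is not zero), essentially

   xopt = c + a * x + b * u          (what we want at the next sample)
   u := (xopt - c - a * x) / b       (so this is the u to apply now)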

The behavior of this system does NOT control perceptions ...

OK. We have found a system ("pseudo-controller"?) that does NOT control
perceptions.

... except at exceedingly low frequencies.

That is an artifact. I have just checked this in my last demo, as you
suggested, by setting y = xt + vt*(sin(run/1900)) and pvv to vt*vt (any
reasonable vt, ft = 0, pff = 0.000001). As expected, learning, i.e.
convergence, takes a long time, but the outcome is the same as at the
other frequencies. This becomes noticeable starting from run 30,000 or
so: by then xt closely tracks xopt, whereas y still fluctuates around
xopt as usual. Try it, it only takes a few minutes on a 486/50...

... The original program controls perceptions when the disturbance is
simply added to y without changing anything else in the program.

Are you sure that you observed the rules concerning ft, pff, vt, and pvv?

When I checked the value of pxx/(pxx+pvv) it came out 0.99, instead of
0.0003 as in the "pseudo" demo. The reason was that the observation
noise was 1e-6 instead of 0.5.

Explanation: the filter is told that there is (practically) no
observation noise. If in fact there is, things may go haywire. No
observation noise implies that the observation can be fully used for
further tuning of the parameters. In procedure process_observation x is
tuned as follows:

  x := x + pxx * (y - x) / (pxx + pvv) reduces to
  x := x + pxx * (y - x) / pxx for pvv = 0, so
  x := x + y - x so
  x := y

as it should be, because y is then taken to be a fully accurate
measurement of xt, regardless of the value of pxx.
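
The opposite extreme works out just as neatly: for a pvv that is very
large compared to pxx (a worthless or 'missing' observation, see step 3)

  x := x + pxx * (y - x) / (pxx + pvv) reduces to
  x := x + (nearly) 0 * (y - x) so
  x := x

and the observation is simply ignored.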

I have noticed something in your "normal" routine. What it does is to
give the same value of random noise twice in a row, then change to the
new value twice in a row, and so on.

You cannot have noticed that, because it is not true. Make a small test
program to print 10 successive values of calls of function "normal" and
verify that the same value is never repeated.
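
Something along these lines will do; it will not compile on its own, so
paste the demo's function "normal" into it (or append the loop to the
end of the demo), and adjust the call if your copy of normal takes
parameters:

   program testnormal;
   { paste function "normal" from the demo here }
   var i: integer;
   begin
     for i := 1 to 10 do writeln (normal)
   end.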

Can you explain why it is necessary to tell the model what magnitude of
disturbance to expect?

The model assumes TWO types of disturbances: one type (the observation
noise) must be disregarded; the other type (the not yet modelled
dynamics) is used to (further) tune the model parameters. These are very
different things, which must be discriminated between.
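
In terms of the program's variables, the two expectations enter at two
different places; very roughly, for the x-component alone (my paraphrase,
not a literal quote of the code):

   pxx := a * a * pxx + pff               { predicting: the system noise  }
                                          { variance keeps pxx, and thus  }
                                          { the tuning, alive             }
   x := x + pxx * (y - x) / (pxx + pvv)   { processing: the observation   }
                                          { noise variance decides how    }
                                          { much of each new y to believe }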

Hope all this helps a little...

Greetings,

Hans