
[Hans Blom, 950531]

Short abstract for the psychologists amongst us:

Think of the Extended Kalman Filter as a real-time ANOVA method that, at every point in time, gives you the best information (within the constraints -- number of degrees of freedom -- of the model) about how the free variable u influences the dependent variable xt.

Think of Extended Kalman Filter based control as using that information, again at every point in time, to compute precisely that value of the free variable u that drives the value of the dependent variable xt toward the reference level value xopt at the next sample time.

(Bill Powers (950530.1315 MDT))

I need a little help here. I increased the value of the disturbance of y by modifying a program step as follows:

y := xt + 2.0*vt * (sin (run / 50) + sin (run / 155))

Don't. Set vt to twice its value instead at the place where it gets its value. Also set pvv to the value vt*vt, or rather, in this case, to 4*vt*vt. The theory assumes the term pvv to be the square of the RMS (maximum is fine, too) value that the observation noise can have. These two, vt and pvv, ought always to go together! If not, funny results appear. In my last demo I made this easier: you can initialize both vto and pvvo at the top of the program. The term pvv tells the filter how much observation noise variance to expect. Setting pvv to vt*vt is correct information. Setting pvv far too low or far too high gives interesting (but incorrect) results: set pvv too low and you get instability; set pvv too high and you get very slow convergence.
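The effect of a mis-specified pvv is visible directly in the filter's gain pxx/(pxx+pvv), which determines how much of each prediction error (y - x) is accepted into the state estimate. A minimal sketch (in Python rather than the demo's Pascal; the numbers are illustrative assumptions, not values from the demo):

```python
# Scalar Kalman gain: pxx = current state-estimate variance,
# pvv = the observation-noise variance the filter is told to expect.
def kalman_gain(pxx, pvv):
    return pxx / (pxx + pvv)

pxx = 1.0  # illustrative state uncertainty

# pvv matched to the true noise variance: a moderate, sensible gain
print(kalman_gain(pxx, 1.0))       # 0.5

# pvv far too low: gain near 1, the filter chases every noise sample
print(kalman_gain(pxx, 0.0001))    # close to 1 -> instability

# pvv far too high: gain near 0, observations are almost ignored
print(kalman_gain(pxx, 10000.0))   # close to 0 -> very slow convergence
```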

Also, vt is expected by the filter to be zero on average. If you supply a vt which does not have this characteristic, interesting (but incorrect) results appear again.

Experiments which violate these assumptions will generally lead to results which will be very difficult to interpret ;-).

I was wondering why it is necessary to specify in advance the amplitude of the disturbances. Does the program need to know them in advance?

Yes, the program needs to know the expected values of the variances of both noise terms, system noise and observation noise, in order to be able to discriminate between the two, so that it can reject observation noise but use system noise (not yet modelled dynamics) in order to tune the parameters.

It is, however, possible to design methods to estimate the proper values of pvv and pff if these cannot be pre-specified. In practice it is usually possible to pre-specify pvv, because it represents a sensor property. A method to estimate pff "on the fly" was given in my first demo.

I'm going to have to go back to the negative feedback control system and see what would actually happen with a perceptual disturbance. I thought I had it figured out, but I don't understand the results your program gives. ... Can you explain how the Kalman filter works in words of no more than three syllables?

It is difficult to come to understand Extended Kalman Filter theory from playing with a demonstration, especially when you are convinced that you can explain everything in terms of a different theory (PCT versus SR, for instance :-). Basically, it is an on-line multiple regression estimation method, very much like the ANOVA that psychologists are familiar with. The difference is that in the Extended Kalman Filter it is not necessary to collect a whole batch of data before correlations can be computed; rather, the values of the correlations are recomputed as soon as a new sample y (or, generally, a new set of observations) comes in. The advantage is, of course, that from the very start the results can be USED, in our case to control. So basically we have an on-line multiple regression estimation method, and only as an afterthought are its results used for the purpose of control. If you understand ANOVA, you are close to understanding Extended Kalman Filters.

But let me try a short explanation again, although I cannot guarantee the three syllables :-). Basically, the algorithm is:

1. assume (i.e. model) the form of the equations that govern the world; my demo models a very, very simple world that is described by these equations:

      xt := ct + at * xt + bt * u + ft   (the dynamics)
      at := at                           (meaning: at is constant)
      bt := bt
      ct := ct

   ft stands for everything that we cannot (or do not want to) model and is, lacking more information, assumed to be best represented as normally distributed zero-average white noise (usually this works; if more is known about ft, use a more advanced model in combination with a more advanced estimation method).

   Initialize the model's parameters x, a, b and c to the values of xt, at, bt, and ct; because the latter are generally not accurately known, assign sufficiently large uncertainties (pxx, paa, etc.) to the model parameters, in such a way that e.g. paa := worst_case [(at-a) * (at-a)].

2. use the model of the system to predict, using the equations of step 1, what the values of the observable variables of the system will be (predicted x, a, b, and c); this prediction takes the form of a probability distribution, of which we take into account only terms of the first order (x, a, b, c) and second order (pxx, pax ... pcc).

   Example: the system has one observable dimension xt, represented by the model variable x; the model predicts that at the next observation x is going to have the value 10 +/- 3.

3. do an observation of the observable variables of the system (xt, observable through y); this observation may be contaminated by noise (vt); in that case, the observation results in a probability distribution as well; the equation that governs the observation is

      y := xt + vt

   where vt stands for observation noise/inaccuracy and is, lacking more information, assumed to be best represented as normally distributed zero-average white noise (usually this works; if more is known about vt, use a more advanced model in combination with a more advanced estimation method).

   Example: xt can be observed with an accuracy of +/- 2; the observation y has the value 11; this means that x = 11 +/- 2.

   Note that in the demo an observation is ALWAYS performed; a missing observation is modelled by a y of any (reasonable) value combined with a very large vt; the advantage is that e.g. sensor degradation can be modelled by an increase of pvv.

4. check whether the observation y was accurately predicted by the model's x, i.e. whether the two probability distributions are consistent; if not, increase the modelled uncertainties so much that enough overlap ensues.

   In the above example x = 10 +/- 3 corresponds nicely with x = 11 +/- 2; but if the observation y had been 16 +/- 2, we would adjust the model's prediction to x = 10 +/- 4 at this point: 10 + 4 'touches' 16 - 2. We call this adjustment 'forgetting': we sacrifice some of our accumulated certainty in order to keep our internal model in line with our observations.

   This step is required only if the parameters of the system (a, b, and c) can change; if they do not, the correspondence never gets lost if the model's parameters are initialized correctly, and forgetting is neither required nor wanted.

5. process the observation y, i.e. combine the two probability distributions into one.

   Example: given both x = 10 +/- 3 (from the prediction) and x = 11 +/- 2 (from the observation), the result might be something like x = 10.7 +/- 1.5.

   Note that the inaccuracies (of x, a, b, and c) after processing at this step are both smaller than the inaccuracy resulting from the prediction AND smaller than the inaccuracy due to the observation; this increase in accuracy is called convergence.

6. use the model to (somehow) compute a control that will bring the most probable values of the system's observable dimensions (xt, observable through y) to their reference levels at the next sample moment; this computation is based on the prescribed reference levels (xopt) and on the (known) model parameters x, a, b, and c ONLY, NOT on the (unknown) true parameters xt, at, bt, and ct of the 'world'.

7. goto step 2
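The loop above can be sketched in a few lines of code. This is a deliberately stripped-down version (Python rather than the demo's Pascal): the parameters a, b and c are taken as known, so only the state x and its uncertainty pxx are estimated, and all numeric values are illustrative assumptions, not the demo's settings.

```python
import random

random.seed(0)  # any seed; makes the run reproducible

# The 'world' (known to the controller only through y)
at, bt, ct = 0.9, 1.0, 0.5    # true dynamics: xt := ct + at*xt + bt*u + ft
vt_rms = 0.5                  # RMS value of the observation noise vt

# The model; only x and its uncertainty pxx are estimated here
a, b, c = at, bt, ct
x, pxx = 0.0, 100.0           # step 1: large initial uncertainty
pff = 0.01                    # expected system-noise variance
pvv = vt_rms * vt_rms         # expected observation-noise variance
xopt = 10.0                   # reference level

xt, u = 0.0, 0.0
for run in range(200):
    # the world moves on (last term: system noise ft)
    xt = ct + at * xt + bt * u + random.gauss(0.0, 0.1)
    y = xt + random.gauss(0.0, vt_rms)   # step 3: observe

    # step 2: predict x and its uncertainty from the model
    x = c + a * x + b * u
    pxx = a * a * pxx + pff

    # step 5: combine prediction and observation ('convergence')
    k = pxx / (pxx + pvv)     # gain: how much of y - x to accept
    x = x + k * (y - x)
    pxx = (1.0 - k) * pxx

    # step 6: choose u from the MODEL only, so that the model's
    # predicted next x equals xopt:  xopt = c + a*x + b*u
    u = (xopt - c - a * x) / b

print(round(x, 1), round(xt, 1))  # both end up close to xopt = 10
```

Note that the control in step 6 never looks at xt, at, bt or ct; after convergence xt tracks xopt even though y keeps fluctuating, which is the behavior described further below.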

You may need to brush up on the terms information, representation, distribution, accuracy, uncertainty, and parameter, all having more than three syllables ;-).

(Bill Powers (950530.1130 MDT))

I have finally realized why the disturbance of the variable y has so little effect. Cancel all my previous guesses. It is because the difference x - xt has a very small effect on the world-model.

You have discovered convergence! See step 5 above. Let's change the example:

Example: given both x = 10.0000 +/- 0.0010 (from the prediction, which is by now very accurate) and x = 11.1 +/- 2.0 (from the observation, which is quite noisy), the result might be x = 10.0001 +/- 0.0009.

A noisy observation helps very little in further tuning the model parameters if the latter are already very accurate. This explains

... But soon the values of the coefficients become very small, and then both y and disturbances of y have only very tiny effects except over thousands of iterations.

And I would change your

... the oscillating disturbance of the perception is simply not perceived by the model-based control system. It is averaged out.

into

... the oscillating disturbance of the perception, although perceived by the model-based control system [in y], is averaged out [as it carries no information about how u influences xt].

When a _constant_ disturbance of +1.0 is added, rather than an oscillating disturbance with an average value of zero ...

This contradicts the assumptions on which the model is based. Substitute a zero-average square wave and I'm willing to talk again.

... If you set the disturbance of y so that

y = xt + 2.0*vt*(sin(run/1900))

Have you thought of setting pvv to 4*vt*vt in this case? Better yet, set vt to twice its value and pvv to vt*vt.

This makes it clear that "unmodeled dynamics", whether at the input or the output, are unopposed by this version of the adaptive control system.

Have you taken into account that a slow phenomenon can only be learned slowly as well?

The best way to think of the Kalman-filter adaptive system is as a pattern generator which can be gradually altered to drive an external system in a desired way, open-loop.

If that helps you, fine. To a psychologist I would say: think of it as an ANOVA method that at every point in time gives you the best information (within the constraints -- number of degrees of freedom -- of the model) about how the free variable u influences the dependent variable xt, and use that information to compute a value of the free variable u such that the value of the dependent variable xt is driven toward the value xopt at the next sample time.

The behavior of this system does NOT control perceptions ...

OK. We have found a system ("pseudo-controller"?) that does NOT control perceptions.

... except at exceedingly low frequencies.

That is an artifact. I have just checked this in my last demo, as you suggested, by setting y = xt + vt*(sin(run/1900)) and pvv to vt*vt (any reasonable vt, ft = 0, pff = 0.000001). As expected, learning, i.e. convergence, takes a long time, but the outcome is the same as at lower frequencies. This becomes noticeable starting from run 30,000 or so: by then xt closely tracks xopt, whereas y still fluctuates around xopt as usual. Try it; it only takes a few minutes on a 486/50...

... The original program controls perceptions when the disturbance is simply added to y without changing anything else in the program.

Are you sure that you observed the rules concerning ft, pff, vt, and pvv?

When I checked the value of pxx/(pxx+pvv) it came out 0.99, instead of 0.0003 as in the "pseudo" demo. The reason was that the observation noise was 1e-6 instead of 0.5.

Explanation: the filter is told that there is (practically) no observation noise. If there is nonetheless, things may go haywire. No observation noise implies that the observation can be fully used for further tuning of the parameters. In procedure process_observation, x is tuned as follows:

      x := x + pxx * (y - x) / (pxx + pvv)   which reduces to
      x := x + pxx * (y - x) / pxx           for pvv = 0, so
      x := x + y - x                         so
      x := y

as it should be, because y is a fully accurate measurement of x, regardless of the value of pxx.
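The reduction is easy to check numerically; a two-line sketch in Python (the values 5.0, 11.0 and the pxx values are arbitrary):

```python
def process_observation(x, pxx, y, pvv):
    # the demo's scalar update for the state estimate x
    return x + pxx * (y - x) / (pxx + pvv)

# with pvv = 0 the update returns y exactly, whatever pxx is
print(process_observation(5.0, 0.3, 11.0, 0.0))    # 11.0
print(process_observation(5.0, 700.0, 11.0, 0.0))  # 11.0
```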

I have noticed something in your "normal" routine. What it does is to give the same value of random noise twice in a row, then change to the new value twice in a row, and so on.

You cannot have noticed that, because it is not true. Make a small test program to print 10 successive values of calls of function "normal" and verify that the same value is never repeated.
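Such a test program is a few lines in any language; a sketch in Python, using the standard library's Gaussian generator in place of the demo's Pascal function "normal":

```python
import random

random.seed(1)  # any seed; just makes the run reproducible

# print 10 successive values and check for repeats
values = [random.gauss(0.0, 1.0) for _ in range(10)]
for v in values:
    print(v)

# consecutive values should never be identical
repeats = any(a == b for a, b in zip(values, values[1:]))
print("repeated consecutive value:", repeats)
```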

Can you explain why it is necessary to tell the model what magnitude of disturbance to expect?

The model assumes TWO types of disturbances: one type (the observation noise) must be disregarded; the other type (the not yet modelled dynamics) is used to (further) tune the model parameters. Very different things, which must be discriminated between.

Hope all this helps a little...

Greetings,

Hans