Karl dialogue

This is the dialogue that I think inspired Martin’s diagram:

Dear Warren,

W: “Presumably there is also ascending input at each level from which the error is calculated with reference to the ‘prediction’?”

Yes, you could have separate units encoding the prediction of expectations at the lower level – such that the prediction error was the difference between expectations and predictions at each level. This would entail three representations: (i) an expectation at each level, (ii) a prediction based on that expectation at the level below and (iii) a prediction error at each level. Usually, in biological formulations of predictive coding, the prediction is encoded implicitly in the descending connections from the expectation units at one level to the prediction error units at the level below. In other words, the mapping from expectations to predictions is in the connectivity from expectation to error units – as opposed to having a separate encoding of the prediction.
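The difference between the two encodings can be made concrete with a minimal sketch. The linear mapping, the `weights` matrix and the function names are assumptions chosen for illustration; nothing here is taken from a particular predictive coding implementation. The explicit scheme stores the prediction as its own quantity, while the implicit scheme folds it into the descending weights, and both yield the same prediction error.

```python
# Illustrative sketch only: the linear mapping and all names are assumptions.

def explicit_scheme(expectation_above, expectation_here, weights):
    """Three representations: expectation, a separate prediction, and an error."""
    prediction = [
        sum(w * x for w, x in zip(row, expectation_above)) for row in weights
    ]
    # Prediction error = expectation minus prediction at this level.
    error = [m - p for m, p in zip(expectation_here, prediction)]
    return prediction, error


def implicit_scheme(expectation_above, expectation_here, weights):
    """Two representations: the prediction exists only in the descending weights."""
    return [
        m - sum(w * x for w, x in zip(row, expectation_above))
        for m, row in zip(expectation_here, weights)
    ]


weights = [[1.0, 2.0], [0.5, 0.0]]   # hypothetical descending connections
above = [2.0, 1.0]                   # expectations at the higher level
here = [5.0, 0.5]                    # expectations at the lower level

prediction, explicit_error = explicit_scheme(above, here, weights)
implicit_error = implicit_scheme(above, here, weights)
```

In this toy case the error units receive identical values under both schemes; the only difference is whether the prediction is ever held as a separate signal.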

With very best wishes – Karl


Dear Warren,

Many thanks for this interesting e-mail. I am not sure what format you wanted this discussion to pursue; however, I have made some comments below. I hope these are what you had in mind.

With very best wishes – Karl

Friston: Prediction means that what I choose to make happen will happen. His downward going “prediction” seems to be an output from a higher level that, as we would see it in PCT, is or contributes to a lower level reference signal. I get that from the following:

[KF] “In these schemes, neuronal representations in higher or deeper levels of neuronal hierarchies generate predictions of representations in lower levels. These descending predictions are compared with lower-level representations to form a prediction error (usually associated with the activity of superficial pyramidal cells). This mismatch or difference signal is passed back up the hierarchy, to update higher representations (usually associated with the activity of deep pyramidal cells).”

This doesn’t sound like PCT at first sight. The circuitry described is certainly not the same as the Powers circuitry, but I will argue below that circuitry described this way could be functionally equivalent, and would resolve a nagging problem about the imagination loop that has come up from time to time on CSGnet, as well as a problem with our ability to perceive error consciously.

I don’t accept all the talk about “predictive coding” that follows this quote, at least not yet, though who knows, it might make PCT sense after a bit more thought. I suspect not, but I’ll wait and see.

I agree entirely that there is a formal equivalence between PCT and predictive coding. Within the confines of predictive coding, prediction refers to the prediction of the current sensory states (or expectations about hidden states in the level below in hierarchical predictive coding). When these descending predictions predict the sensory consequences of moving, they correspond to references that play the same role in PCT. These references are then fulfilled by peripheral reflexes.

[WM]: I don’t really know what Warren thinks “prediction” means, but clearly he doesn’t think it means “reference values”. I get the impression that because KF relates it to inference, Warren’s “prediction” might be the result of some analytic process.

[JK]: “Prediction” means “knowing what the future will bring, and not being wrong when that time arrives.” I suspect the background is the idea of directing behaviour so that actions will have a predicted effect on the changing environment, one of the “models” demolished in Powers and Bourbon “Models and their worlds”.

Strictly speaking, this form of prediction is not prediction in the sense of predictive coding. It is more about forecasting and prediction of the future. Clearly, this form of prediction is very important for PCT and optimal control theory but is not the prediction in predictive coding (although there is a formal equivalence between linear quadratic control and Kalman filter formulations of predictive coding). Clearly, there is a trivial sense in which one can predict the future in predictive coding if one can predict the current rate of change of hidden states accurately.

MT: “Prediction” means using observation A to reduce uncertainty about the result of future observation B, or equivalently, to use observation of variable X to reduce uncertainty about unobserved variable Y. This is an engineering style definition, which I prefer, perhaps because of my background. Using the word in this sense, I profoundly disagree with JK’s “My interpretation of BP’s aversion to “prediction” is that it does not allow for “many means to the same ends”, namely control.”

From the perspective of predictive coding, this sort of prediction is much more sophisticated. To accommodate this sort of prediction within a predictive processing framework, one would have to move to active inference. Active inference rests on reduction of uncertainty. The following paper expresses the reduction of uncertainty in terms of minimising expected free energy. This has formal links with (KL) optimal control theory and many other information theoretic formulations of optimal behaviour. One twist here is that it is easier to see these relationships using discrete state space models (as opposed to the continuous state spaces of predictive coding proper).


Prediction is an aid to control whenever there is a finite loop delay, or an equivalent integrator delay. How the hierarchy implements it has been a bone of contention on CSGnet, because it could be implemented in several ways that would be hard to disentangle by experiment: (1) by a higher-level control loop that takes variables X and Y (possibly X and dX/dt) as input, (2) by an explicit prediction applied to the perceptual or reference signal in the form of a multiplied version of the derivative of the perceptual signal added into the perceptual signal before the comparator or subtracted from the reference signal, or (3) by an output function that incorporates Bill Powers’s “Artificial Cerebellum” or something similar.
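Option (2) is the easiest of the three to write down. The sketch below is only an illustration under stated assumptions: the derivative is estimated by a finite difference over one time step, and the gain `k`, the step `dt` and the function names are hypothetical. It adds a multiplied derivative of the perceptual signal to the perceptual signal before the comparator.

```python
# Illustrative sketch of option (2); k, dt and the function names are assumptions.

def anticipated_perception(p_now, p_prev, dt, k):
    """Return p + k * dp/dt, using a finite-difference estimate of dp/dt."""
    dp_dt = (p_now - p_prev) / dt
    return p_now + k * dp_dt


def error_signal(r, p_now, p_prev, dt, k):
    """Comparator output e = r - (p + k * dp/dt)."""
    return r - anticipated_perception(p_now, p_prev, dt, k)


# Perception rose from 1.5 to 2.0 over 0.5 time units, so dp/dt is 1.0.
p_ahead = anticipated_perception(2.0, 1.5, 0.5, 0.25)
e = error_signal(3.0, 2.0, 1.5, 0.5, 0.25)
```

Because the comparator only sees r - p - k * dp/dt, adding the term to the perceptual signal and subtracting it from the reference signal give the same error, which is one reason the variants would be hard to separate experimentally.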

[As an aside, I don’t know why JK thinks “Bayesian Prediction” is an oxymoron. To me, it seems a completely natural word pair. Bayesian inference produces a statistical “model” which produces a “prediction” given input data. What is oxymoronic about that? Incidentally, when I was first transitioning as a graduate student from Engineering to Psychology (1958), a friend and I were asked to give a seminar talk on the then new Bush and Mosteller theory of memory. In our research for the talk, we found that the Hebbian kind of learning and similar data-based approaches to stable states did seem to mimic Bayesian inference. So I’m not averse to Bayesian Brain talk, except when it is extended to suggest that the brain does the Bayesian analysis explicitly rather than implicitly through its circuit structure.]

I agree entirely. Most work in my field does not deal with explicit Bayesian computations – but with the implicit Bayes optimality of perceptual processing and action selection.

Why is the KF circuit functionally equivalent to the Powers circuit, while resolving an issue about the imagination loop? If the KF “prediction” is actually the reference value, and what is reported up is (r-p), the “prediction error” which is “e” in many diagrams of the control loop, the upgoing error signal might well be merged with the “prediction” to form (r-e) which is actually p, just as in the Powers hierarchy. An alternate way of coming to the same result would be to send e up as input into the next level perceptual function but also include r as another input to that perceptual function, once again effectively passing p up the hierarchy. I use this version in the diagram below.

Now, let’s suppose that the lower-level controller is switched off, producing zero values for p and r. Then the effective perceptual value fed into the higher perceptual input function is just r. In the Powers proposal in B:CP (and I think unchanged subsequently), the “imagination loop” is implemented by two switches, one that diverts the higher-level output into the imagination connection instead of the lower-level reference, and one that switches the input to the higher perceptual function away from the lower perception to the imagination connection (top pair of panels in the diagram). There has to be some invisible controller that has the function of throwing these switches.

As described by Powers, this setup has another problem, because the lower-level loop also has to be switched off so that it does not produce its normal signal values. If it were not, it would now act like a top-level control loop with a fixed zero reference value and would continue to influence physical output to the external environment. So there are three switches to be flipped by this unmodelled controller, at least.

As I imagine it, the Friston connection requires no connector switches. All it needs is for the lower control unit to be switched off by some controller, so that the lower-level unit does not influence anything above or below it. None of the rest of the circuit changes, which simplifies the operation considerably when there are a lot of different perceptions that might be imagined or be derived from sensory data. The upper unit sees the reference value as its input, which is what Powers proposes should be the imagined value.
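The arithmetic behind this can be sketched in a few lines. This is an illustration, not a model of either circuit: it assumes that the upper perceptual function combines its two inputs as r - e, and that a switched-off lower unit sends no error upward (e = 0). The function names and the `active` flag are hypothetical.

```python
# Illustrative sketch only; the names and the e = 0 "off" behaviour are assumptions.

def lower_unit_error(r, p, active=True):
    """Error sent upward by the lower unit: e = r - p, or 0 when switched off."""
    return (r - p) if active else 0.0


def upper_input(r, e):
    """The upper perceptual function receives both r and e and forms r - e."""
    return r - e


r, p = 3.0, 1.25
perceiving = upper_input(r, lower_unit_error(r, p, active=True))
imagining = upper_input(r, lower_unit_error(r, p, active=False))
```

With the lower unit active the upper level effectively receives p, as in the Powers hierarchy; with it switched off the upper level receives r, the imagined value, and no connection has been rerouted.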

Yes, that sounds very sensible. In predictive coding, one can have connector switches. These are conceived of as precision weighting. This means that one can switch off ascending prediction errors by attenuating their precision. This corresponds to sensory attenuation:


There are many applications of this sensory attenuation, for example during imagination and sleep. It is also seen in things like the wake-sleep algorithm. We have made much of this in considering the functional role of sleep and imagination in reducing model complexity:


In relation to the architectures of PCT and hierarchical predictive coding, I see a deep similarity. Looking at the graphics that you sent us, I would conceive of the switches in PCT as attenuating or augmenting the precision of ascending prediction errors. The architectures in the graphics are a little bit redundant, from the point of view of predictive coding, because there are only two elements at each level (expectations p and error units e). However, if one merges the triangles and ovals, then there seems to be a very close correspondence. I think the key insight from the point of view of hierarchical predictive coding is that the perceptual expectation (p) at one level provides the prediction in the form of a reference (r) for the level below. In other words, if there were some function g generating predictions of p at level i, one would have:

p^(i) = r^(i) + e^(i)

r^(i) = g(p^(i+1))

These functions at each level constitute the generative model. At the lowest level, they generate predictions or references for reflexes to control the perceptual input – much along the lines of PCT. In other words, the role of the hierarchical generative model is to provide deeply informed predictions or references based upon a model of how the outside world generates perceptual input. I hope that this helps.
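As a rough illustration of the two relations above, the sketch below computes the descending references and the resulting prediction errors for a three-level hierarchy. The doubling function used for g, the list layout (index 0 is the lowest level) and the numerical values are assumptions made purely for the example; a real generative model would use a different g at each level.

```python
# Illustrative sketch only; g, the values and the list layout are assumptions.

def descending_references(p, g):
    """r[i] = g(p[i + 1]) for every level that has a level above it."""
    return [g(p[i + 1]) for i in range(len(p) - 1)]


def prediction_errors(p, r):
    """e[i] = p[i] - r[i], so that p[i] = r[i] + e[i]."""
    return [p_i - r_i for p_i, r_i in zip(p, r)]


def g(x):
    """Hypothetical generative mapping from one level to the level below."""
    return 2.0 * x


p = [1.0, 0.75, 2.0]              # expectations, lowest level first
r = descending_references(p, g)   # references for the two lower levels
e = prediction_errors(p, r)       # errors at the two lower levels
```

The top level has no reference in this sketch because nothing sits above it; each lower level satisfies p = r + e by construction.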

With very best wishes – Karl