The E. coli reinforcement models

[From Bill Powers (950610.0300 MDT)]

Bruce Abbott (9506xx) --

If you remember that your Ecoli programs were a successful model of
reinforcement theory, then either I didn't communicate properly or you
are misremembering. I think I can do better now.

Here is a block diagram of Ecoli3:

      N --[input function F = d/dt]--> S = dN/dt
      S is compared with the reference dN* = 0:
          S- (S < reference) --> increment PTS- --> PTS- sets the delay
          S+ (S > reference) --> decrement PTS+ --> PTS+ sets the delay
      delay --> TUMBLE --> new direction --> [environment] --> back to N

PTS+ == probability of tumble given S > reference (S+),
PTS- == probability of tumble given S < reference (S-).

If the nutrient rate of change dN/dt is negative, then PTS-, each time
it is used, is made larger and PTS- is also used to determine the delay
before the next tumble. If dN/dt is positive, then PTS+, each time it is
used, is made smaller and is also used to determine the delay. So PTS-
always goes toward max and PTS+ always goes toward min, and PTS- is
always chosen when going down the gradient and PTS+ is always chosen
when going up the gradient. The delay is therefore always shorter when
going the wrong way and always longer when going the right way. In other words, if
there is any "learning" in this system, what is to be learned is already
built in, and the same thing will be "learned" regardless of
circumstances. The changes in PTS+ and PTS-, and their limits, are
predetermined, not learned. All you did was start PTS+ and PTS- at equal
levels, and then made sure they could only change, and would in fact
change, toward the required levels.
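
To make the logic concrete, here is a minimal sketch in Python of the
Ecoli3 rule as described above; the names, step size, limits, and the
mapping from tumble probability to delay are illustrative assumptions,
not the actual program:

# Minimal sketch of the Ecoli3 tumbling logic described above.
# Parameter names, step size, limits, and the delay mapping are
# illustrative assumptions, not the actual program.

PTS_MAX, PTS_MIN, STEP = 0.9, 0.1, 0.05

pts_minus = 0.5   # tumble probability used when dN/dt < 0 (S-)
pts_plus  = 0.5   # tumble probability used when dN/dt > 0 (S+)

def delay_from_pts(pts):
    # Higher tumble probability -> shorter expected delay before the next tumble.
    return 1.0 / pts

def next_delay_ecoli3(dN_dt):
    global pts_minus, pts_plus
    if dN_dt < 0:                                   # S-: going down the gradient
        pts_minus = min(PTS_MAX, pts_minus + STEP)  # can only rise toward max
        return delay_from_pts(pts_minus)            # so this delay can only shorten
    else:                                           # S+: going up the gradient
        pts_plus = max(PTS_MIN, pts_plus - STEP)    # can only fall toward min
        return delay_from_pts(pts_plus)             # so this delay can only lengthen

Whatever happens, pts_minus can only rise and pts_plus can only fall,
which is the point made above: the outcome is built in beforehand, not
learned.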

In going to Ecoli4, you may have realized this deficiency, and also that
there was only a discriminative stimulus and no reinforcer in this
model. The block diagram of Ecoli4 now includes a "reward-or-punish"
routine, which uses the second derivative of N as the reinforcer R.

Here is a block diagram of Ecoli4:

      N --[input function F = d/dt]--> S = dN/dt
      S is compared with the reference dN* = 0:
          S- (S < reference) --> [inc or dec PTS-] --> PTS- sets the delay
          S+ (S > reference) --> [dec or inc PTS+] --> PTS+ sets the delay
      S --[d/dt]--> R = (dS/dt > 0) operates a two-pole switch:
          R+ : S- increments PTS-, S+ decrements PTS+   (as in Ecoli3)
          R- : S- decrements PTS-, S+ increments PTS+
      delay --> TUMBLE --> new direction --> [environment] --> back to N

S = discriminative stimulus
R = reinforcement: + = rewarding, - = punishing
N = nutrient

When both poles of the switch are in the R+ position, the overall function is
exactly the same as in Ecoli3. In other words, it is predetermined that
delays will be short when going the wrong way, and long when going the
right way. Reaching the maximum effect is slowed, but reaching the
correct effect is inevitable and built in beforehand.

When the change in dN/dt across a tumble is negative, however, the
switch will be thrown so that PTS- will be decremented just before being
used to determine the delay, and PTS+ will be incremented just before
being used. These changes are in the wrong direction for progressing up
a gradient. Therefore the added features of Ecoli4 reduce the speed with
which it will approach the fastest progress up the gradient. The
reinforcement effect makes "learning" slower than it would be without
the reinforcement effect.
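
Here is the same sketch with the Ecoli4 switch added (again with
illustrative names and step sizes, not the actual program): the sign of
the change in dN/dt across the last tumble, the reinforcer R, decides
whether the selected PTS value is moved in the Ecoli3 direction or the
opposite one just before it is used:

# Sketch of the Ecoli4 modification: the reinforcer R (sign of the change
# in dN/dt across the last tumble) throws the switch that decides whether
# the selected PTS value is incremented or decremented before use.
# Names and step sizes are illustrative, not the actual program.

PTS_MAX, PTS_MIN, STEP = 0.9, 0.1, 0.05
pts_minus, pts_plus = 0.5, 0.5

def delay_from_pts(pts):
    return 1.0 / pts    # higher tumble probability -> shorter delay

def next_delay_ecoli4(dN_dt, d_dN_dt_across_tumble):
    global pts_minus, pts_plus
    rewarding = d_dN_dt_across_tumble > 0       # R+: switch in the Ecoli3 position
    if dN_dt < 0:                               # S-: going down the gradient
        if rewarding:
            pts_minus = min(PTS_MAX, pts_minus + STEP)   # as in Ecoli3
        else:
            pts_minus = max(PTS_MIN, pts_minus - STEP)   # R-: wrong direction
        return delay_from_pts(pts_minus)
    else:                                       # S+: going up the gradient
        if rewarding:
            pts_plus = max(PTS_MIN, pts_plus - STEP)     # as in Ecoli3
        else:
            pts_plus = min(PTS_MAX, pts_plus + STEP)     # R-: wrong direction
        return delay_from_pts(pts_plus)

Whenever R comes out negative, the adjustment runs opposite to the
Ecoli3 direction, which is why the reinforcement path can only slow the
approach to the built-in end state.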

The only reason that adding the reinforcement path does not destroy
control completely is that the environmental geometry is centered on a
point-source of nutrient. If the gradient did not converge toward a
point (if it behaved as for a line source or a very distant point
source), the probability of the second derivative being positive would
be 50% regardless of the value of the first derivative. Then the switch
would spend as much time in the wrong position as the right position,
and PTS+ and PTS- would wander at random.
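
That claim is easy to check numerically. On a linear gradient N = k*x,
an organism moving at constant speed v in direction theta sees
dN/dt = k*v*cos(theta), so the change across a tumble is
k*v*(cos(theta_new) - cos(theta_old)); with tumble directions drawn
uniformly (a simplifying assumption), the sign comes out positive about
half the time:

import math, random

# Monte Carlo check of the claim above: on a nonconvergent (linear) gradient,
# the reinforcer R -- the change in dN/dt across a tumble -- is positive
# about half the time, so the Ecoli4 switch gets no usable information.
# Kinematics are simplified: constant speed, uniformly random tumble directions.

random.seed(1)
trials = 100_000
positive = 0
for _ in range(trials):
    theta_old = random.uniform(0.0, 2.0 * math.pi)
    theta_new = random.uniform(0.0, 2.0 * math.pi)
    if math.cos(theta_new) - math.cos(theta_old) > 0.0:
        positive += 1

print("P(R > 0) on a linear gradient:", positive / trials)   # comes out near 0.5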

Ecoli3, on the other hand, would progress up a nonconvergent or even a
divergent gradient as usual.

What makes your model work is the fact that it varies the delay
appropriately to control the rate of change of nutrient, keeping it
positive. The reason it does that is that it was modeled directly on the
successful control-system model. The logic of Ecoli3 is identical to the
logical structure of the control system model, although the mechanisms
for converting errors into appropriate delays are unnecessarily complex.
And in that successful model, there is nothing that corresponds to the
idea of reinforcement.
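
For comparison, here is a minimal sketch of that kind of control-system
logic, with a made-up point-source concentration function and
illustrative constants: the perception is dN/dt, the reference is zero,
and the only output is a tumble whenever the error goes negative. There
are no tumble probabilities to adjust and nothing corresponding to
reinforcement:

import math, random

# Minimal control-system sketch: perceive dN/dt, compare it with the
# reference dN* = 0, and tumble (pick a new random direction) whenever
# the error is negative.  The concentration function, speed, and step
# count are illustrative assumptions.

random.seed(2)
x, y = 10.0, 10.0                       # start away from the source at the origin
theta = random.uniform(0.0, 2.0 * math.pi)
speed, dt = 0.1, 1.0

def nutrient(x, y):
    return 1.0 / (1.0 + math.hypot(x, y))   # falls off with distance from origin

prev_N = nutrient(x, y)
for _ in range(5000):
    x += speed * math.cos(theta) * dt
    y += speed * math.sin(theta) * dt
    N = nutrient(x, y)
    dN_dt = (N - prev_N) / dt           # the controlled perception
    prev_N = N
    if dN_dt < 0.0:                     # error relative to dN* = 0
        theta = random.uniform(0.0, 2.0 * math.pi)   # tumble

print("final distance from source:", round(math.hypot(x, y), 2))

Run with these numbers it should end up near the source; the point is
only that the loop controls dN/dt directly, with nothing being
strengthened or weakened by its consequences.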

Your judgment that the reinforcement model "works" was based only on the
fact that the resulting behavior was correct: E. coli did approach the
target. But I have spent a lot of time in careful analysis of the logic
of your model, trying to understand why it does work, not just that it
does work. And I have found that what makes your model work has nothing
to do with reinforcement theory: adding explicit reinforcement in the
way you did _worsens_ the performance of the model.


---------------------------------------
When you were arguing that your model did work, you did not go through
the model as I have done to see whether it worked as you said it worked.
You went rapidly through some verbal arguments, but the clincher for you
was that the right result occurred: E. coli approached the target. I
think it would be instructive to speculate about why your verbal
arguments seemed sufficient, when in fact they glossed over fundamental
defects in the logic. I think it would be reasonable to say that you
simply couldn't believe that reinforcement theory would not work.
Assuming that it had to work, you didn't see any reason to go through
the details of your system and figure out what it would actually do
according to its own structure, instead of according to what you
expected and wanted it to do. Your reasoning, in fact, was driven by the
goal, being adjusted to make the perception match the goal-perception.

Obviously, I don't consider this to be a sin. It is a very common
phenomenon that goes a long way toward explaining the phenomenon of
belief. Belief is not just a passive perceptual phenomenon; it's an
active control process in which inputs are selected that will support
the belief. When we have reason to want a conclusion to be true, we can
construct logical arguments that are just complete enough to support the
desired result, but not so complete as to risk disproving it. We find
this phenomenon everywhere in human affairs, in science and everywhere
else. The only thing that permits science to exist at all is the fact
that there are some who wish for different conclusions, and select
different logic to support their belief. In the ensuing conflicts and
arguments, all sides are forced to look at their models in more detail
to see what they would actually do rather than what they are believed to
do. So all sides are eventually brought under the discipline of
mathematically correct logic, whether they want to be or not.
---------------------------------------------------------------------
Best,

Bill P.

[From Rick Marken (950610.0900)]

Bill Powers (950610.0300 MDT) to Bruce Abbott (9506xx) --

> If you remember that your Ecoli programs were a successful model of
> reinforcement theory, then either I didn't communicate properly or you
> are misremembering. I think I can do better now.

You are obviously at your best VERY early in the morning, Bill. This was a
very clear and helpful post -- both about the "reinforcement" model and
about the nature of belief itself. In the interests of keeping the dialog
on this VERY important topic as clear as possible, I will try to restrain
myself until Bruce has had a chance to reply. And I am still looking
forward to seeing his reinforcement model of the ratio schedule data you
posted (Bill Powers (950609.0910 MDT)). By the way, the first table of
numbers in that post was not labeled. I think it should have read:

Ratio    Reinforcement rate    Behavior rate
  1             210                 210
 40              90                3000

Best

Rick (The Kid) Marken