E. coli "reinforcement" model revisited

[From Bruce Abbott (970813.0640)]

I have changed the title, as it appears that we have digressed into a very
old thread.

Bill Powers (970812.2212 MDT) --

Bruce Abbott (970812.1845 EST)

It's pointless to argue about who said what when if we don't dig up the
archives and document the claims. Better to go to the source and see what
we're talking about.

I agree. Let's look in the archives, as I suggested we do.

Here is the description of one of your operant models:

* Created: 11/14/94 *

So we need to be searching the archives in late '94 and possibly early '95.
Why don't you go ahead and do that? I've been called to potentially serve
as a juror and must appear in court today beginning at 8 am for the
selection process. I expect this to take the whole day; meanwhile you could
be checking on the facts.

Regards,

Bruce

[From Bill Powers (970813.0729 MDT)]

While you're involved with the jury system, I'll look at some of the
archives as you suggest.

Starting with posts around 941114, I find that the program with this date
follows an insight you had (while driving somewhere) that corrected many of
the problems you were having with previous versions of your E. coli operant
conditioning program. After trying this program, I wrote (941116):

--------------------------------------------------------------------------
Got your new code for ecoli4a, compiled it, and ran it. An interesting
feature that you may have noticed: ps+ (pardon my shorthand) approaches
zero (actually 0.005 but let's call it zero) and ps- approaches 1. This
happens relatively quickly.

Suppose we start with these probabilities at their limits. Then we can
understand the following code segment very easily:

if dNut > 0 then { S+ present; tumble probability determined by S+ }
    begin
      if (Random < pTumbleGivenSplus) then DoTumble
      else JustTumbled := false;
    end
  else { S- present; tumble probability determined by S- }
    begin
      if (Random < pTumbleGivenSminus) then DoTumble
      else JustTumbled := false;
    end;
  NutCon := NewNut;

Note that in the limit, pTumbleGivenSplus goes essentially to zero, and
pTumbleGivenSminus goes to 1. This means that in the first clause of the
overall "if" statement,

if dNut > 0 then { S+ present; tumble probability determined by S+ }
    begin
      if (Random < pTumbleGivenSplus) then DoTumble
      else JustTumbled := false;
    end

... there will never be a tumble (that is, "random" will never be less than
0). Thus if dNut > 0 there will never be a tumble -- until the path passes
a right angle to the target and dNut becomes negative.

In the second clause

  else { S- present; tumble probability determined by S- }
    begin
      if (Random < pTumbleGivenSminus) then DoTumble
      else JustTumbled := false;
    end;

... pTumbleGivenSminus is 1 in the limit, so when dNut <= 0, there will
always be a tumble immediately.

Therefore, when the probabilities reach their limits, the above code
segment is closely equivalent to

if dNut <= 0 then DoTumble.

Now we can understand why the final approach to the target is so rapid and
the final position stays so close to the target: the model has approached
the condition in which there is a tumble for any movement down the gradient
and none for any movement up the gradient, as in the simplest PCT model.
------------------------------------
The question is now why the two probabilities tend so rapidly and
systematically toward 0 and 1. The implication is that NutSave, the
previous value of dNut, predicts the next value of dNut. But we know that
this is not true. Random means random: whatever the current value of dNut,
the next value can be anything from the maximum positive to the maximum
negative, and the most probable value is zero.

The logic here is extremely complex, but there is a simple way to see
whether the previous value of dNut is really acting as a reinforcer. Change
the sign of dNut that is saved as NutSave. That is, in

  procedure DoTumble;
  begin
    Tumble(Angle);
    JustTumbled := true;
    NutSave := dNut;  { NutSave is nutrient rate of change
                        immediately after a tumble }
  end;

change the last statement to NutSave := -dNut.

We see the probabilities change in the same directions as before, although
now they do not come as close to the limits as before. The model still
progresses toward the target.

Going even farther, we can write

   NutSave := dNut * 2.0 * (random - 0.5);

... with the same result, only now the probabilities do eventually reach
the limits of 0.005 and 1.0.

The quickest results of all come from simply randomizing NutSave:

  NutSave := random - 0.5;

So what is making the probabilities change is not any systematic effect of
NutSave, but something else about the nonlinear and circular geometry of
this situation, combined with the complex logic. I really don't have the
faintest idea why the net effect is swimming up the gradient. The situation
is too complex for me to see how to apply control theory. But it is clear
that the reason is NOT an effect of the previous value of dNut, or of the
change in value across a tumble.
---------------------------------------------------------------------------
[Me writing now]
As you can see, I accepted that your model ran and produced the right
results (qualitatively). At the time I still didn't understand the logic
fully, but even then, by experimenting with the program, I was able to
demonstrate that the "reinforcement" was just getting in the way. The model
converged more quickly when the reinforcing effect was randomized out. At
that time I didn't see that eliminating the reinforcing effect altogether
would work best of all.

Why was I doing this? You had a model that ran and produced the E. coli
effect. Wasn't that enough to validate the reinforcement model? Obviously,
for me it wasn't enough. I had to understand how the model worked. In an
empirical way, I played around with the parts of the model to see which
parts were essential and which weren't. I came close to the right answer,
but still didn't see exactly what the reason was. Now I can say that the
reinforcement aspect of the model is superfluous; the model works even
better without it (including learning).

This leads me to venture a generalization: in _all_ applications of
reinforcement theory, the concept of reinforcement is excess baggage. Given
a behavior that produces a consequence of a certain type, we may observe
that the frequency of that behavior increases, and therefore that the
frequency of the consequence increases. What we do not observe is WHY the
frequency of the behavior and the consequence increases. All we really
observe is that the probability density of the behavior increases, and as a
result the probability density of all its consequences increases. That, and
not reinforcement, is the empirical observation. Reinforcement is a
theoretical notion offered to explain WHY these probabilities increase, and
it is not supported by any evidence.

In the Ecoli model I cited yesterday (which seems to be a still later
version of the one mentioned above) it is clear that if you leave out the
so-called reinforcing condition (dNut - NutSave positive, etc.), convergence
will be faster. This suggests that a different model of learning, in which
the "wrong" cases are minimized or eliminated, would work even better. Once
we open the door to other theories of learning, reinforcement theory can be
seen for what it is: an attempt to keep control of behavior in the
environment where, according to certain philosophical positions, it belongs.

Best,

Bill P.

[From Rick Marken (970813.0930)]

Bill Powers (970813.0729 MDT) --

In the Ecoli model I cited yesterday...reinforcement theory can
be seen for what it is: an attempt to keep control of behavior
in the environment where, according to certain philosophical
positions, it belongs.

A gorgeous observation. And it also suggests some nice experiments
that could be done to compare reinforcement to control models
of people doing the "E. coli" (random consequences) experiment.
I leave the development of such experiments as an exercise for
stalwart opponents of reinforcement theory -- like Bruce Abbott.

Best

Rick

--
Richard S. Marken Phone or Fax: 310 474-0313
Life Learning Associates e-mail: rmarken@earthlink.net
http://home.earthlink.net/~rmarken

[From Bruce Gregory (9708131245 EDT)]

Rick Marken (970813.0930) --

Bill Powers (970813.0729 MDT) --

> In the Ecoli model I cited yesterday...reinforcement theory can
> be seen for what it is: an attempt to keep control of behavior
> in the environment where, according to certain philosophical
> positions, it belongs.

A gorgeous observation. And it also suggests some nice experiments
that could be done to compare reinforcement to control models
of people doing the "E. coli" (random consequences) experiment.
I leave the development of such experiments as an exercise for
stalwart opponents of reinforcement theory -- like Bruce Abbott.

I am reminded of the insightful observation of the well-known
team of philosophers of science, Simon and Garfunkel:

"A man sees what he wants to see and disregards the rest..."

Bruce

[From Bruce Abbott (970813.1520 EST)]

The jury was impaneled before they got to me, so I'm home early.

Bill Powers (970813.0729 MDT) --

While you're involved with the jury system, I'll look at some of the
archives as you suggest.

Great!

Starting with posts around 941114, I find that the program with this date
follows an insight you had (while driving somewhere) that corrected many of
the problems you were having with previous versions of your E. coli operant
conditioning program. After trying this program, I wrote (941116):

Got your new code for ecoli4a, compiled it, and ran it. An interesting
feature that you may have noticed: ps+ (pardon my shorthand) approaches
zero (actually 0.005 but let's call it zero) and ps- approaches 1. This
happens relatively quickly.

. . .

Now we can understand why the final approach to the target is so rapid and
the final position stays so close to the target: the model has approached
the condition in which there is a tumble for any movement down the gradient
and none for any movement up the gradient, as in the simplest PCT model.

Yep. Up to this point you were on target. What comes next was completely
wrong:

The question is now why the two probabilities tend so rapidly and
systematically toward 0 and 1. The implication is that NutSave, the
previous value of dNut, predicts the next value of dNut. But we know that
this is not true. Random means random: whatever the current value of dNut,
the next value can be anything from the maximum positive to the maximum
negative, and the most probable value is zero.

Everything else you said in that post followed from this mistake. Because
you concluded (incorrectly) that the model didn't perform as I had
described, you looked around for some accidental feature of the situation
that would produce the results actually observed during the program's
action, suggesting that something -- the nonlinear geometry? -- the
circular geometry? both together? -- must be responsible for this result.
This pure speculation did not emerge from a correct understanding of the
program's logic, but from an inability to comprehend how it could work at
all, given your flawed understanding.

When I said that you did not initially accept that the program worked as
advertised, I was not referring to the fact that the model E. coli converged
on the nutrient source. That much was indisputable. I was referring to
your failure to accept my description of _how_ it worked. The quoted
portion of your post above shows that my memory of this exchange was correct.

I don't have access to this archived material here at home. What was my reply?

[Me writing now]
As you can see, I accepted that your model ran and produced the right
results (qualitatively). At the time I still didn't understand the logic
fully, but even then, by experimenting with the program, I was able to
demonstrate that the "reinforcement" was just getting in the way. The model
converged more quickly when the reinforcing effect was randomized out. At
that time I didn't see that eliminating the reinforcing effect altogether
would work best of all.

Why was I doing this? You had a model that ran and produced the E. coli
effect. Wasn't that enough to validate the reinforcement model? Obviously,
for me it wasn't enough. I had to understand how the model worked. In an
empirical way, I played around with the parts of the model to see which
parts were essential and which weren't. I came close to the right answer,
but still didn't see exactly what the reason was. Now I can say that the
reinforcement aspect of the model is superfluous; the model works even
better without it (including learning).

You are still reconstructing history to fit your thesis, Bill. You didn't
understand the logic correctly at that time, and judging from the above, you
have completely forgotten the correct understanding you finally arrived at
after numerous exchanges that followed the post you quoted from. You didn't
"come close to the right answer," your attempt to understand the
reinforcement portion of the model failed completely.

Since you have access to the archive, how about reviewing the entire
exchange (up to the point where you finally accepted my analysis) and giving
us a report?

Regards,

Bruce

[From Bruce Abbott (970813.1530)]

Bruce Gregory (9708131245 EDT) --

Rushing to judgment before the facts are all in, Bruce remarks:

I am reminded of the insightful observation of the well-known
team of philosophers of science, Simon and Garfunkel:

"A man sees what he wants to see and disregards the rest..."

I agree -- when one is highly motivated to do so. But you should keep in
mind another insightful observation of another well-known philosopher, Edgar
A. Poe:

"Ask not for whom the bell tolls; it tolls for thee."

Who is seeing what he wants to see?

Regards,

Bruce

[From Bruce Gregory (970813.1640 EDT)]

Bruce Abbott (970813.1530) --

>Bruce Gregory (9708131245 EDT) --

Rushing to judgment before the facts are all in, Bruce remarks:

>I am reminded of the insightful observation of the well-known
>team of philosophers of science, Simon and Garfunkel:

>"A man sees what he wants to see and disregards the rest..."

I agree -- when one is highly motivated to do so. But you should keep in
mind another insightful observation of another well-known philosopher, Edgar
A. Poe:

"Ask not for whom the bell tolls; it tolls for thee."

Who is seeing what he wants to see?

You, I guess. The quotation is from John Donne.

Bruce

[From Rick Marken (970813.1410)]

Bruce Abbott (970813.1520 EST) to Bill Powers (970813.0729 MDT) --

You didn't understand the logic correctly at that time, and
judging from the above, you have completely forgotten the
correct understanding you finally arrived at after numerous
exchanges that followed the post you quoted from.

I'd love to hear the "correct understanding" myself. It sure
sounded like Bill had it spot on: You built a control model
with some "reinforcing" properties that made it less efficient
than a pure control model (the environment was holding it back a
bit). But the model was not a complete failure because the
most detrimental aspects of reinforcement -- the increase in
the probability of a tumble that would occur when movement up
the gradient was reinforced -- could have little effect since
the gradient rarely improves (becomes reinforcing) after an
up-gradient tumble.

Best

Rick

--
Richard S. Marken Phone or Fax: 310 474-0313
Life Learning Associates e-mail: rmarken@earthlink.net
http://home.earthlink.net/~rmarken

[From Bill Powers (970813.1425 MDT)]

Bruce Abbott (970813.1520 EST) --

Now we can understand why the final approach to the target is so rapid and
the final position stays so close to the target: the model has approached
the condition in which there is a tumble for any movement down the gradient
and none for any movement up the gradient, as in the simplest PCT model.

Yep. Up to this point you were on target. What comes next was completely
wrong:

The question is now why the two probabilities tend so rapidly and
systematically toward 0 and 1. The implication is that NutSave, the
previous value of dNut, predicts the next value of dNut. But we know that
this is not true. Random means random: whatever the current value of dNut,
the next value can be anything from the maximum positive to the maximum
negative, and the most probable value is zero.

Everything else you said in that post followed from this mistake. Because
you concluded (incorrectly) that the model didn't perform as I had
described, you looked around for some accidental feature of the situation
that would produce the results actually observed during the program's
action, suggesting that something -- the nonlinear geometry? -- the
circular geometry? both together? -- must be responsible for this result.
This pure speculation did not emerge from a correct understanding of the
program's logic, but from an inability to comprehend how it could work at
all, given your flawed understanding.

At that time, I didn't understand the logic. You are quite right. Now I
know that the change in dnut across a tumble was being treated as a
reinforcement. In my post yesterday I gave the "correct" interpretation.

When I said that you did not initially accept that the program worked as
advertised, I was not referring to the fact that the model E. coli converged
on the nutrient source. That much was indisputable. I was referring to
your failure to accept my description of _how_ it worked. The quoted
portion of your post above shows that my memory of this exchange was correct.

Now that I understand the logic, I can make up my own mind _how_ it works.
The program speaks for itself.

I don't have access to this archived material here at home. What was my reply?

Rick got into the act, and you countered by explaining that when dnut was
positive (using the program terms), the average dnut following a tumble
would be zero, which is a net decrease in dnut, showing that a tumble
following a positive dnut will most probably produce a less positive dnut.
That was the main point I did not yet understand -- and Rick
didn't understand it, either.
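
To see that point in numbers, here is a minimal sketch (a hypothetical
stand-alone program, not code from either of our listings; it assumes unit
speed and a locally linear gradient, so that dNut is just the cosine of the
angle between the current heading and the up-gradient direction, and it
assumes a tumble picks a new heading at random):

  program TumbleAverage;
  const
    Trials = 100000;
  var
    i, count                    : longint;
    headingBefore, headingAfter : real;
    dNutBefore, dNutAfter       : real;
    sumBefore, sumAfter         : real;
  begin
    Randomize;
    count := 0;
    sumBefore := 0.0;
    sumAfter := 0.0;
    for i := 1 to Trials do
    begin
      headingBefore := 2.0 * Pi * Random;   { heading before the tumble }
      dNutBefore := cos(headingBefore);
      if dNutBefore > 0 then                { moving up the gradient (S+) }
      begin
        headingAfter := 2.0 * Pi * Random;  { tumble: new random heading }
        dNutAfter := cos(headingAfter);
        sumBefore := sumBefore + dNutBefore;
        sumAfter := sumAfter + dNutAfter;
        count := count + 1;
      end;
    end;
    writeln('Mean dNut before tumble (given dNut > 0): ', sumBefore / count : 8 : 4);
    writeln('Mean dNut after tumble  (given dNut > 0): ', sumAfter / count : 8 : 4);
  end.

The post-tumble average comes out near zero while the pre-tumble average is
clearly positive, which is the net decrease described above.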

On 941107 you said:
-------------------------------------------------------------------------
Yet traditional reinforcement theory has the all-important details wrong, and
this has led to 100 years of confusion and to several recent attempts at
correction (added epicycles?) by theorists such as Meehl, Premack, Allison,
Timberlake, Staddon, and several others. In failing to see the organism as a
perceptual control system governed by negative feedback, theorists have
committed serious blunders, including an analysis wedded to discrete events
rather than to relationships among continuous variables, a misunderstanding of
the role of "reinforcement" or "punishment" as feedback (with its attendant
confusion of the meanings of "positive" and "negative" feedback), and a
failure to distinguish the roles of these changes in variable-states in
learning versus the maintenance of behavior.

The problem with the Ptolemaic system was not that it failed utterly to
account for the apparent motions of the planets--in fact it did an excellent
job. The problem was that its assumptions were ad hoc and fundamentally
wrong. The same is true of traditional reinforcement theory.

The problem with your E. coli demonstration is that it attempts to attack
traditional reinforcement theory in an area where it has the relationships right
(although clumsily described in the language of discrete events). In
attempting to demonstrate this fact, I am not defending traditional
reinforcement theory as the correct account. Wrong models can give correct
results, and this is especially true when the results in question are
precisely those that the model was developed to explain. As you have
correctly surmised, the challenge is no longer to show how TRT can explain the
learning of human participants in the e. coli simulation, but rather to
construct a PCT model that does the same. Rather than asserting that the
empirical Law of Effect does not work (under many circumstances it does), we
need to be forcefully demonstrating how PCT provides a superior explanation
for why, under those circumstances, it does and why, under other
circumstances, it does not.
-----------------------------------------------------------------------------
I am still not convinced that reinforcement theory actually has the
relationships right in the E. coli model.

You are still reconstructing history to fit your thesis, Bill. You didn't
understand the logic correctly at that time, and judging from the above, you
have completely forgotten the correct understanding you finally arrived at
after numerous exchanges that followed the post you quoted from. You didn't
"come close to the right answer," your attempt to understand the
reinforcement portion of the model failed completely.

I think you are failing to understand my new explanation of my arguments,
which is now based on a correct understanding of the operant model. Two
years ago I realized -- approximately -- that the reinforcement aspect of
the operant model was yielding the right results for the wrong reasons, as
you said. I didn't have the "wrongness" right, because I was still muddled
about the logic.

I have recently tried my "empirical" approach again, this time simply
setting Nutsave to zero -- and the model converges toward the goal still
faster. It will converge fastest of all when the reinforcement aspect is
eliminated altogether. I didn't realize, two years ago, that this would be
true, although if you read my commentary on your program that I posted
yesterday you will see that I came within a hair's breadth of seeing it.

Unfortunately, we drifted away from the main thread and got off into
multiple targets, two-level systems, and all that, so the main issue was
just sort of abandoned. I think what we need is a clean program which
eliminates all the unnecessary parts of your fourth version, and just shows
an E. coli heading from the edge of the screen toward a target. Then I think
I will be able to show you what I'm talking about -- now that I DO
understand the logic.

To put what I am saying in a nutshell, I am claiming that selecting between
"reward" and "punishment" on the basis of (dnut - nutsave) accomplishes
nothing except the introduction of a fairly large fraction of changes in
the probabilities that go the wrong way. If we simply say that the
pr{tumble|S-} increases when dnut is negative, and pr{tumble|S+} decreases
when dnut is positive, we will have captured all the empirical
observations, without introducing the idea of reinforcement at all. If we
start from there and THEN bring in the reinforcement relationships as in
your model, it will be clear that the result is to SLOW the learning.
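
To make that concrete, here is a minimal sketch of the simpler rule (the
procedure name, the learning-rate constant, and the update form are
hypothetical choices for illustration, not code from the actual program;
only the 0.005 and 1.0 limits are taken from the earlier listing). The sign
of dnut alone drives the two probabilities toward their limits, with no
comparison against NutSave:

  procedure UpdateTumbleProbabilities(dNut : real;
                                      var pTumbleGivenSplus,
                                          pTumbleGivenSminus : real);
  const
    LearnRate = 0.05;   { assumed step size }
    pMin      = 0.005;  { lower limit, as in the earlier listing }
    pMax      = 1.0;    { upper limit }
  begin
    if dNut > 0 then
      { S+ present: make a tumble less likely }
      pTumbleGivenSplus := pTumbleGivenSplus
                           - LearnRate * (pTumbleGivenSplus - pMin)
    else
      { S- present: make a tumble more likely }
      pTumbleGivenSminus := pTumbleGivenSminus
                            + LearnRate * (pMax - pTumbleGivenSminus);
  end;

Layer the (dnut - nutsave) comparison on top of this and some of those
adjustments get withheld or reversed; that is the slowing I am talking about.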

I think it's important to get this settled, whoever has to give in. I think
that the thesis you were putting forth -- that reinforcement theory is
descriptively correct -- is an anchor that keeps you from moving all the
way to PCT. I am contending that the idea of reinforcement adds nothing to
our understanding of the learning process, and in fact (when introduced
into a model) creates difficulties. So let's just keep gnawing on that bone
until it's disposed of.

Best,

Bill P.