Operant; Reinforcement theory vs observation

[From Bill Powers (951212.1530 MST)]

Bruce Abbott (951212.1220 EST) --

     So, to use an analogy, if I claim that a "person" is a class of
     objects having certain properties in common, it is your claim that
     if I encounter you, and find you to have those properties, that I
     cannot thereby conclude that you are a member of the class
     "person." This is obviously wrong.

We aren't talking about properties, but effects of different actions on
a single variable. If you observe only the variable changing, you can
certainly say that at least one of the actions that can make it change
has occurred, but there is no indication of which one or ones. And there
is no way to tell, just by looking at an action, what variables it is
going to affect. The class of actions that can cause lever-pressing does
not have a fixed membership; the membership changes with every change in
the relationship between the active system and the variable that is
affected. An extension of a forelimb can depress a lever if the animal
is located above the lever, but if the animal is below the lever the
same action will have the opposite effect. So there is really no way to
define the operant _except_ in terms of the final effect.

In the operant defined as the class of actions that can cause a lever to
depress, therefore, we have to include all degrees and directions of
action that could ever, plausibly, make the lever go down. This means we
have to include each action and its opposite: at one moment, making the
lever go down might require moving the forepaw to the left and then
down, and at the next moment to the right and then down. So moving the
forepaw left and right become parts of the same operant.

The concept of "the operant" appears to solve the problem of what is
conditioned, but in fact utterly begs the question (that is, it assumes
the answer to the question in deriving the answer). What is
conditioned, apparently, is whatever direction, amount, and kind of
action is required at a given moment to make the immediate consequence
occur. What is conditioned is the act of moving left or moving right, or
extending a limb or retracting it, whichever is required at the moment.

With the problem of what _motor action_ becomes conditioned thus
disposed of, we are left with the statement that the immediate
consequence causes the contingent consequence. This is, of course, only
a statement about the apparatus. If the key contacts close the required
number of times, or within the required interval, as specified by the
schedule, the reinforcer will be delivered. This we could have determined
without needing the organism
to be present.

Skinner's reasoning must have gone something like this. There are many
actions that can make the lever go down. When one of those actions
occurs and the lever goes down (enough times, or in the right way), a
reinforcement occurs and the tendency to produce that particular action
is strengthened. All we can predict, therefore, is that one of the
actions that could make the lever go down will become a conditioned
response. We can only say in advance that the _operant_ will become
conditioned (meaning, one action within the class of actions that have
the right effect on the lever).

In stereotyped behavior, Skinner thought he saw the vindication of this
view. If the animal depressed the lever by sitting on it, sitting on the
lever became the conditioned response. So clearly, a specific act within
the operant class had become conditioned. However, Skinner failed to see
that "sitting on the lever" is only a category that includes many
different detailed motor acts from one instance to the next, and that it
is brought about by motor acts that can be different in amount and even
direction from one instance to the next. And here there can be no
stereotyped actions. If the animal is to the left of the lever, it
_must_ move to the right by exactly the right distance to sit on the
lever -- and that distance, as well as the direction, will differ on
every trial, although you may sometimes have to look closely to see the
differences. Animals are not precision machines; even in a totally fixed
environment, the variations in their own actions when they are supposedly
just repeating a simple act are enough to require continual corrections
if the same outcome is to repeat. The
purposiveness of behavior is always evident in the details.

There's much to discuss here, but I want to focus on one point that is
probably the basic bone of contention, at least as I see it. It's the
question of whether reinforcement is an empirical fact. So I'm skipping
to

Bruce Abbott (951211.1830 EST)

Your four-step procedure was

1. Observe rate of operant prior to establishing contingency between
   putative reinforcer and operant.

2. Put contingency into effect; observe rate of operant. If operant
   rate increases, consequence MAY be a reinforcer.

3. Remove contingency; observe rate of operant. If operant rate
   decreases, consequence is PROBABLY a reinforcer.

4. Repeat steps 2 and 3 several times to determine whether the changes
   observed in steps 2 and 3 are REPLICABLE. If they are, consequence
   IS a reinforcer.

We can reduce these verbal statements to more mathematical ones.
Evidently, you are defining the contingent consequence C as a reinforcer
if the change in the rate of behavior B is proportional to the rate of
occurrence of C when the contingency is present, and if B tends toward
zero when the contingency is absent.

When the contingency is present,

(1) C = s(B),
where s is the schedule function.

When the contingency is absent, C = 0.

B is behavior rate, C is the rate of occurrence of the contingent
consequence. The rate of change of behavior rate is dB/dt. If C is a
reinforcer,

dB/dt = k*C (to a first linear approximation),

meaning that the rate of behavior increases with time in proportion to
the reinforcement rate. If the reinforcement rate is zero, the behavior
rate remains constant. In terms of events, each occurrence of C
increases the rate of behavior by some small amount k.
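
(A small step I am adding for clarity: integrating this law gives

B(t) = Bo + k*(number of reinforcements delivered so far),

so under this law alone the behavior rate can only hold steady or move
upward; nothing yet makes it fall, which is why a decay term is needed
next.)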

However, you also specify that when the contingency is absent we should
observe B decaying toward zero. Assuming an exponential decay, we can
write

(2) dB/dt = k1*C - k2*B

where k1 is the change of behavior rate per reinforcement
      k2 is the decay constant

Putting equations (1) and (2) together, we get

dB/dt = k1*s(B) - k2*B

So far this seems straightforward enough.

For a simple ratio schedule,

s(B) = B/m,

the equation reduces to

dB/dt = K*B, where

   K = k1/m - k2.

There are two solutions:

(a) B = 0

(b) B = Bo*exp(Kt), with Bo > 0

So either the behavior and reinforcement rates remain at zero when the
contingency is turned on, or they both increase as a positively-
accelerated exponential with time. Any slight perturbation of B away
from zero will produce the positively-accelerated exponential behavior.

If k1/m < k2, K is negative, and we have one solution:

B = Bo*exp(-ut), where u is abs(K)

Whatever the initial value of B, Bo, B will decay exponentially to zero.
Once B has reached zero, it will remain at zero whether the contingency
is present or absent.
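
To make the two regimes concrete, here is a minimal numerical sketch
(my own illustration, not part of Bruce's procedure or of the algebra
above; the parameter values are arbitrary) that integrates
dB/dt = k1*s(B) - k2*B with the ratio schedule s(B) = B/m:

# Sketch: Euler integration of dB/dt = k1*s(B) - k2*B, with s(B) = B/m.
# Illustrative parameters only; what matters is the sign of K = k1/m - k2.

def simulate_reinforcement(k1, k2, m, B0, dt=0.1, steps=600):
    B = B0
    for _ in range(steps):
        C = B / m                  # reinforcement rate on the ratio schedule
        dB = k1 * C - k2 * B       # each reinforcement increments behavior rate
        B = max(B + dB * dt, 0.0)  # behavior rate cannot go negative
    return B

print(simulate_reinforcement(k1=0.5, k2=0.01, m=10, B0=1.0))   # K = +0.04: runaway growth
print(simulate_reinforcement(k1=0.05, k2=0.01, m=10, B0=1.0))  # K = -0.005: decay toward zero

Either the behavior rate blows up exponentially or it dies away; this
law by itself never settles at a stable nonzero rate.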

-------------------------------
The above derivation began with the assumption that B depended on C in
the manner prescribed by the definition of reinforcement: each
occurrence of C produced an increment in the rate of behavior. The
mathematical consequence of that assumption was a behavior that
increases exponentially with time or remains at zero, depending on the
assumed system constants.

Now what if we assume a different law for the effect of C on B? Let us
say that

dB/dt = k1*(Co - C)

This says that the rate of change of behavior is affected negatively by
the contingent consequence and positively by a system constant Co, with
a proportionality factor k1. Using the same ratio schedule as before for
s(B), the equation becomes

dB/dt = k1*(Co - B/m)

This gives us a single solution. B changes from whatever its initial
value is, along a negatively-accelerated curve, until C reaches an
asymptote at Co and B reaches an asymptote at m*Co.
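
For reference (a step I am adding; it follows by elementary integration
of this first-order linear equation), the explicit solution is

B(t) = m*Co + (Bo - m*Co)*exp(-(k1/m)*t),

so from any starting rate Bo, B approaches m*Co, and C = B/m approaches
Co, along a negatively-accelerated exponential.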

In fairness I have to point out that if the schedule is turned off so C
= 0, something must change the system constant Co to zero, in order to
get the decay of B to zero that is observed. However, this model
predicts that in the time just after the contingency is turned off and
before Co is set to zero, there will be a momentary large increase in
the behavior rate. This, I believe, is commonly observed.
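
Here is a second minimal sketch (again my own illustration, with
arbitrary parameters) of dB/dt = k1*(Co - C). With the contingency on,
B rises along a negatively-accelerated curve toward m*Co; when the
contingency is removed, C drops to zero while Co is still nonzero, so
dB/dt jumps to k1*Co and the behavior rate shows the brief surge just
described. (The later decay of B is not simulated here; as noted, it
requires Co itself to change.)

# Sketch: Euler integration of dB/dt = k1*(Co - C).
# Contingency on: C = B/m (ratio schedule).  Contingency off: C = 0.

def simulate_pct(k1=0.5, m=10, Co=2.0, B0=0.0, dt=0.1,
                 steps_on=600, steps_off=50):
    B, history = B0, []
    for step in range(steps_on + steps_off):
        C = B / m if step < steps_on else 0.0
        B += k1 * (Co - C) * dt
        history.append(B)
    return history

h = simulate_pct()
print(h[599])   # approaching the asymptote m*Co = 20
print(h[-1])    # above the asymptote: the surge after the contingency ends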
------------------------------------
What is interesting about this development is that if we assume the
_relationship_ between C and B that is supposed to define the
_phenomenon_ of reinforcement, we do not get the behavior that is
expected. But if we assume a _different_ relationship between C and B,
we now _do_ get the kind of behavior expected in a reinforcement
situation.

We have found that the statement "reinforcement produces an increment in
behavior" leads to behavior that does not match what happens in
experiments, whereas the statement "The change in behavior rate is
proportional to the difference between a reference value Co and the
actual value of the contingent consequence C" leads to behavior that
changes from any
starting rate to a final rate in a negatively-accelerated exponential
way -- which is what is observed.

The conventional definition of reinforcement as something that produces
an increment in behavior actually leads to a prediction of either zero
behavior or a
positive-feedback runaway condition. The difference between what does
happen (an approach to an asymptote) and the implication of the
definition of reinforcement (runaway) has, I assume, been recognized.
This difference, however, has not been attributed to an error in the
assumed mechanism of reinforcement. Instead, it has been assumed to be
due to failure to take into account a second phenomenon: satiation. The
basic conclusion has been that there would indeed be a runaway
condition, but as the reinforcement rate rises, satiation begins to
increase nonlinearly, until it is sufficient to stop the runaway.
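
Purely to illustrate that argument (the satiation term below is a
hypothetical form of my own, not one taken from the literature), one
can let the increment per reinforcement shrink as the reinforcement
rate rises:

# Sketch: the runaway law with a hypothetical saturating gain,
# dB/dt = k1*C/(1 + C/Cs) - k2*B, with C = B/m as before.

def simulate_satiation(k1=0.5, k2=0.01, m=10, Cs=1.0, B0=1.0,
                       dt=0.1, steps=10000):
    B = B0
    for _ in range(steps):
        C = B / m
        dB = k1 * C / (1.0 + C / Cs) - k2 * B   # effective gain falls as C rises
        B = max(B + dB * dt, 0.0)
    return B

print(simulate_satiation())   # settles near B = 40 instead of running away

With these numbers the early course is the same runaway as before, but
the declining gain halts it at a finite rate -- which is the kind of
rescue the satiation argument provides, at the cost of an extra
assumption.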

In more complex schedules like a VI schedule, there is a natural
nonlinearity in s(B) which, with just the right values of the system
constants, can produce runaway at low levels of behavior but an approach
to an asymptote at high levels, without requiring the introduction of
satiation to save the theory. However, the alternative law given above
also produces the same result, but does not depend so delicately on
assuming just the right values of system parameters -- and it works for
ratio schedules as well.

The real problem with the conventional model is that the wrong law has
been assumed. The observed behavior is NOT implied by the assumption
that an increment in reinforcement produces an increment in behavior.
The cure for the failure to predict is not to add new phenomena ad hoc,
but to use a law that _does_ lead to the observed behavior: the PCT
model.

The difference between reinforcement theory and PCT comes down to the
two models assumed:

Reinforcement: dB/dt = k1*C - k2*B

PCT: dB/dt = k1*(Co - C)

You will notice that C enters with opposite signs in these two models.

By this analysis, we can see that the statement "reinforcement causes an
increment in behavior rate," which is a definition of reinforcement, is
incompatible with the observation, "Behavior and reinforcement rate rise
to an asymptotic value", which is roughly how the _phenomenon_ called
reinforcement is described. The PCT model shows how both behavior rate
and reinforcement rate can rise to an asymptote while the reinforcement
rate always has a _negative_ effect on the behavior rate.
-----------------------------------------------------------------------
Best,

Bill P.