The Domains of Reinforcement Theory and PCT

[From Bruce Abbott (950623.1135 EST)]

Bill Powers (950619.1455 MDT) --

Suppose we thought of B and R as having variable magnitudes. Now we can
have more or less of B, and more or less of R. This allows R to have an
effect on B in terms of magnitude. If we assumed that an increase in R
causes an increase in B and that R is proportional to B, then we would
have

R1 = k1*B1
B2 = B1 + k2*R1
R2 = k1*B2
B3 = B2 + k2*R2,

and so on. From this we would quickly conclude that the behavior B will
either get larger and larger without bounds (k1*k2 > 1) or smaller and
smaller until it is zero (k1*k2 < 1).

This model has no stable states other than infinity or zero. This is not
what we want as a model of learning, either.

It is easy to modify the model, however, so that it no longer tends to
infinity. For example, think of reinforcement as being equivalent to
charging a capacitor. In this case the increment in response strength is
proportional to the difference between its present strength and some maximum
possible strength.
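
To make the idea concrete, here is a minimal numerical sketch of that
bounded-increment rule (my own illustration in Python; the names strength,
s_max, and k, and the parameter values, are assumptions made for the example,
not quantities from any experiment):

# Bounded-increment ("capacitor-charging") learning rule: each
# reinforcement raises response strength in proportion to the distance
# remaining between its current value and a maximum possible value.
def reinforce(strength, s_max=1.0, k=0.2):
    """Return response strength after one reinforcement."""
    return strength + k * (s_max - strength)

strength = 0.0
for trial in range(1, 21):
    strength = reinforce(strength)
    print(f"trial {trial:2d}: strength = {strength:.3f}")
# Strength rises quickly at first, then levels off near s_max, rather
# than growing without bound as in the simple proportional model.

With each reinforcement the remaining distance to the maximum shrinks by a
fixed fraction, so strength approaches its ceiling instead of running away.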

What we can do is go back to the idea that B and R are events that occur
or don't occur, but consider the _probability_ that B will occur. Now we
can say

B --> R (the behavior, if it occurs, leads to R)

pr{B} = pr{B} + k*R   (where R is either 1 or 0)

This arrangement does not cause behavior to run away to infinity,
because now the greatest possible probability is 1. If the behavior
starts with a zero probability of occurring and the probability is
increased every time R occurs, eventually the probability of B occurring
will become 1, and B will occur every time R occurs.
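
A small simulation sketch of this probability-increment rule may help (my
own illustration, with the ceiling at 1 enforced explicitly; the increment
k, the starting probability, and the trial structure are assumptions made
for the example):

import random

# Probability-increment rule: when the behavior occurs it produces a
# reinforcement (R = 1), which raises the probability of the behavior
# by k, capped at the ceiling of 1.
k = 0.05          # increment per reinforcement (assumed value)
p_B = 0.02        # initial probability that B occurs on a given trial

for trial in range(500):
    b_occurs = random.random() < p_B     # does B occur on this trial?
    R = 1 if b_occurs else 0             # B, if it occurs, leads to R
    p_B = min(1.0, p_B + k * R)          # pr{B} <- pr{B} + k*R, capped at 1

print(f"final pr{{B}} = {p_B:.2f}")      # approaches 1 given enough trials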

This, then, gives us a basic model of learning with the right
properties. At first there is no behavior or very little. When the
behavior does appear, it produces a reinforcement, which in turn
increases the probability that the same behavior will occur again. With
enough repetition, the behavior will eventually occur every time the
reinforcer occurs. Since the behavior always results in a reinforcement,
the behavior will persist.

Yes, if we wish to view behavior as a series of well-defined "responses"
such as pressing a lever, we can then talk either about the rate over time
at which these responses occur or about the probability of the response. The
probability analysis provides natural limits of 0 and 1; the rate measure
provides a natural lower limit of 0 but the upper limit depends on further
considerations such as minimum interresponse time and level of motivation.
I point this out merely to indicate that one is not necessarily driven to a
probability analysis as a means of avoiding a runaway to infinity.

This model can now be expanded to include a discriminative stimulus S,
which signals the conditions under which B will produce R. Now the
probability that is altered by R is Pr{B|S}: the probability that
occurrence of the discriminative stimulus S will lead to behavior B.
Again there is no runaway condition, because the maximum possible
probability of any event is 1. The most that can happen even with
continued reinforcement is that every time S occurs, B will occur.

The discriminative stimulus should not be viewed as a stimulus that elicits
a response. Pr(B|S) is not the probability that S will lead to B but the
probability density over time of B, given that S is present. If this
probability is high and S is brief, the result may look like S-R: S occurs
and the response (B) follows almost immediately. But this is only an extreme case.

If the probability in question is considered to be a probability
_density_ (the probability of occurrence per unit time), its reciprocal
can be interpreted as the average interval between occurrences of B. If
the probability density is high, B will occur after only a few
time-increments; if it is low, B will occur only after many time
increments. This leads to seeing the natural measure of behavior as
_rate of repetition_, where the average rate of repetition is simply the
probability density itself (the reciprocal of the average interval).

Yes, and Pr(B|S) can be interpreted as the rate of repetition of B given S.
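
As a check on this reading, here is a small simulation (my own, with an
arbitrary density p chosen for illustration): if B occurs with probability
p in each small time step, the observed rate of B is about p per step and
the mean interval between occurrences is about 1/p steps.

import random

# Relation between probability density, rate, and mean interval:
# events generated with probability p per time step.
p = 0.1                       # probability of B per unit time (density)
steps = 100_000
times = [t for t in range(steps) if random.random() < p]

rate = len(times) / steps
intervals = [t2 - t1 for t1, t2 in zip(times, times[1:])]
mean_interval = sum(intervals) / len(intervals)

print(f"rate of B: {rate:.3f} per step   (density p = {p})")
print(f"mean interval: {mean_interval:.1f} steps   (1/p = {1 / p:.1f})")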

Thus even though behaviors and reinforcements themselves are considered
as unitary events which either occur or don't occur, we can find a
continuous variable in terms of which to measure behavior: its rate of
occurrence, which is closely related to the basic measure of probability
of occurrence per unit time. The rate measure of behavior and
reinforcement is thus the natural result of considering behaviors and
reinforcements to be events, and of proposing that the effect of a
reinforcement on the production of a behavior (or on the response to a
discriminative stimulus) occurs via an effect on the probability-density
of the responses. The use of probabilities is dictated by the fact that
if the magnitude of reinforcement had an effect on the magnitude of
behavior, the resulting model would have stable states only at zero and
infinity.

Again, the use of probabilities is not "dictated" at all, as one may work
directly with the rate measure. In cases where what varies is the magnitude
of the response (i.e., response force), the reinforcement principle has been
applied successfully without recourse to a probability interpretation. This
is why reinforcement theorists talk about reinforcement affecting the
"strength" of behavior rather than its probability. Rate or probability are
considered to be measures of response strength, but so would the force
exerted on the lever if the reinforcement contingency required a specific
force before delivery of the reinforcer.

We can see that as originally conceived, the model is a positive
feedback model. The more reinforcement there is, the more behavior there
is; the more behavior there is, the more reinforcement there is. The
converse also holds true: if there is any lessening of either behavior
or reinforcement, both variables must decrease, with the only limit being
zero.

As each response continues to produce the "same" reinforcement, the rate of
behavior will increase to some value dependent on the value of the
reinforcer to the subject and the cost of responding (e.g., effort). The
"amount" of reinforcement conceived as reinforcement per response is
constant in this scenario and this relationship determines how much response
"strength" is incremented with each successive reinforcement, up to its limit.

This "molecular" analysis based on the immediate consequence of each
response can be supplemented with a "molar" approach based on rates if one
takes the view that the subjects in these experiments can integrate events
over time. In this view, receiving individual reinforcements at a higher
rate is more reinforcing than receiving them at a lower rate. A brief
increase in response rate (e.g., accidental lever presses as byproducts of
general activity) results in an increase in reinforcement rate (increased
rates of lever-pressing are reinforced by increased rates of food pellet
delivery). Decreased rates of lever-pressing lead to decreased rates of
food pellet delivery (punishment). The molar approach thus predicts that
response rates will tend to increase toward the maximum sustainable rate as
determined by reinforcer value and response cost.
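
The feedback function assumed in this molar account is easy to state
explicitly. On a ratio schedule requiring a fixed number of responses per
reinforcer, reinforcement rate is just response rate divided by the ratio,
so any change in response rate produces a proportional change in
reinforcement rate (a sketch with made-up numbers):

# Ratio-schedule feedback function: reinforcement rate is response rate
# divided by the ratio requirement.  Values below are illustrative only.
def reinforcement_rate(response_rate, ratio):
    """Reinforcers per minute earned at a given response rate on FR-<ratio>."""
    return response_rate / ratio

ratio = 10
for response_rate in (20, 30, 40):       # responses per minute
    r = reinforcement_rate(response_rate, ratio)
    print(f"{response_rate} resp/min on FR-{ratio}: {r:.1f} reinforcers/min")
# A burst in responding raises reinforcement rate and a slump lowers it,
# which is the molar consequence described above.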

This is what is observed, and not the bi-stable behavior you predict on the
basis of the positive feedback relationship as you conceive it. The
analysis you propose would lead to instability of behavior on a ratio
schedule: any decrease in response rate (perhaps resulting from the rat's
momentary need to scratch an itch) would be expected to lead to total
extinction of the response. It doesn't happen, and it isn't predicted to
happen by reinforcement theory.

To avoid this obviously inappropriate result, the meaning of "more" had
to be modified so that the outcome was not a runaway condition. This is
the function of the concept of increasing _probabilities_ rather than
_magnitudes_. The model remains a positive feedback model, but the
insertion of a probability puts a limit on the runaway condition at the
point where the probability becomes 1.

But this doesn't solve the problem, does it? Your analysis still has the
instability problem I pointed out whether "more" is an increase in rate or
an increase in probability.

Let's go back to the original model, but this time add a reference
signal and change the system to a negative feedback model:

                         R'*
                          |+
                      ---------
             -------->|  ORG  |---->-----
             |        ---------          |
             |                           |
             |                           |
             |        ---------          |
             R<-------|  ENV  |<---------B
                      ---------

[This was followed by a description of the control model.]

Let's apply this model to the steady-state situation involved in performance
on a simple CRF (1 response per reinforcement) schedule as ordinarily studied
in the operant chamber. A hungry rat is placed in the chamber. We might
conceive of the set point for rate of food pellet consumption as, say, 30
pellets per minute (30 ppm) under these conditions (essentially continuous
eating). However, the apparatus limits the maximum rate to 10 ppm because
of the delays involved in depressing the lever, moving to the food cup,
picking up the food, devouring it, and returning to the lever. Thus, there
is NO WAY that the rat can reach its reference level for this quantity,
although it can reduce the error by a significant amount through
lever-pressing (and by no other means). I believe that control theory
predicts that the rat (once it has learned what to do) will develop a rate
of responding that minimizes the error, up to the point where the error is
reduced enough to bring the output below maximum. Depending on the gain,
there will be a region within which the rate of responding will be a direct
function of the magnitude of the error.
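
Here is a rough steady-state sketch of that situation (my own illustration;
the gain, the iteration, and the exact numbers are assumptions): a
proportional control system whose output, the lever-press rate, saturates at
the apparatus-limited maximum, so the reference for pellet rate is never
reached and a residual error remains.

# CRF steady-state sketch: output (press rate) saturates at the maximum
# rate the apparatus allows, so the food-rate reference cannot be reached.
reference = 30.0    # desired pellets per minute (ppm)
max_rate = 10.0     # apparatus-limited maximum ppm (1 press = 1 pellet on CRF)
gain = 2.0          # output per unit error (assumed)

food_rate = 0.0
for _ in range(50):                              # iterate to steady state
    error = reference - food_rate
    press_rate = min(max_rate, gain * error)     # output saturates at max_rate
    food_rate = press_rate                       # CRF: one pellet per press

print(f"steady-state press rate: {press_rate:.1f} per minute")
print(f"residual error: {reference - food_rate:.1f} ppm (reference never reached)")
# With a smaller gain or a smaller error, press rate would fall below the
# maximum and vary directly with the error, as described above.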

If we now increase the ratio requirement, the maximum rate of reinforcement
available on the schedule is less (say, 5 ppm). This means that a larger
error remains after responding has reached its maximum rate than was the
case with the CRF schedule. If response rate on the CRF schedule was
already at maximum, it will still be at maximum on the new schedule; if not,
then it will increase.

But this analysis ignores response cost. Assume there is a second control
system for "response effort" and that the reference for this perception is
set to a low value. We now have two control systems in conflict when the
rate-of-eating system brings the rate of lever-pressing up and thereby
brings response effort above its set point. The result is that neither
system can reach set point and rate of responding stabilizes at a kind of
equilibrium value between the two set points. As the ratio requirement
increases, the rate of responding increases, but not enough to keep the rate
of food intake constant; instead the latter declines somewhat with
increasing ratio requirements. You get the right-hand portion of the
inverted U function shown by Motheral's data.
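
The conflict can be sketched numerically (again my own illustration: the
additive combination of outputs, the gains, and the reference values are
assumptions, and the apparatus ceiling on press rate is omitted for
simplicity). A food-rate system pushes press rate up while an effort system
with a near-zero reference pushes it down, and the equilibrium shows press
rate rising, but food rate falling, as the ratio requirement grows:

# Two control systems in conflict on ratio schedules: one controls food
# rate (reference 30 ppm), the other controls effort (reference 0), and
# lever-press rate settles where their opposed outputs balance.
ref_food = 30.0                 # desired pellets per minute
ref_effort = 0.0                # desired press rate (effort) -- near zero
g_food, g_effort = 3.0, 0.5     # assumed gains of the two systems

for ratio in (1, 2, 5, 10, 20):
    press = 0.0
    for _ in range(2000):                              # relax to equilibrium
        food = press / ratio                           # schedule feedback function
        out_food = g_food * (ref_food - food)          # pushes press rate up
        out_effort = g_effort * (ref_effort - press)   # pushes press rate down
        press = max(press + 0.01 * (out_food + out_effort - press), 0.0)
    print(f"FR-{ratio:2d}: press rate = {press:5.1f}/min, "
          f"food rate = {press / ratio:4.1f} ppm")
# Press rate climbs with the ratio requirement while obtained food rate
# declines, which is the right-hand limb of the inverted-U relation.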

A similar analysis in terms of reinforcement (response-contingent food
delivery) and punishment (response effort) gives the same result. As the
ratio is increased, reinforcement will still tend to drive the response rate
toward its upper limit. However, the cost of the food in terms of response
effort increases, lowering the overall reinforcing "value" of the
transaction and reducing the rate of responding below that expected if no
effort were required in order to obtain the food. (There is also the effect
of delayed reinforcement to be considered, which I ignore here for
simplicity.)

You don't see the negative feedback regulation of food rate in the control
model because food rate never reaches its reference value. For this reason
I feel that your control model of the right portion of the Motheral curve
does not apply in the way it would if control could be achieved.

As the reinforcement earned via lever-pressing declines with further
increases in ratio requirement, one must become concerned with the possible
effect of alternative sources of reinforcement on responding. The effort
expended per reinforcer gets larger and larger as the ratio increases (the
cost of a reinforcer), diminishing the overall reward value. There comes a
point where doing other things is more rewarding than lever pressing, and
behavior will shift away from the latter activity. The further diminishing
returns for lever pressing and the availability of alternative sources of
reinforcement (exploration, etc.) contribute to the left portion of
Motheral's curve.
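
One crude way to picture this shift (purely illustrative assumptions,
loosely in the spirit of a matching-law allocation, not a model proposed in
this exchange): let the net value of lever pressing be the reinforcer's
value minus the effort cost of the ratio requirement, and allocate time
between pressing and other activities in proportion to their values.

# Allocation sketch: net value of lever pressing falls as the ratio (and
# hence the effort cost per reinforcer) grows, so the share of time spent
# pressing shrinks in favor of alternative activities.  All values assumed.
reinforcer_value = 20.0   # value of one food pellet (arbitrary units)
effort_cost = 0.3         # cost per lever press (arbitrary units)
alt_value = 5.0           # value of alternative activities (exploring, etc.)

for ratio in (5, 20, 50, 100):
    lever_value = max(reinforcer_value - effort_cost * ratio, 0.0)
    share = lever_value / (lever_value + alt_value)
    print(f"FR-{ratio:3d}: net value of pressing = {lever_value:5.1f}, "
          f"share of time on lever = {share:.2f}")
# As the ratio grows, pressing loses out to other activities -- the
# contribution to the left portion of Motheral's curve described above.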

To say that a reinforcement increases the probability of a behavior thus
can legitimately be taken as meaning that R increases the probability
that the _right kind_ of behavior will appear. Instead of rearing up
against the wall of the cage in one place, the rat rears up in another
place such that the act of rearing up causes the lever to be depressed
by the front paws. What we observe is an increase in the probability
that the rat will rear up in that one critical position rather than in
any other. Equivalently, since we can't directly observe probabilities,
we can say that the rat's behavior changes so it shows an ever-greater
proportion of rearing-up actions in the right place relative to all
other places.

This is the sense in which Thorndike applied the Law of Effect to his cat
data, and it has always been recognized as an effect of reinforcement
(although this fact has sometimes been overlooked by some theorists). In
addition to selection, though, reinforcement has been considered to
strengthen the response absolutely as well as relatively.

The one area of confusion that is left concerns measuring behavior in
terms of rate of occurrence. This confusion shows up because in the
attempt to explain performance as well as acquisition of behavior,
experimenter-theoreticians set up experiments in which it was
impossible, even when the right behavior had been acquired, to vary the
_amount_ of effect produced by a single behavior. The use of levers and
keys effectively removed any ability of an animal to have a quantitative
effect on the reinforcement by varying its efforts.

This is true of most (but not all) operant conditioning experiments, where
rate of response is the most typical measure of response strength. But
there are many other examples in which other measures were employed, such as
response force, response latency, response probability (in choice studies),
and number of errors committed.

The reinforcement model can _also_ be set up to produce a continuous
relationship between rate of bar pressing and rate of reinforcement,
through the intermediate step of effects of reinforcement on probability
of behavior per unit time. However, this creates a conflict between the
reinforcement model and the negative feedback model. The reinforcement
model says that an increase in reinforcement rate must produce an
increase in the probability (frequency) of behavior, while the negative
feedback model says that an increase in reinforcement rate must produce
a _decrease_ in the behavior rate. If this conflict can be resolved,
there will be no further difficulties between the reinforcement model
and the negative feedback control system model. This is not to say that
the models will not be further modified on the basis of other
considerations, but at least this direct contradiction will have been
removed.
---------------------------------------------------------------------
The contradiction can be removed by saying that the reinforcement model
applies strictly to the process of increasing the relative frequency
with which the right behavior occurs, in comparison with all non-
reinforcement-producing behaviors that might also occur. Once the right
behavior has been acquired, the negative feedback control system model
applies to the process of creating the desired amount of reinforcement.

As I hope I've communicated above, the reinforcement model can handle the
data without restricting its application to the lower end of Motheral's
curve. However, the application at this end does partly involve considering
the selective effect of reinforcement when alternative sources of
reinforcement are available for different behaviors, as you suggest.

In the data that appear ambiguous, supporting the reinforcement model at
one extreme and the control model at the other, the difference can now
be explained easily, and in a way that can be tested against
experimental data. Where reinforcement rates are low enough, the
reinforcement model applies, and the animal begins to search for other
behaviors that will more reliably produce the reinforcer. This means
that other behaviors besides the bar-pressing will be seen, and
proportionally less time will be spent doing the right behavior. This
shows up as an apparent drop-off of behavior with decreasing
reinforcement rates, or an apparent increase in behavior rate (of the
kind being measured) with increasing reinforcement rates. Where
reinforcement rates are high enough, the animal essentially always uses
the right behavior, and controls the amount of received reinforcer near
a specific reference level by varying its rate of behavior: now an
increase in reinforcement rate goes with a decrease in behavior rate.

This is a nice resolution, and essentially agrees with the analysis I
presented in pure reinforcement terms. Clearly the upper limb of Motheral's
curve is better explained by considering the way the mechanism of the
control model would be expected to operate under the conditions of the study
than by the complex analysis that must be developed under reinforcement
theory (although one CAN be constructed). And as you suggest, the lower
limb can be understood at least partly by considering the selective property
of the reinforcement principle. An interesting project would be to consider
how such a selective effect might be handled by PCT.

The crux of the matter is this: Given that there are several activities
that can be carried out, but which cannot be carried out simultaneously,
what determines which control system will be active? For example, will the
food-earning system predominate or the one that results in exploration?
What determines how the organism will allocate its behavioral resources?

Regards,

Bruce