# The Domains of reinforcement theory and PCT

[From Bill Powers (950619.1455 MDT)]

Some thoughts on reinforcement theory that must have passed through the
minds of those who developed it, if not exactly in the order presented
below. And following, a possible resolution of the conflict between
reinforcement theory and control theory.

···

-----------------------------------------------
Consider this simple closed-loop representation of the reinforcement
model:

              --------
      ------>|  ORG   |---->---
     |        --------        |
     |                        |
     |                        |
     |        --------        |
     R<------|  ENV   |<------B
              --------

The organism outputs a behavior B with a reinforcing consequence R
produced through an environmental link. The reinforcing consequence R
enters the organism via the senses, and the result is a change in the behavior
B. What kind of change in B follows from the reinforcement R?

If we think of B and R as events, then there can be no change in either
B or R taken as individual occurrences. They either occur or they do not
occur; there is no other choice. A lever is either pressed or it is not
pressed (B); a reinforcer is either given as a consequence or it is not
given (R). This would seem to leave reinforcements with no way of having
any effect on behaviors. This is not what we want as a model of
learning.

Suppose we thought of B and R as having variable magnitudes. Now we can
have more or less of B, and more or less of R. This allows R to have an
effect on B in terms of magnitude. If we assumed that an increase in R
causes an increase in B and that R is proportional to B, then we would
have

R1 = k1*B1
B2 = B1 + k2*R1.
R2 = k1*B2
B3 = B2 + k2*R2,

and so on. Each pass multiplies B by (1 + k1*k2), so with these equations
the behavior B gets larger and larger without bound whenever k1*k2 > 0,
and gets smaller and smaller until it is zero only if the product is
negative (for example, -1 < k1*k2 < 0).

This model has no stable states other than infinity or zero. This is not
what we want as a model of learning, either.
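
The runaway can be checked by iterating the four equations above directly. What follows is a minimal Python sketch, not part of the original argument; the function name, the starting value B = 1 and the particular constants are assumptions chosen for illustration. Each pass multiplies B by (1 + k1*k2).

```python
def iterate_magnitude_model(b0, k1, k2, steps):
    """Iterate R = k1*B, B_next = B + k2*R and return the successive B values."""
    history = [b0]
    b = b0
    for _ in range(steps):
        r = k1 * b          # reinforcement proportional to behavior
        b = b + k2 * r      # behavior incremented in proportion to reinforcement
        history.append(b)
    return history

# Illustrative constants (assumed): factor per pass is 3 in the first case, 0.5 in the second.
growing = iterate_magnitude_model(1.0, k1=1.0, k2=2.0, steps=20)
shrinking = iterate_magnitude_model(1.0, k1=1.0, k2=-0.5, steps=20)
print(growing[-1], shrinking[-1])
```

In the first run B explodes; in the second it dwindles toward zero. Neither run settles at an intermediate value.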

What we can do is go back to the idea that B and R are events that occur
or don't occur, but consider the _probability_ that B will occur. Now we
can say

B --> R (the behavior, if it occurs, leads to R)

pr{B} = pr{B} + k*R (where R is either 1 or 0)

This arrangement does not cause behavior to run away to infinity,
because now the greatest possible probability is 1. If the behavior
starts with a zero probability of occurring and the probability is
increased every time R occurs, eventually the probability of B occurring
will become 1, and B will occur every time R occurs.

This, then, gives us a basic model of learning with the right
properties. At first there is no behavior or very little. When the
behavior does appear, it produces a reinforcement, which in turn
increases the probability that the same behavior will occur again. With
enough repetition, the behavior will eventually occur every time the
reinforcer occurs. Since the behavior always results in a reinforcement,
the behavior will persist.
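
A minimal sketch of this probability-update rule in Python, under three assumptions that go beyond the text: the starting probability is small but not zero (with exactly zero, B would never occur and so could never be reinforced), the sum is capped at 1, and every occurrence of B is reinforced. The function name, the seed and the parameter values are illustrative only.

```python
import random

def simulate_probability_learning(p0, k, trials, seed=0):
    """Apply pr{B} = pr{B} + k*R on each trial, capped at 1; return the history."""
    rng = random.Random(seed)
    p = p0
    history = [p]
    for _ in range(trials):
        behaved = rng.random() < p   # B occurs with the current probability
        r = 1 if behaved else 0      # B --> R: assumed that every B is reinforced
        p = min(1.0, p + k * r)      # the cap at 1 is what stops the runaway
        history.append(p)
    return history

history = simulate_probability_learning(p0=0.05, k=0.1, trials=2000)
print(history[0], history[-1])
```

The probability only ever rises, and it stops rising once it reaches 1, which is the natural limit described above.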

This model can now be expanded to include a discriminative stimulus S,
which signals the conditions under which B will produce R. Now the
probability that is altered by R is Pr{B|S}: the probability that
occurrence of the discriminative stimulus S will lead to behavior B.
Again there is no runaway condition, because the maximum possible
probability of any event is 1. The most that can happen even with
continued reinforcement is that every time S occurs, B will occur.

If the probability in question is considered to be a probability
_density_ (the probability of occurrence per unit time), the probability
of B can be interpreted in terms of the average interval between
occurrences of B. If the probability density is high, B will occur after
only a few time increments; if low, B will occur only after many time
increments. This leads to seeing the natural measure of behavior as
_rate of repetition_, where the average rate of repetition is simply the
probability density and the average interval between repetitions is its
reciprocal.
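
The relation among probability per time step, rate and average interval can be illustrated with a simulated stream of events. This sketch assumes discrete time steps and an independent chance p of B occurring in each step; the function names, the seed and the value p = 0.2 are hypothetical choices, not anything from the original discussion.

```python
import random

def simulate_event_times(p_per_step, steps, seed=1):
    """Return the time steps at which B occurred, with chance p_per_step per step."""
    rng = random.Random(seed)
    return [t for t in range(steps) if rng.random() < p_per_step]

def mean_interval(times):
    """Average number of time steps between successive occurrences."""
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return sum(gaps) / len(gaps)

p = 0.2
steps = 100_000
times = simulate_event_times(p, steps)
rate = len(times) / steps        # occurrences per time step
interval = mean_interval(times)  # time steps per occurrence
print(rate, interval)
```

In a run like this the measured rate comes out close to p and the measured interval close to 1/p, so either one can serve as the continuous measure of an all-or-nothing event.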

Thus even though behaviors and reinforcements themselves are considered
as unitary events which either occur or don't occur, we can find a
continuous variable in terms of which to measure behavior: its rate of
occurrence, which is closely related to the basic measure of probability
of occurrence per unit time. The rate measure of behavior and
reinforcement is thus the natural result of considering behaviors and
reinforcements to be events, and of proposing that the effect of a
reinforcement on the production of a behavior (or on the response to a
discriminative stimulus) occurs via an effect on the probability-density
of the responses. The use of probabilities is dictated by the fact that
if the magnitude of reinforcement had an effect on the magnitude of
behavior, the resulting model would have stable states only at zero and
infinity.

A ratio schedule is one in which the occurrence of a reinforcer depends
probabilistically on the occurrence of a behavior. If the probability is
less than 1, we have a variable ratio schedule which requires a number
of behavioral acts to produce one reinforcement. A fixed ratio schedule
is a regular approximation to the variable-ratio schedule. Once again,
there is a natural limit that prevents runaway: there can be no more
than one reward per behavior; the probability of a reward given a
behavior can be no greater than 1.
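
As a rough illustration of the two schedules, here is a Python sketch that counts reinforcers for a given number of behavioral acts. It assumes the variable-ratio schedule rewards each act independently with probability q, and that the fixed-ratio schedule rewards every n-th act; the function names and numbers are invented for the example.

```python
import random

def variable_ratio_reinforcers(q, presses, seed=2):
    """Each press is reinforced independently with probability q."""
    rng = random.Random(seed)
    return sum(1 for _ in range(presses) if rng.random() < q)

def fixed_ratio_reinforcers(n, presses):
    """Every n-th press is reinforced."""
    return presses // n

presses = 50_000
vr = variable_ratio_reinforcers(0.1, presses)
fr = fixed_ratio_reinforcers(10, presses)
print(presses / vr, presses / fr)  # average presses per reinforcer
```

Both schedules demand about ten acts per reinforcer here, and in neither case can the number of reinforcers exceed the number of acts.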

From here the development can be taken in many directions which will not
be considered here.
-------------------------------------------------
We can see that as originally conceived, the model is a positive
feedback model. The more reinforcement there is, the more behavior there
is; the more behavior there is, the more reinforcement there is. The
converse also holds true: if there is any lessening of either behavior
or reinforcement, both variables must decrease, with the only limit being
zero.

To avoid this obviously inappropriate result, the meaning of "more" had
to be modified so that the outcome was not a runaway condition. This is
the function of the concept of increasing _probabilities_ rather than
_magnitudes_. The model remains a positive feedback model, but the
insertion of a probability puts a limit on the runaway condition where
the probability becomes 1.
--------------------------------------------------
Let's go back to the original model, but this time add a reference
signal and change the system to a negative feedback model:

                 R'*
                  |+
              --------
      ------>|  ORG   |---->---
     |        --------        |
     |                        |
     |                        |
     |        --------        |
     R<------|  ENV   |<------B
              --------

The symbol R' (R-prime) now stands for a desired amount of R. While it
is not evident in the diagram, we have also changed the sign of the
effect of R on B. Now B is determined by the difference between R' and
R: that is,

B = k1*(R' - R).

As a result, an increase in R will cause a decrease in B. As before, an
increase in B will cause an increase in R:

R = k2*B

If we now consider that R and B are variable in magnitude, rather than
being events, we can solve the above two equations for the steady-state
result:

R = k2*k1*(R' - R), or

       k1*k2
R = ----------- R'
     1 + k1*k2

If k1*k2 becomes much larger than one, we find that the result is

R = R'.

Now thinking of R and B in terms of magnitudes leads to a stable system.
R will come to the magnitude specified by R', and B will be R'/k2. The
only requirement is that k1*k2 be so large that we can neglect the 1 in
the denominator.
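
To put numbers on the steady-state result, here is a small Python sketch that evaluates the two equations B = k1*(R' - R) and R = k2*B at equilibrium. The function name and the constants (R' = 10, k1 = 500, k2 = 2) are assumptions picked only to make the loop gain large.

```python
def steady_state(r_ref, k1, k2):
    """Solve B = k1*(R' - R), R = k2*B for the steady-state R and B."""
    loop_gain = k1 * k2
    r = loop_gain / (1.0 + loop_gain) * r_ref
    b = k1 * (r_ref - r)
    return r, b

r, b = steady_state(r_ref=10.0, k1=500.0, k2=2.0)
print(r, b)  # R is close to R' = 10, B is close to R'/k2 = 5
```

With a loop gain of 1000, R falls short of R' by about one part in a thousand, and B sits correspondingly close to R'/k2.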
----------------------------------------------
Needless to say, realizing either of the above two models in a physical
system requires adding some details: in the reinforcement model we must
find a physical way of doing the equivalent of changing a probability.
In the negative feedback model we must insert a filter that allows k1*k2
to be large without resulting in oscillations.
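
The need for a filter in the negative feedback model can be seen in a discrete-time sketch. The Python below is one possible arrangement, not the only one: it assumes the filter is a simple slowing factor on the output, so that B moves only a fraction of the way toward k1*(R' - R) on each step. The name `slowing` and all constants are illustrative assumptions.

```python
def simulate_loop(r_ref, k1, k2, slowing, steps):
    """Iterate the loop; slowing=1.0 means no filter, smaller values smooth B."""
    b = 0.0
    r = 0.0
    history = []
    for _ in range(steps):
        error = r_ref - r
        b = b + slowing * (k1 * error - b)  # move B part way toward k1*error
        r = k2 * b                          # environment link
        history.append(r)
    return history

unfiltered = simulate_loop(10.0, k1=25.0, k2=2.0, slowing=1.0, steps=20)
filtered = simulate_loop(10.0, k1=25.0, k2=2.0, slowing=0.01, steps=200)
print(unfiltered[-1], filtered[-1])
```

With a loop gain of 50 and no slowing, R swings back and forth with growing amplitude. With the slowing factor in place, R settles at (50/51)*R', the value the steady-state formula predicts.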
----------------------------------------------
Under what conditions would we expect the reinforcement model to be
appropriate? The answer can be found by asking under what conditions we
could legitimately consider only the _occurrence_ of a behavior and a
reinforcement without also considering their magnitudes.

When B is considered as an event, we have only the choice between that
behavior being observed and its not being observed. This is equivalent
to seeing either the behavior B that results in R, or SOME OTHER
BEHAVIOR that does not result in the appearance of R. When an operant
behavior is being acquired, this is exactly the situation that is seen.
If the rat is not pressing the bar, it is doing something else. So to
speak of the probability of the behavior is to speak of the probability
that the _right_ behavior, among all those kinds of behavior that are
possible, has occurred.

To say that a reinforcement increases the probability of a behavior thus
can legitimately be taken as meaning that R increases the probability
that the _right kind_ of behavior will appear. Instead of rearing up
against the wall of the cage in one place, the rat rears up in another
place such that the act of rearing up causes the lever to be depressed
by the front paws. What we observe is an increase in the probability
that the rat will rear up in that one critical position rather than in
any other. Equivalently, since we can't directly observe probabilities,
we can say that the rat's behavior changes so it shows an ever-greater
proportion of rearing-up actions in the right place relative to all
other places.

This is the kind of positive feedback situation in which there is a
natural limit to the runaway condition: the animal can't spend more than
all of its time performing the right kind of behavior to produce
reinforcements. The reinforcement model is therefore applicable, if
anywhere, to the process of _acquiring the right kind of behavior_ as
opposed to behaviors that have no effect on producing reinforcements.

Once the right kind of behavior has been established, the reinforcement
model has gone as far as it can go. It cannot also account for the fact
that the animal comes to produce exactly the amount of behavior that is
needed to bring the amount of reinforcement to the _right amount_. That
process requires a negative feedback model with a reference signal, with
provision for adjusting the feedback parameters for best control.
-------------------------------------------
The one area of confusion that is left concerns measuring behavior in
terms of rate of occurrence. This confusion shows up because in the
attempt to explain performance as well as acquisition of behavior,
experimenter-theoreticians set up experiments in which it was
impossible, even when the right behavior had been acquired, to vary the
_amount_ of effect produced by a single behavior. The use of levers and
keys effectively removed any ability of an animal to have a quantitative
effect on the reinforcement by varying its efforts.

Even in this artificial situation, an animal could vary its behavior as
a way of varying the _average_ amount of obtained reinforcement. Once
the right behavior had been found, pressing a key or lever, the animal
(or at least some animals) could now vary the amount of received
reinforcement by varying the rate at which it pressed the key or lever.

So a negative feedback model can be set up in which R and B are both
measured in terms of rates, with R having an average perceived effect
depending on the rate of delivery of reinforcements and the internal
rate of decay of effects of reinforcements. This internal perceptual
effect could then be compared with the desired effect, and the
difference could be converted proportionally into a rate of generation
of bar or key pressing acts.
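
A rate-based loop of this kind might be sketched in Python as follows. The details are assumptions rather than a worked-out model: the perceived effect is taken to be a leaky accumulation of reinforcements, the schedule is a fixed number of presses per reinforcer, and the press rate is simply proportional to the error (never negative). All names and constants are hypothetical.

```python
def simulate_rate_control(ref, gain, decay, presses_per_reinforcer,
                          dt=0.01, steps=5000):
    """Return (press_rate, reinforcement_rate) after the loop has settled."""
    p = 0.0  # perceived effect of reinforcement (leaky accumulator)
    for _ in range(steps):
        press_rate = max(0.0, gain * (ref - p))           # output proportional to error
        reinf_rate = press_rate / presses_per_reinforcer  # ratio schedule
        p += dt * (reinf_rate - decay * p)                # accumulate, then decay
    press_rate = max(0.0, gain * (ref - p))
    return press_rate, press_rate / presses_per_reinforcer

lean_press, lean_reinf = simulate_rate_control(10.0, 100.0, 1.0, presses_per_reinforcer=5)
rich_press, rich_reinf = simulate_rate_control(10.0, 100.0, 1.0, presses_per_reinforcer=1)
print(lean_press, lean_reinf)
print(rich_press, rich_reinf)
```

On the richer schedule the reinforcement rate is slightly higher and the press rate is much lower, which is the negative feedback relationship: more reinforcement goes with less behavior.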

The reinforcement model can _also_ be set up to produce a continuous
relationship between rate of bar pressing and rate of reinforcement,
through the intermediate step of effects of reinforcement on probability
of behavior per unit time. However, this creates a conflict between the
reinforcement model and the negative feedback model. The reinforcement
model says that an increase in reinforcement rate must produce an
increase in the probability (frequency) of behavior, while the negative
feedback model says that an increase in reinforcement rate must produce
a _decrease_ in the behavior rate. If this conflict can be resolved,
there will be no further difficulties between the reinforcement model
and the negative feedback control system model. This is not to say that
the two models will need no further modification on the basis of other
considerations, but at least this direct contradiction will have been
removed.
---------------------------------------------------------------------
The contradiction can be removed by saying that the reinforcement model
applies strictly to the process of increasing the relative frequency
with which the right behavior occurs, in comparison with all non-
reinforcement-producing behaviors that might also occur. Once the right
behavior has been acquired, the negative feedback control system model
applies to the process of creating the desired amount of reinforcement.

In the data that appear ambiguous, supporting the reinforcement model at
one extreme and the control model at the other, the difference can now
be explained easily, and in a way that can be tested against
experimental data. Where reinforcement rates are low enough, the
reinforcement model applies, and the animal begins to search for other
behaviors that will more reliably produce the reinforcer. This means
that other behaviors besides the bar-pressing will be seen, and
proportionally less time will be spent doing the right behavior. This
shows up as an apparent drop-off of behavior with decreasing
reinforcement rates, or an apparent increase in behavior rate (of the
kind being measured) with increasing reinforcement rates. Where
reinforcement rates are high enough, the animal essentially always uses
the right behavior, and controls the amount of received reinforcer near
a specific reference level by varying its rate of behavior: now an
increase in reinforcement rate goes with a decrease in behavior rate.
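
The shape of curve this proposal predicts can be sketched with a toy calculation in Python. Everything quantitative here is assumed for illustration: the share of time spent on the right behavior is given an arbitrary saturating form, the press rate while engaged is the steady state of a proportional control loop with a leaky perceptual signal, and the constants are invented. It is a picture of the idea, not a fit to any data.

```python
def controlled_press_rate(reinf_per_press, ref=10.0, gain=100.0, decay=1.0):
    """Steady-state press rate of the negative feedback loop while engaged."""
    loop_gain = gain * reinf_per_press / decay
    return gain * ref / (1.0 + loop_gain)

def engaged_fraction(reinf_per_press, half_point=0.01):
    """ASSUMED saturating curve: share of time spent on the right behavior."""
    return reinf_per_press / (reinf_per_press + half_point)

def observed_press_rate(reinf_per_press):
    """Press rate an observer would record, averaged over all activity."""
    return engaged_fraction(reinf_per_press) * controlled_press_rate(reinf_per_press)

schedules = [0.001, 0.01, 0.1, 1.0]  # reinforcers per press, lean to rich
observed = [observed_press_rate(m) for m in schedules]
obtained = [m * rate for m, rate in zip(schedules, observed)]
for m, rate, reinf in zip(schedules, observed, obtained):
    print(m, round(rate, 2), round(reinf, 3))
```

Under these assumptions the obtained reinforcement rate rises steadily from the leanest schedule to the richest, while the observed press rate first rises and then falls, giving the two-sided pattern described above.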
-----------------------------------------------------------------------
Best to all EABers and PCTers,

Bill P.

<[Bill Leach 950619.23:09 U.S. Eastern Time Zone]

[From Bill Powers (950619.1455 MDT)]

An excellent posting I think... though I suppose that we will really
have to wait for Bruce to return and see how he perceives it.

Something that occurred to me as I was trying to "think like the devil's

Regardless of what "errors" may have existed in the thinking associated
with the "control" test that Bruce described, I believe that an important
concept might have been missed (or at least the concept's real
significance might have been -- and assuredly was by me anyway).

It seems that the discussion centered upon the test methodology and of
course the related testing performed in an attempt to confirm
assumptions.

Though the purpose of the test was talked about (and I believe that Rick
even mentioned "search for controlled variable"), it seems that it was
not really noticed that "trying to determine if the rats preferred the
ability to control the shock conditions" is clearly NOT a search for a
reinforcer NOR is it a search for a particular ACTION or OUTPUT
(behaviour in the EAB sense, I believe).

In other words, the explicit function of the test was to determine if a
rat had a particular PURPOSE (i.e., a PCT reference implying a controlled
variable). It seems that this is the case, even if the researchers were
not themselves aware of the vast difference between trying to determine
what the rat is observed to do as opposed to WHY (from the rat's own
perspective) it is doing anything.

The other related tests were specifically intended to ensure that
observed behaviour would not be due to other possible goals of the rat.
And again, regardless of the possibility of serious "flaws" in both the
specific methodology and thinking with regard to experimental design, the
basic idea was right whether it was recognized to be so for the right
reasons or not.

Once again, I might be engaging in "wishful" thinking but if the above is
even close then EAB is at least heading in the right direction for the
recognition of the validity of PCT.

-bill