[From Bill Powers (950619.1455 MDT)]

Some thoughts on reinforcement theory that must have passed through the

minds of those who developed it, if not exactly in the order presented

below. And following, a possible resolution of the conflict between

reinforcement theory and control theory.


-----------------------------------------------

Consider this simple closed-loop representation of the reinforcement

model:

              --------
     ------->|  ORG   |---->--
    |         --------        |
    |                         |
    |                         |
    |         --------        |
    R<-------|  ENV   |<------B
              --------

The organism outputs a behavior B with a reinforcing consequence R

produced through an environmental link. The reinforcing consequence R

enters the organism via senses, and the result is change in the behavior

B. What kind of change in B follows from the reinforcement R?

If we think of B and R as events, then there can be no change in either

B or R taken as individual occurrences. They either occur or they do not

occur; there is no other choice. A lever is either pressed or it is not

pressed (B); a reinforcer is either given as a consequence or it is not

given (R). This would seem to leave reinforcements with no way of having

any effect on behaviors. This is not what we want as a model of

learning.

Suppose we thought of B and R as having variable magnitudes. Now we can

have more or less of B, and more or less of R. This allows R to have an

effect on B in terms of magnitude. If we assumed that an increase in R

causes an increase in B and that R is proportional to B, then we would

have

    R1 = k1*B1
    B2 = B1 + k2*R1
    R2 = k1*B2
    B3 = B2 + k2*R2,

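and so on. Each pass multiplies B by (1 + k1*k2), so with k1 and k2 both

positive the behavior B will get larger and larger without bounds. If

each new B is instead set by the last reinforcement alone (B2 = k2*R1),

each pass multiplies B by k1*k2, and B will either run away (k1*k2 > 1)

or get smaller and smaller until it is zero (k1*k2 < 1).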

This model has no stable states other than infinity or zero. This is not

what we want as a model of learning, either.
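
The runaway is easy to check numerically. What follows is a minimal Python sketch, not part of the original argument: the function name `iterate_magnitudes`, the `cumulative` switch and the constants are illustrative assumptions. With `cumulative=True` it iterates the equations exactly as written (B2 = B1 + k2*R1); with `cumulative=False` it uses the simpler loop B2 = k2*R1, where the per-pass multiplier is just k1*k2.

```python
def iterate_magnitudes(b1, k1, k2, steps, cumulative=True):
    """Return [B1, B2, ...] for the magnitude version of the loop."""
    history = [b1]
    b = b1
    for _ in range(steps):
        r = k1 * b                      # R_n = k1 * B_n
        if cumulative:
            b = b + k2 * r              # B_{n+1} = B_n + k2 * R_n
        else:
            b = k2 * r                  # B_{n+1} = k2 * R_n (assumed variant)
        history.append(b)
    return history

# Same loop gain k1*k2 = 0.25 in both runs.
growing = iterate_magnitudes(1.0, k1=0.5, k2=0.5, steps=20)
shrinking = iterate_magnitudes(1.0, k1=0.5, k2=0.5, steps=20, cumulative=False)
runaway = iterate_magnitudes(1.0, k1=2.0, k2=1.0, steps=20, cumulative=False)

print("cumulative, k1*k2 = 0.25:     ", growing[-1])
print("non-cumulative, k1*k2 = 0.25: ", shrinking[-1])
print("non-cumulative, k1*k2 = 2:    ", runaway[-1])
```

In none of these runs does B settle at a finite nonzero value, which is the point of the paragraph above.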

What we can do is go back to the idea that B and R are events that occur

or don't occur, but consider the _probability_ that B will occur. Now we

can say

    B --> R (the behavior, if it occurs, leads to R)
    pr{B} = pr{B} + k*R (where R is either 1 or 0)

This arrangement does not cause behavior to run away to infinity,

because now the greatest possible probability is 1. If the behavior

starts with a zero probability of occurring and the probability is

increased every time R occurs, eventually the probability of B occurring

will become 1, and B will occur every time R occurs.

This, then, gives us a basic model of learning with the right

properties. At first there is no behavior or very little. When the

behavior does appear, it produces a reinforcement, which in turn

increases the probability that the same behavior will occur again. With

enough repetition, the behavior will eventually occur every time the

reinforcer occurs. Since the behavior always results in a reinforcement,

the behavior will persist.
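
A minimal Python sketch of this probability version follows. It is only an illustration under stated assumptions: the name `simulate_probability_model`, the starting probability `p0`, the increment `k` and the explicit cap at 1 are all mine, chosen to make the rule pr{B} = pr{B} + k*R runnable.

```python
import random

def simulate_probability_model(p0=0.05, k=0.1, trials=2000, seed=0):
    """Track pr{B} over a series of opportunities to behave."""
    rng = random.Random(seed)
    p = p0
    history = []
    for _ in range(trials):
        b_occurs = rng.random() < p     # B occurs with probability pr{B}
        r = 1 if b_occurs else 0        # B --> R: every B is reinforced
        p = min(1.0, p + k * r)         # pr{B} = pr{B} + k*R, capped at 1 (assumption)
        history.append(p)
    return history

history = simulate_probability_model()
print("final pr{B}:", history[-1])

# A start of exactly zero never produces B, so it never produces R either.
stuck = simulate_probability_model(p0=0.0)
print("final pr{B} from a zero start:", stuck[-1])
```

In this sketch the probability only ever rises and then sits at the cap. Note that a start of exactly zero never moves, so a small nonzero `p0` stands in for "very little" behavior.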

This model can now be expanded to include a discriminative stimulus S,

which signals the conditions under which B will produce R. Now the

probability that is altered by R is pr{B|S}: the probability that

occurrence of the discriminative stimulus S will lead to behavior B.

Again there is no runaway condition, because the maximum possible

probability of any event is 1. The most that can happen even with

continued reinforcement is that every time S occurs, B will occur.

If the probability in question is considered to be a probability

_density_ (the probability of occurrence per unit time), the probability

of B determines the average interval between occurrences of B. If the

probability density is high, B will occur after only a few time

increments; if low, B will occur only after many time increments. This

leads to seeing the natural measure of behavior as _rate of repetition_,

where the average rate of repetition is simply the probability density

and the average interval between occurrences is its reciprocal.
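
The link between probability per time increment, interval and rate can be checked with a short simulation. This is a sketch under my own assumptions (discrete time increments, independent draws, the hypothetical name `mean_interval`); it shows that when B occurs with probability p in each increment, the mean interval between occurrences comes out close to 1/p, so the rate of repetition tracks p itself.

```python
import random

def mean_interval(p, steps=200_000, seed=1):
    """Average number of time increments between successive occurrences of B."""
    rng = random.Random(seed)
    last = None
    gaps = []
    for t in range(steps):
        if rng.random() < p:            # B occurs in this time increment
            if last is not None:
                gaps.append(t - last)
            last = t
    return sum(gaps) / len(gaps)

for p in (0.5, 0.1, 0.02):
    interval = mean_interval(p)
    print(f"p = {p}: mean interval = {interval:.2f}, rate = {1.0 / interval:.3f}")
```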

Thus even though behaviors and reinforcements themselves are considered

as unitary events which either occur or don't occur, we can find a

continuous variable in terms of which to measure behavior: its rate of

occurrence, which is closely related to the basic measure of probability

of occurrence per unit time. The rate measure of behavior and

reinforcement is thus the natural result of considering behaviors and

reinforcements to be events, and of proposing that the effect of a

reinforcement on the production of a behavior (or on the response to a

discriminative stimulus) occurs via an effect on the probability-density

of the responses. The use of probabilities is dictated by the fact that

if the magnitude of reinforcement had an effect on the magnitude of

behavior, the resulting model would have stable states only at zero and

infinity.

A ratio schedule is one in which the occurrence of a reinforcer depends

probabilistically on the occurrence of a behavior. If the probability is

less than 1, we have a variable ratio schedule which requires a number

of behavioral acts to produce one reinforcement. A fixed ratio schedule

is a regular approximation to the variable-ratio schedule. Once again,

there is a natural limit that prevents runaway: there can be no more

than one reward per behavior; the probability of a reward given a

behavior can be no greater than 1.
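
The two kinds of ratio schedule can be sketched as follows. The function names and the numbers are hypothetical, chosen only to illustrate the paragraph above: a variable-ratio schedule delivers a reinforcer after each act with probability q, and a fixed-ratio schedule delivers one after every `ratio`-th act.

```python
import random

def fixed_ratio_rewards(n_acts, ratio):
    """Reinforcers earned when every `ratio`-th act is reinforced."""
    return n_acts // ratio

def variable_ratio_rewards(n_acts, q, seed=2):
    """Reinforcers earned when each act is reinforced with probability q."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n_acts) if rng.random() < q)

acts = 100_000
fr = fixed_ratio_rewards(acts, ratio=5)
vr = variable_ratio_rewards(acts, q=0.2)
print("fixed ratio 5:       ", fr, "reinforcers")
print("variable ratio q=0.2:", vr, "reinforcers, about", round(acts / vr, 2), "acts each")
```

Neither schedule can deliver more reinforcers than there are acts, which is the natural limit mentioned above.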

From here the development can be taken in many directions which will not

be considered here.

-------------------------------------------------

We can see that as originally conceived, the model is a positive

feedback model. The more reinforcement there is, the more behavior there

is; the more behavior there is, the more reinforcement there is. The

converse also holds true; if there is any lessening of either behavior

or reinforcement, both variables must decrease with the only limit being

zero.

To avoid this obviously inappropriate result, the meaning of "more" had

to be modified so that the outcome was not a runaway condition. This is

the function of the concept of increasing _probabilities_ rather than

_magnitudes_. The model remains a positive feedback model, but the

insertion of a probability puts a limit on the runaway condition where

the probability becomes 1.

--------------------------------------------------

Let's go back to the original model, but this time add a reference

signal and change the system to a negative feedback model:

                R'*
                 |+
              --------
     ------->|  ORG   |---->--
    |         --------        |
    |                         |
    |                         |
    |         --------        |
    R<-------|  ENV   |<------B
              --------

The symbol R' (R-prime) now stands for a desired amount of R. While it

is not evident in the diagram, we have also changed the sign of the

effect of R on B. Now B is determined by the difference between R' and

R: that is,

B = k1*(R' - R).

As a result, an increase in R will cause a decrease in B. As before, an

increase in B will cause an increase in R:

R = k2*B

If we now consider that R and B are variable in magnitude, rather than

being events, we can solve the above two equations for the steady-state

result:

    R = k2*k1*(R' - R), or

            k1*k2
    R = ----------- R'
         1 + k1*k2

If k1*k2 becomes much larger than one, we find that the result approaches

R = R'.

Now thinking of R and B in terms of magnitudes leads to a stable system.

R will come to the magnitude specified by R', and B will be R'/k2. The

only requirement is that k1*k2 be so large that we can neglect the 1 in

the denominator.
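
A few numbers make the effect of loop gain concrete. This is a minimal Python sketch of the steady-state solution above; the function name `steady_state` and the values R' = 10 and k2 = 0.5 are arbitrary illustrative choices.

```python
def steady_state(r_ref, k1, k2):
    """Steady-state R and B for B = k1*(R' - R), R = k2*B."""
    gain = k1 * k2
    r = gain / (1.0 + gain) * r_ref     # R = k1*k2 / (1 + k1*k2) * R'
    b = k1 * (r_ref - r)                # B = k1*(R' - R)
    return r, b

for k1 in (1.0, 10.0, 1000.0):
    r, b = steady_state(r_ref=10.0, k1=k1, k2=0.5)
    print(f"k1*k2 = {k1 * 0.5:7.1f}   R = {r:.4f}   B = {b:.4f}")
```

As k1*k2 grows, R approaches R' = 10 and B approaches R'/k2 = 20.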

----------------------------------------------

Needless to say, realizing either of the above two models in a physical

system requires adding some details: in the reinforcement model we must

find a physical way of doing the equivalent of changing a probability.

In the negative feedback model we must insert a filter that allows k1*k2

to be large without resulting in oscillations.
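
The need for a filter shows up as soon as the negative feedback loop is stepped through time. In the sketch below the "filter" is a simple slowing factor on the output, which is my assumption about one way to do it, not a claim about the only way; `run_loop` and its parameters are hypothetical names. With `slowing=1.0` the loop is unfiltered and, at a loop gain of 50, each pass overshoots more than the last. With a small slowing factor the same gain settles at the steady-state value.

```python
def run_loop(r_ref, k1, k2, slowing, steps):
    """Step the loop B = k1*(R' - R), R = k2*B with a slowed output."""
    b = 0.0
    for _ in range(steps):
        r = k2 * b                      # environment: R = k2*B
        target = k1 * (r_ref - r)       # unfiltered output k1*(R' - R)
        b += slowing * (target - b)     # move only part of the way (assumed filter)
    return k2 * b                       # final value of R

filtered = run_loop(r_ref=10.0, k1=100.0, k2=0.5, slowing=0.01, steps=200)
unfiltered = run_loop(r_ref=10.0, k1=100.0, k2=0.5, slowing=1.0, steps=20)
print("filtered R:  ", filtered)
print("unfiltered R:", unfiltered)
```

For this particular update rule the loop is stable only when the slowing factor is below 2/(1 + k1*k2), so a higher gain needs proportionally more slowing.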

----------------------------------------------

Under what conditions would we expect the reinforcement model to be

appropriate? The answer can be found by asking under what conditions we

could legitimately consider only the _occurrence_ of a behavior and a

reinforcement without also considering their magnitudes.

When B is considered as an event, we have only the choice between that

behavior being observed and its not being observed. This is equivalent

to seeing either the behavior B that results in R, or SOME OTHER

BEHAVIOR that does not result in the appearance of R. When an operant

behavior is being acquired, this is exactly the situation that is seen.

If the rat is not pressing the bar, it is doing something else. So to

speak of the probability of the behavior is to speak of the probability

that the _right_ behavior, among all those kinds of behavior that are

possible, has occurred.

To say that a reinforcement increases the probability of a behavior thus

can legitimately be taken as meaning that R increases the probability

that the _right kind_ of behavior will appear. Instead of rearing up

against the wall of the cage in one place, the rat rears up in another

place such that the act of rearing up causes the lever to be depressed

by the front paws. What we observe is an increase in the probability

that the rat will rear up in that one critical position rather than in

any other. Equivalently, since we can't directly observe probabilities,

we can say that the rat's behavior changes so it shows an ever-greater

proportion of rearing-up actions in the right place relative to all

other places.

This is the kind of positive feedback situation in which there is a

natural limit to the runaway condition: the animal can't spend more than

all of its time performing the right kind of behavior to produce

reinforcements. The reinforcement model is therefore applicable, if

anywhere, to the process of _acquiring the right kind of behavior_ as

opposed to behaviors that have no effect on producing reinforcements.

Once the right kind of behavior has been established, the reinforcement

model has gone as far as it can go. It cannot also account for the fact

that the animal comes to produce exactly the amount of behavior that is

needed to bring the amount of reinforcement to the _right amount_. That

process requires a negative feedback model with a reference signal, with

provision for adjusting the feedback parameters for best control.

-------------------------------------------

The one area of confusion that is left concerns measuring behavior in

terms of rate of occurrence. This confusion shows up because in the

attempt to explain performance as well as acquisition of behavior,

experimenter-theoreticians set up experiments in which it was

impossible, even when the right behavior had been acquired, to vary the

_amount_ of effect produced by a single behavior. The use of levers and

keys effectively removed any ability of an animal to have a quantitative

effect on the reinforcement by varying its efforts.

Even in this artificial situation, an animal could vary its behavior as

a way of varying the _average_ amount of obtained reinforcement. Once

the right behavior had been found, pressing a key or lever, the animal

(or at least some animals) could now vary the amount of received

reinforcement by varying the rate at which it pressed the key or lever.

So a negative feedback model can be set up in which R and B are both

measured in terms of rates, with R having an average perceived effect

depending on the rate of delivery of reinforcements and the internal

rate of decay of effects of reinforcements. This internal perceptual

effect could then be compared with the desired effect, and the

difference could be converted proportionally into a rate of generation

of bar or key pressing acts.
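
The rate version of the negative feedback model can be sketched in a few lines. Everything specific here is an assumption made for illustration: a fixed-ratio schedule, a perceived effect `p` that rises with each reinforcement and decays at a constant fractional rate, simple Euler integration, and the names `settle`, `ref`, `gain` and `decay`.

```python
def settle(ratio, ref=10.0, gain=50.0, decay=1.0, dt=0.01, steps=2000):
    """Return settled (behavior rate, reinforcement rate) on a fixed-ratio schedule."""
    p = 0.0                                 # perceived effect of reinforcement
    for _ in range(steps):
        b = max(0.0, gain * (ref - p))      # behavior rate from the error (ref - p)
        r = b / ratio                       # reinforcement rate set by the schedule
        p += dt * (r - decay * p)           # perception: driven by r, decaying internally
    b = max(0.0, gain * (ref - p))
    return b, b / ratio

for ratio in (2, 4, 8):
    b, r = settle(ratio)
    print(f"ratio {ratio}: behavior rate = {b:6.2f}, reinforcement rate = {r:5.2f}")
```

In this sketch a leaner schedule (larger `ratio`) gives a lower reinforcement rate together with a higher behavior rate, the inverse relation that the negative feedback model predicts.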

The reinforcement model can _also_ be set up to produce a continuous

relationship between rate of bar pressing and rate of reinforcement,

through the intermediate step of effects of reinforcement on probability

of behavior per unit time. However, this creates a conflict between the

reinforcement model and the negative feedback model. The reinforcement

model says that an increase in reinforcement rate must produce an

increase in the probability (frequency) of behavior, while the negative

feedback model says that an increase in reinforcement rate must produce

a _decrease_ in the behavior rate. If this conflict can be resolved,

there will be no further difficulties between the reinforcement model

and the negative feedback control system model. This is not to say that

the models will not be further modified on the basis of other

considerations, but at least this direct contradiction will have been

removed.

---------------------------------------------------------------------

The contradiction can be removed by saying that the reinforcement model

applies strictly to the process of increasing the relative frequency

with which the right behavior occurs, in comparison with all non-

reinforcement-producing behaviors that might also occur. Once the right

behavior has been acquired, the negative feedback control system model

applies to the process of creating the desired amount of reinforcement.

In the data that appear ambiguous, supporting the reinforcement model at

one extreme and the control model at the other, the difference can now

be explained easily, and in a way that can be tested against

experimental data. Where reinforcement rates are low enough, the

reinforcement model applies, and the animal begins to search for other

behaviors that will more reliably produce the reinforcer. This means

that other behaviors besides the bar-pressing will be seen, and

proportionally less time will be spent doing the right behavior. This

shows up as an apparent drop-off of behavior with decreasing

reinforcement rates, or an apparent increase in behavior rate (of the

kind being measured) with increasing reinforcement rates. Where

reinforcement rates are high enough, the animal essentially always uses

the right behavior, and controls the amount of received reinforcer near

a specific reference level by varying its rate of behavior: now an

increase in reinforcement rate goes with a decrease in behavior rate.

-----------------------------------------------------------------------

Best to all EABers and PCTers,

Bill P.