# A Reinforcement Story

[From Samuel Saunders (951222:14:46:54 EST)]

The following presentation is "borrowed" in part from James Greeno and
Frank Restle. It illustrates one class of reinforcement approach, and
affinities can be seen in the approaches of Premack, Allison, and Timberlake,
among others. This is an incomplete presentation, but it may help to
further the discussion. This presentation does not address 'acquisition'.
I will present acquisition of a bar press according to W.K. Estes in a few
days to illustrate reinforcement approaches to 'acquisition'.

Let us assign a value to each behavior possible in some situation. We can
then express the probability of some behavior Ai in terms of the values:
P(Ai) = v(Ai) / SUMx v(Ax).
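
As a minimal sketch of the ratio rule above (the behavior names and values are made up for illustration and are not from Greeno or Restle), the probability of each behavior is its value divided by the sum of the values of all behaviors available in the situation:

```python
def choice_probabilities(values):
    """Return P(Ai) = v(Ai) / sum over x of v(Ax) for every behavior.

    `values` maps a behavior name to its non-negative value v(Ax).
    """
    total = sum(values.values())
    if total <= 0:
        raise ValueError("the values must sum to a positive number")
    return {name: v / total for name, v in values.items()}


# Hypothetical values for three behaviors in one situation.
values = {"eat": 6.0, "drink": 3.0, "groom": 1.0}
probs = choice_probabilities(values)
print(probs)
```

Raising one v(Ax), for instance through deprivation, raises that behavior's probability and lowers all the others, since the probabilities always sum to 1.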

When opportunity to engage in some behavior is denied for some time
(deprivation), the behavior will have an increased probability when it is
first made available again. Data from Skinner (1938) The Behavior of
Organisms (p.348) show that a rat deprived of food except in an apparatus
ate pellets, after a brief initial burst, at a rate of 1 pellet a minute.
When eating was interrupted by making pellets unavailable for a 13 min
period 25 min into the session, rate of pellet consumption went to about 3
per minute for several minutes before returning to the 1 per minute rate.
(This is cited to show two kinds of deprivation: deprivation of eating for a
long period raises the probability of eating, and 'relative deprivation' by
interfering with the resulting rate produces an additional increment.)
Evidence can be marshalled that this is typical of behavior in general, that
deprivation produces an increment in the behavior when opportunity to
perform it is first introduced. This can be treated as a change in the
relevant v(Ax), as long as the Ax are independent. In some cases (eating
and drinking, for instance) there are interactions, so that deprivation of
one response can affect the probability of another response
directly (for example, water deprivation decreases eating).

When opportunity to engage in a behavior is made contingent on another
behavior, a base response B must be performed in order to perform a
contingent response C, so, if B is much less likely than C, B must be
performed more than it would without the contingency in order for C to be
performed as much as it would without the contingency. The result is a
compromise. Letting v0(Ax) indicate the pre-contingency value of Ax and
v1(Ax) the value of Ax with the contingency, these considerations may be
expressed:

Reinforcement:
v1(B) = b * v0(B) + c * v1(C)
Deprivation:
v1(C) = v0(C) + h(1-t)
where h is a constant and t is proportion of the time C is
available.

The vx(Ax) cannot be measured directly, but only relative to a context that
offers other alternative behaviors. If the pre-contingency and contingency
conditions are presented in a constant context, the other Ax can be
combined into one term, say M, and the above can be expressed _re_ a
constant v(M):

Reinforcement:
v1(B)/v(M) = b * v0(B)/v(M) + c * v1(C)/v(M)
Deprivation:
v1(C)/v(M) = v0(C)/v(M) + h/v(M) * (1-t)
setting h' = h/v(M)
v1(C)/v(M) = v0(C)/v(M) + h'(1-t)
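
The two equations can be sketched in code as follows. This is only an illustration: every number is hypothetical, and all values are taken to be already expressed relative to v(M), so that v(M) = 1 and h plays the role of h'.

```python
def deprivation_value(v0_c, h, t):
    """v1(C) = v0(C) + h * (1 - t), where t is the proportion of time C is available."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie between 0 and 1")
    return v0_c + h * (1.0 - t)


def reinforcement_value(v0_b, v1_c, b, c):
    """v1(B) = b * v0(B) + c * v1(C)."""
    return b * v0_b + c * v1_c


# Hypothetical parameters, all relative to v(M) = 1.
v1_c = deprivation_value(v0_c=2.0, h=4.0, t=0.25)               # 2 + 4 * 0.75
v1_b = reinforcement_value(v0_b=0.5, v1_c=v1_c, b=0.8, c=0.3)   # 0.8 * 0.5 + 0.3 * v1_c
print(v1_c, v1_b)
```

The less time C is available (smaller t), the larger v1(C) becomes, and through the reinforcement equation the larger v1(B) becomes as well.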

Let us consider the proposed Marken experiment. The contingent response is
viewing pictures for a fixed length of time p, while the base response is
pressing the mouse button, which must be pressed n times to view a picture.
Then for a session of length d in which a total of s presses occur, the
value t can be calculated
t = (s/n * p) / d .
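
A small sketch of this calculation, with session numbers invented for illustration (they are not taken from the proposed experiment). It uses s/n directly, as in the formula above, rather than counting only completed ratios:

```python
def availability_proportion(s, n, p, d):
    """t = (s / n * p) / d.

    s: total presses in the session
    n: presses required per picture (the ratio)
    p: viewing time per picture
    d: session length, in the same time unit as p
    """
    if n <= 0 or d <= 0:
        raise ValueError("n and d must be positive")
    return (s / n) * p / d


# Hypothetical session: 600 presses at a ratio of 10 give 60 pictures,
# each shown for 5 s, in a session of 1800 s.
t = availability_proportion(s=600, n=10, p=5.0, d=1800.0)
print(t)
```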

Since button pressing is not very likely in the absence of the
contingency, we may set v0(B)/v(M) = 0. Then
v1(B)/v(M) = c * v1(C)/v(M)

Substituting the deprivation equation into the above:
v1(B)/v(M) = c [ v0(C)/v(M) + h' (1-t)] .

Some considerations:
If we vary the picture time by occasionally presenting the picture longer,
but not so much that we cause changes in b or c, then t will increase, so
(1 - t) decreases, and P1(B) = v1(B)/(v1(B) + v(M)) will decrease.
The value in the contingent condition will be higher for a response with
0 initial probability than for a neutral response.
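
The first consideration can be checked numerically. The sketch below chains the equations for the case v0(B)/v(M) = 0 and compares several picture durations. All parameter values are hypothetical, and the number of presses s is held fixed across durations purely for illustration; in an actual session s would itself depend on P1(B).

```python
def predicted_press_probability(s, n, p, d, v0_c_rel, h_rel, c):
    """Return (t, P1(B)) for the case v0(B)/v(M) = 0.

    v0_c_rel stands for v0(C)/v(M) and h_rel for h' = h/v(M).
    """
    t = (s / n) * p / d                       # proportion of time C is available
    v1_c_rel = v0_c_rel + h_rel * (1.0 - t)   # deprivation equation
    v1_b_rel = c * v1_c_rel                   # reinforcement equation with v0(B) = 0
    p1_b = v1_b_rel / (v1_b_rel + 1.0)        # P1(B) = v1(B) / (v1(B) + v(M))
    return t, p1_b


# Hypothetical session: 600 presses, ratio 10, 1800 s long.
results = [
    predicted_press_probability(s=600, n=10, p=p, d=1800.0,
                                v0_c_rel=2.0, h_rel=4.0, c=0.3)
    for p in (2.0, 4.0, 8.0)
]
for (t, p1_b), p in zip(results, (2.0, 4.0, 8.0)):
    print(f"p = {p:.0f} s  t = {t:.3f}  P1(B) = {p1_b:.3f}")
```

Under these assumptions, longer picture durations give a larger t and a smaller P1(B), which is the direction stated above.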

In the literature, b varies as a function of FR value when FRs have been
used with non-neutral base responses. A quick and by no means definitive
examination suggests that b ~= log(n) / k , where k is fixed at least for a
particular experiment, may apply. This could reflect something like
'perceived numerosity'. If v0(B) is not 0, but < 0, then increased ratio
would produce an increased subtractive term in the reinforcement equation.
Negative v0 could not be observed directly, of course, but might be
measured in a situation that includes a B1 (presumed 0 or negative), a B2
(measured v0 > 0), and C. Then providing the contingent relationship for
B2, a negative v0(B2) could be seen from a v1(B2) less than predicted as
above.

This all appears to have a hidden control model inside. The v0
particularly suggest reference levels. Timberlake has gone part way toward
a control interpretation, expressing things in terms of "behavioral set
points". From there, it is not a very big step to move to control of
perception, and perceptual reference. I wonder if that step has been
blocked in part by a lack of insight into a method for determining the
perceptual function and perceptual set point (the TEST)?

I would like to spend my time working on PCT models, and maybe even
experiments, rather than trying to put reinforcement thinking into
reasonable form for comparison and contrast. I volunteered to do this,
however, so I will continue to do it. I think it is important to represent
the reinforcement view as accurately as possible, but I don't assert that
view myself, so please try to keep comments on a scientific rather than
personal level.

//----------------------------------------------------------------------------
//Samuel Spence Saunders,Ph.D.

[From Rick Marken (951222.1500)]

Samuel Saunders (951222:14:46:54 EST) --

On casual inspection of the reinforcement model you present, I don't see any
equations that map variables observed in an experiment (response rate,
reinforcement rate) to the variables described in the model. What, for
example, do the equations you derived for behavior in a fixed ratio human
operant conditioning experiment:

t = (s/n * p) / d

v1(B)/ v(M) = c [ v0(C)/v(M) + h' (1-t)] .

tell you about the quantitative relationship that will be observed between
temporal variations in response and reinforcement rate? How, for example, is
v1(B) related to observed response rate; how is v0(C) related to observed
reinforcement rate?

I would like to spend my time working on PCT models, and maybe even
experiments, rather than trying to put reinforcement thinking into
reasonable form for comparison and contrast.

I'm glad you want to work on PCT models but I am _really_ puzzled by your
reluctance to compare and contrast PCT and reinforcement theory. I really
don't get it. Could you help me out here, Sam? Here's my problem:

We have two theories of behavior (reinforcement and PCT) that are based on
totally different assumptions about how organisms function; one (reinforcement
theory) says that organisms function by emitting behaviors (actions) that are
strengthened or weakened (whether the organism "likes it or not") by their
consequences; the other (PCT) says that organisms function by varying
behaviors (actions) as necessary to produce the consequences they want.

What could possibly be wrong with comparing these two drastically (and rather
importantly) different views to see which one seems closer to being
the correct picture of how organisms work? This is a "real life" question;
there are a lot of people running around out there -- some of them in very
important positions in society -- who believe, deep down in their hearts,
that people operate according to reinforcement theory. These people think
that rewards, punishments, incentives, and contingencies are essential
for dealing with employees, the unemployed, students, children, welfare
recipients, criminals, etc. Why don't we give the world a nice present in
1996 and show that the "reinforcement story" is just another Western myth;
let's show that behavior selects, and is not selected by, its consequences.

Best

Rick