Reinforcement Learning

[From Bruce Abbott (2017.09.14.2005 EDT)]

Sutton and Barto’s book, Reinforcement Learning: An Introduction, draft of a second edition (2012) is available as a pdf at:

http://people.inf.elte.hu/lorincz/Files/RL_2006/SuttonBook.pdf

From the Preface:

“. . . in 1979 we came to realize that perhaps the simplest of the ideas, which had long been taken for granted, had received surprisingly little attention from a computational perspective. This was simply the idea of a learning system that wants something, that adapts its behavior in order to maximize a special signal from its environment. This was the idea of a “hedonistic” learning system, or, as we would say now, the idea of reinforcement learning.” [emphasis mine]

Bill Powers used to argue that food has no special reinforcing powers. The idea that it does have such powers he referred to, rather derisively, as a “dormitive principle.” Opium puts people to sleep because it has a sedative property, or in other words, because it has the capacity to put people to sleep. This is circular reasoning, a tautology. Similarly, reinforcers are able to strengthen the behavior that produces them because they have a reinforcing property, or in other words, the ability to strengthen behavior.

According to PCT, however, a hungry rat responds for food, not because the food has acted as a reinforcer for the behavior that produces it, but because hunger is a manifestation of error in an intrinsic variable, and this hunger has driven the reorganizing system to vary aspects of the control hierarchy until a control system emerges that diminishes this error. In this case, it is a system organized to produce and consume food. The food has not reinforced the behavior that now produces the food; rather, by reducing error (upon consumption), it has prevented a control system that succeeds in producing food from being reorganized away.

The problem with reorganization as currently described in PCT is that there is no mechanism that identifies what part of the organism’s vast control hierarchy needs to be modified in order to reduce error in some intrinsic variable. Bill was aware of the problem but so far as I know was not able to find a satisfactory solution to it. Computer models of reorganization always benefit from the programmer’s decisions as to what control system will be subject to reorganization and what aspect of that control system will be reorganized – almost always, if not always, the loop gain. The basic structure of the system is not affected, only the gain, which is basically saying that reorganization makes the system self-tuning (with respect to loop gain).
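
To make the "self-tuning" point concrete, here is a minimal sketch (illustrative only, with invented parameters; it is not Powers' code) of reorganization restricted to loop gain: a single proportional control loop is run repeatedly, mean squared error stands in for intrinsic error, and the gain is perturbed E. coli style, changing direction at random whenever the error stops improving.

```python
import random

# A minimal sketch of "reorganization as gain tuning" (illustrative only, not
# Powers' code): a single proportional control loop whose loop gain is the only
# thing reorganization may change.  The gain is perturbed E. coli style --
# change direction at random whenever intrinsic error stops improving.

def run_trial(gain, reference=10.0, disturbance=3.0, steps=200, dt=0.1, slowing=5.0):
    """Run the loop for a while; return mean squared error as a proxy for intrinsic error."""
    perception, output, sq_err = 0.0, 0.0, 0.0
    for _ in range(steps):
        error = reference - perception
        output += (gain * error - output) * dt / slowing   # leaky-integrator output function
        perception = output + disturbance
        sq_err += error ** 2
    return sq_err / steps

gain = 1.0                              # start with a poorly tuned loop
step = 0.5                              # current random change applied to the gain
prev_cost = run_trial(gain)
for _ in range(50):
    gain = max(0.1, gain + step)
    cost = run_trial(gain)
    if cost >= prev_cost:               # error not improving: "tumble"
        step = random.uniform(-1.0, 1.0)
    prev_cost = cost
print(f"reorganized gain ~ {gain:.1f}, mean squared error ~ {prev_cost:.3f}")
```

The structure of the loop never changes; only its gain wanders toward whatever value keeps intrinsic error low, which is the sense in which such models are merely self-tuning.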

From time to time I revisited this issue with Bill and tried to coax him into accepting that reinforcement theory solves the problem. The hungry animal is not looking at random for some action that will reduce intrinsic error; it is looking for some action that will produce food. The animal will associate whatever it was doing at the time that food appeared with the event of food delivery, and will reorganize in such a way that this behavior will tend to be more-or-less repeated. In this way a food-delivery control system emerges that is targeted to the specific error – experienced as hunger – that needs to be reduced. Hunger (error) generates behavior that produces food, and food consumption reduces hunger.
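
A hedged sketch of the targeting idea just described, under the assumption that it amounts to crediting whatever the animal was recently doing when intrinsic error drops. The action names, trace length, and weight update are all invented for illustration.

```python
import random
from collections import deque

# A hedged sketch of the targeting idea above: credit whatever the animal was
# recently doing when intrinsic error (hunger) drops.  Action names, trace
# length and the weight update are all invented for illustration.

actions = ["press_lever", "sniff_corner", "groom", "rear_up"]
weights = {a: 1.0 for a in actions}     # propensity to emit each action
recent = deque(maxlen=3)                # what the animal was doing lately

def act():
    """Pick an action in proportion to its current weight and remember it."""
    choice = random.choices(actions, weights=[weights[a] for a in actions])[0]
    recent.append(choice)
    return choice

def on_error_reduced():
    """Called when hunger error drops, e.g. when food is delivered and eaten."""
    for a in recent:
        weights[a] *= 1.5               # recently emitted behavior becomes more likely to recur
```

Whether such an associative trace belongs inside the reorganizing system, or is just reinforcement theory restated, is what the rest of this thread argues about.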

Bruce

[From Bruce Nevin (2017.09.15.11:21ET)]

The standard PCT model refers to rates of firing in neuron(-bundle)s, and that is what the interior quantities in a PCT simulation represent.

Astrocytes (star-shaped glial cells) hold neurons in place, supply nutrients, and digest parts of dead neurons. They can attract new cells (like immune cells and perhaps even adult neural stem cells) to repair damage. But in addition to these ‘housekeeping’ functions, research in the last decade or so shows that they communicate with neurons and modify their input and output signals. Each is connected to a unique collection of neurons and up to thousands of synapses, with a ‘territory’ which does not overlap that of any other astrocyte. There is growing evidence that astrocytes can alter how a neuron is built by directing where to make synapses or dendritic spines.

http://learn.genetics.utah.edu/content/neuroscience/braincells/

They “guide” growing neurons / axons during the development of the brain.

http://www.connexin.de/en/neuron-astro-cytes-micro-glia.html

Here’s a survey of the relevant literature. I’ve only been able to access the abstract.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2982258/

But there’s more to the war of the soups and the sparks than the neurotransmitters at synapses. I have conjectured that reorganization is localized by changes in the ‘soup’ of neurochemicals in the environment of persistent high error output. Astrocytes might mediate and facilitate this effect.

I have neither the means nor the competence to investigate this, but this is the sort of thing I would look for.

···

On Fri, Sep 15, 2017 at 4:13 AM, Warren Mansell wmansell@gmail.com wrote:

Hi folks,

I think the problem is that we haven’t really thought in detail about how learning may appear completely different depending on which level of the hierarchy the learning is occurring at. I would think that one of the reasons infant development is constrained to stages, which, according to Frans, presumably relate to the levels of hierarchy emerging, is that some of these key components of the lower levels of the hierarchy can be put in place then without having to ‘choose’ from the vast awareness of perceptions to place in awareness. As adults we need to control our awareness to learn these new links… I think…

[From Rick Marken (2017.09.16.1450)]

···

Bruce Abbott (2017.09.14.2005 EDT)

BA: Bill Powers used to argue that food has no special reinforcing powers.

RM: I think what Bill argued is that there is no such thing as reinforcing power of any kind, special or not. The appearance that some consequences of action seem to strengthen those actions – that certain consequences have reinforcing power – is another example of a behavioral illusion. Consequences of action (like the food pellets delivered after a bar press in operant conditioning) that appear to be reinforcing are actually controlled variables. (See Yin, H. (2013) “Restoring purpose to behavior” in G. Baldassarre and M. Mirolli (eds.), Computational and Robotic Models of the Hierarchical Organization of Behavior, DOI 10.1007/978-3-642-39875-9_14; particularly Figure 5.)

BA: According to PCT, however, a hungry rat responds for food,

RM: I don’t think PCT says that the rat responds to food. PCT recognizes that in an operant situation the food responds to the rat as much as the rat responds to the food. The food is in a control loop. PCT says that the rat controls the food (actually PCT says the rat controls a variable aspect of the food, such as its rate or probability of occurrence; the Yin research cited above suggests that it is the rate of food occurrence that is controlled).

BA: not because the food has acted as a reinforcer for the behavior that produces it, but because hunger is a manifestation of error in an intrinsic variable, and this hunger has driven the reorganizing system to vary aspects of the control hierarchy until a control system emerges that diminishes this error.

RM: I think this can be explained by a model that has the intrinsic error (hunger) in the blood sugar level control system increase the reference for the system controlling for the rate of food input and decrease it when the hunger error decreases.

BA: In this case, it is a system organized to produce and consume food.

RM: Yes.

BA: The food has not reinforced the behavior that now produces the food; rather, by reducing error (upon consumption), it has prevented a control system that succeeds in producing food from being reorganized away.

RM: Yes.

BA: The problem with reorganization as currently described in PCT is that there is no mechanism that identifies what part of the organism’s vast control hierarchy needs to be modified in order to reduce error in some intrinsic variable.

RM: I don’t see this as a problem. Organisms have surely developed the systems that control for food intake as a means of controlling blood sugar level (hunger) from a very early age. Rats in a Skinner box are doing what rats do outside the Skinner box when they are hungry; they forage for food. The intrinsic error (hunger) can be seen to increase the reference for the perception of food input, resulting in error that drives the foraging behavior; rats forage when they are hungry. I see this foraging behavior as the varying outputs (references to lower level control systems) of the food input control system; these outputs are varied to counter the varying disturbances to the way food can be obtained.

BA: Bill was aware of the problem but so far as I know was not able to find a satisfactory solution to it.

RM: I don’t see any problem here either. I can envision a model where intrinsic “hunger” error increases the reference for the perception of the rate of food intake. When the perception of the rate of food input is well below this reference, there is error that drives the foraging behavior, which is, as I said, variation in the outputs of the food control system that leads to the somewhat random-appearing variations in the way the organism will try to get food: moving about, sniffing, climbing, etc. Once one of these outputs (like leaning on the bar in a Skinner box) results in some food, the variation in these references for foraging behaviors slows and eventually ceases when the food is being delivered regularly. Foraging looks like reorganization, but since animals do this as their regular means of finding food, I think it might be more appropriate to see foraging as the normal, variable output of the food input control system. I think robots that seek electrical outlets when they are running low on power are doing something like foraging; someone who is inclined to see behavior as the behaviorists do (and doesn’t know how the robot works) might be inclined to see the electrical outlet as having reinforcing power over the behavior of the robot.

BA: From time to time I revisited this issue with Bill and tried to coax him into accepting that reinforcement theory solves the problem.

RM: Actually, I don’t see how reinforcement theory solves the problem. But it does point to the fact that animals are able to perceive that this particular aspect of the environment – food – is what they have to control for in order to stop being hungry. So it is interesting that animals are able to see that some things are food and some things are not. A real model of control of blood sugar level would have to have perceptual systems that can distinguish food from non-food – just as the above-mentioned robot has to be able to tell an electrical outlet from other things, like food.

BA: The hungry animal is not looking at random for some action that will reduce intrinsic error; it is looking for some action that will produce food.

RM: Exactly. And that’s how my model would work. The hungry animal has an error in the system controlling blood sugar level. The size of this error determines the magnitude of the reference level for food input. The food input control system varies, fairly randomly, its output as the references for lower level control systems that are controlling for perceiving smells, tastes, sights or even sounds that are indications of food; these variations are seen as foraging.
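
A rough sketch of the two-level arrangement RM describes, assuming two coupled loops with made-up constants and a toy environment (this is not a fitted or published model): intrinsic hunger error sets the reference for the rate of food input, and error in that system drives the variation in lower-level references that is seen as foraging.

```python
import random

# A rough sketch of the two-level arrangement described above, with made-up
# units, constants and a toy environment -- illustrative, not a fitted model.
# Hunger (intrinsic error) sets the reference for the rate of food input, and
# error in the food-input system drives the variation in lower-level
# references that appears as foraging.

blood_sugar = 50.0
BLOOD_SUGAR_REF = 100.0
food_rate = 0.0
foraging_ref = 0.0                      # reference sent to lower-level "forage" systems

for t in range(300):
    hunger = BLOOD_SUGAR_REF - blood_sugar          # intrinsic error
    food_rate_ref = 0.2 * hunger                    # hunger sets the food-input reference
    food_error = food_rate_ref - food_rate

    # while food error is large, keep varying the lower-level references;
    # once food arrives at the reference rate, the variation dies down
    foraging_ref += food_error * random.uniform(-0.1, 0.3)

    food_rate = max(0.0, 0.5 * foraging_ref)        # toy environment: some foraging produces food
    blood_sugar += 0.1 * food_rate - 0.3            # eating raises blood sugar; metabolism lowers it

print(f"remaining hunger error: {BLOOD_SUGAR_REF - blood_sugar:.1f}")
```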

BA: The animal will associate whatever it was doing at the time that food appeared with the event of food delivery, and will reorganize in such a way that this behavior will tend to be more-or-less repeated. In this way a food-delivery control system emerges that is targeted to the specific error – experienced as hunger – that needs to be reduced. Hunger (error) generates behavior that produces food, and food consumption reduces hunger.

RM: I would prefer to go with a reorganization type process – where the rate of variation in the references that produce foraging decreases after a particular “forage” results in food – rather than an associative strengthening process. But we really don’t have much data on that so either one should work. But from a PCT perspective, I think the only thing that is “special” about food for a hungry animal is that the hunger error can be assumed to “automatically” set the reference for control of food input and not for control of other, completely irrelevant perceptions (like poems, no matter what William Carlos Williams says about men dying for want of what is found in them).

Best

Rick

Richard S. Marken

“Perfection is achieved not when you have nothing more to add, but when you have nothing left to take away.”
                --Antoine de Saint-Exupery

[From Rupert Young (2017.09.26 9.45)]

That link doesn't work for me. But I've been looking at this and thought I'd do a summary of RL from this, and then perhaps comment on it with respect to a PCT context. [Bruce provided this alternative link.]

RL is a method of learning by interacting with the environment, and evaluating the consequences of actions according to goals.

RL is about mapping situations to actions in order to maximise a reward signal. The learner must discover which actions yield the most reward by trying them, though the effects may not be immediate. These two characteristics—trial-and-error search and delayed reward—are the two most important distinguishing features of reinforcement learning.

···

http://matt.colorado.edu/teaching/highcog/readings/sb98c1.pdf
http://ufal.mff.cuni.cz/~straka/courses/npfl114/2016/sutton-bookdraft2016sep.pdf
An agent must be able to sense the state of the environment and be able to take actions that affect that state. Goals relate to the state of the environment. RL concerns these three aspects—sensation, action, and goal. All reinforcement learning agents have explicit goals, can sense aspects of their environments, and can choose actions to influence their environments.

An RL agent must *exploit* actions it has tried in the past and found to be effective, and it must also *explore* actions it has not tried before. Balancing the two is a dilemma.

Examples: choosing moves in chess; adjusting the parameters of an adaptive controller.

The four main elements of a reinforcement learning system are a policy, a reward function, a value function, and, optionally, a model of the environment.

  • a policy is a mapping from perceived states of the environment to actions to be taken when in those states; a set of stimulus-response rules (which may be stochastic).

  • a reward function maps perceived states of the environment to a reward value, indicating the intrinsic desirability of the state.

  • a value of a state is the total amount of reward an agent can expect to accumulate over the future starting from that state.

  • a model is something that mimics the behavior of the environment, and might predict the resultant next state and next reward; used for planning.

Action choices are made on the basis of value judgements. We seek actions that bring about states of highest value, not highest reward, because these actions obtain for us the greatest amount of reward over the long run.

In a tic-tac-toe example we examine the states that would result from each of our possible moves. Most of the time we select the move that leads to the state with the greatest value. Occasionally, however, we select randomly from one of the other moves instead. As moves are made, the values of states are adjusted according to a fraction of the difference between the values of the previous and new states. The game is played many times and the values of states converge to an optimal route through the states.
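
The update rule summarised here ("adjusted according to a fraction of the difference between the previous and new states") is the temporal-difference rule from Sutton and Barto's tic-tac-toe example. A generic sketch, with placeholder state representation and parameter values rather than the book's own code:

```python
from collections import defaultdict
import random

values = defaultdict(lambda: 0.5)   # estimated value of each board state
ALPHA = 0.1                         # step size: the "fraction" of the difference
EPSILON = 0.1                       # how often to explore instead of exploit

def choose_move(state, candidate_states):
    """Mostly pick the successor state with the greatest value; occasionally explore."""
    if random.random() < EPSILON:
        return random.choice(candidate_states)
    return max(candidate_states, key=lambda s: values[s])

def update(previous_state, new_state):
    """Back up a fraction of the value difference to the earlier state."""
    values[previous_state] += ALPHA * (values[new_state] - values[previous_state])
```

Here `values` plays the role of the value function and `choose_move` the policy (with occasional exploration); in the book's example, reward enters only through the fixed values of terminal states (wins pinned at 1, losses at 0).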

There we have a basic summary. Some things sound very familiar from a PCT context, but what are the differences? What aspects of RL make it an invalid approach from a PCT, or biological, perspective? How can it be modified to align with PCT?

  Rupert

Hi Rupert,

I have a series of questions…

  1. What assumptions does RL make about control regardless of learning, such as how primitive organisms locomote?

  2. Does RL parse up its perception of the environment into an organised set of levels?

  3. How does it apply RL to avoidance of danger rather than pursuit of reward?

  4. Does it consider the problems that goal conflict might cause?

  5. Does it have a mechanism for spotlighting learning to a specific goal or set of goals at a time?

[From Erling Jorgensen (2017.09.26 0909 EDT)]

Rupert Young (2017.09.26 9.45)

RY: There we have a basic summary. Some things sound very familiar from a PCT context, but what are the differences? What aspects of RL make it an invalid approach from a PCT, or biological, perspective? How can it be modified to align with PCT?

EJ: I noticed the following issues.

RY (summarizing Reinforcement Learning): An agent must be able to sense the state of the environment and be able to take actions that affect that state. Goals relate to the state of the environment. RL concerns these three aspects—sensation, action, and goal. All reinforcement learning agents have explicit goals, can sense aspects of their environments, and can choose actions to influence their environments.

EJ: I was waiting to hear four more words at the end of this last sentence: “… to attain their goals.” If RL were to close the loop in terms of reducing error in achieving goals, some of its other concepts may not be necessary.

RY: The four main elements of a reinforcement learning system are a policy, a reward function, a value function, and, optionally, a model of the environment.

  • a reward function maps perceived states of the environment to a reward value, indicating the intrinsic desirability of the state.

    EJ: This is one of those extra intermediary concepts that takes on a driving force in RL. A more direct link that fills this role in PCT is the one listed above: reduction of error in attaining a goal. I don’t see that there is a need for “intrinsic desirability” of various states. What makes something desirable is attaching a goal to it. When the goal changes, it is no longer desirable.

    EJ: I believe there is room for PCT to entertain the notion of a gain-adjusting system, which I connect with the emotion system. This would be driven by the rate of change in error (does that make it a first derivative?). I am not sure how broadly error would be sampled for this. I don’t know whether the amygdala is a possible site for such sampling. But it does seem that humans and seemingly other animals utilize broad hormonal effects to adjust the gain of their control systems. Because certain chemicals may lean the system in certain directions, this may be one place where the RL notion of “reward value” could map onto a PCT understanding of “error reduction.” However, I still don’t like the tendency of reward to become a “dormitive principle,” as Bruce Abbott (2017.09.14.2005 EDT) pointed out.
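
For what it is worth, here is a very speculative sketch of the gain-adjusting idea EJ raises, assuming the adjustment is driven by the rate of change of absolute error; every constant and name is invented, and nothing here is standard PCT.

```python
# A speculative sketch of a gain-adjusting system driven by the rate of change
# of (absolute) error: rising error pushes gain up, falling error lets it relax
# back toward a resting value.  All constants are invented for illustration.

def adjust_gain(gain, error, prev_error, dt=0.1,
                sensitivity=0.5, resting_gain=5.0, relax=0.05):
    error_rate = (abs(error) - abs(prev_error)) / dt    # roughly a first derivative
    gain += sensitivity * error_rate * dt               # error getting worse -> more gain
    gain += relax * (resting_gain - gain) * dt          # slow drift back toward resting gain
    return max(0.0, gain)

# usage: inside a simulation loop one might write
#   gain = adjust_gain(gain, error, prev_error)
#   output += (gain * error - output) * dt / slowing
```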

All the best,

Erling

[From Rupert Young (2017.09.27 14.45)]

As far as I know ...

Hi Rupert,
I have a series of questions...

1. What assumptions does RL make about control regardless of learning, such as how primitive organisms locomote?

None.

2. Does RL parse up its perception of the environment into an organised set of levels?

No.

3. How does it apply RL to avoidance of danger rather than pursuit of reward?

Well, presumably the consequences of danger would be reflected in the reward value.

4. Does it consider the problems that goal conflict might cause?

No.

5. Does it have a mechanism for spotlighting learning to a specific goal or set of goals at a time?

Learning is based upon a specific goal value, e.g. three crosses in a line in tic-tac-toe equals 1. I don't think multiple goals are catered for.

Rupert

[From Rupert Young (2017.09.27 17.50)]

···

On 26/09/2017 12:39, Warren Mansell wrote:

Hi Rupert,
I have a series of questions...

2. Does RL parse up its perception of the environment into an organised set of levels?

Just came across this, "Nested Q-learning of hierarchical control structures", http://ieeexplore.ieee.org/abstract/document/549152/ but I'm not able to access it. Any chance anyone can get hold of it?

Rupert

[From Roger Moore (2017.09.27 18.16)]

Enjoy …

00549152.pdf (581 KB)

[From Rupert Young (2017.09.27 18.30)]

Thanks!

[From Rupert Young (2017.09.28 10.30)]

[Erling Jorgensen (2017.09.26 0909 EDT)]

RL does close the loop related to achieving goals, as I understand it. In tic-tac-toe the goal is three crosses in a line, represented by the reward value. Actions are selected such that the reward is maximised (i.e. the goal is reached).

Well, I think “intrinsic desirability” is related to PCT goals in that certain states represent sub-goals on the way to achieving a main goal, and some states will help attain that better than others. However, this may only be valid in well-defined, constrained environments such as games where you are dealing with discrete states rather than dynamic environments where the reference values continually change.

Rupert

[Erling Jorgensen (2017.09.28 0739 EDT)]

Rupert Young (2017.09.28 10.30)

Hi Rupert,

Thanks for the reply. I was wondering about your reactions to what I wrote.

RY: Well, I think “intrinsic desirability” is related to PCT goals in that certain states represent sub-goals on the way to achieving a main goal, and some states will help attain that better than others. However, this may only be valid in well-defined, constrained environments such as games where you are dealing with discrete states rather than dynamic environments where the reference values continually change.

EJ: This doesn’t sound very “intrinsic” to me. The claim of Reinforcement Learning seemed to be that reward value is an intrinsic property of certain states. Whereas with PCT, the goal / the reference is the source of that desirability. Remove that goal, or choose a different sub-goal, and the supposed ‘intrinsic-ness’ evaporates. (Upon re-reading your statement, we may be saying essentially the same thing: Goals = Desirability, intrinsically, or we could say, by definition.)

EJ: You had asked about the overlap or differences of a mapping of Reinforcement Learning onto Perceptual Control Theory. At present in PCT we do not have a good switching mechanism to see the change of goals in action, other than ‘Bump it Upstairs’ with postulating a higher level reference standard. We also view a Program level perception as a network of contingencies that, once a higher level goal has been determined, can navigate among descending reference specifications, depending on the perceptual results achieved so far (i.e., those are the contingency nodes.)

EJ: It is possible that “the four main elements of a reinforcement learning system” that you outlined [in Rupert Young (2017.09.26 9.45)] could fill that gap. RL includes a mapping from goal-related environmental states, to reward values, to expected accumulation of reward values, to planning among action options. So there are inserted two potentially measurable functions – reward values and expected reward values – to guide selection of goals and sub-goals.

EJ: My proposal and preference is for a gain-adjusting system, (where a zero level of gain can suspend a goal). I believe that error-reduction in PCT already fills that intermediary role that RL assigns to rewards and the expected value of accumulating those rewards. But PCT still needs to have some mechanism to switch goals on and off. It is not yet very specified to say, in effect, “the HPCT hierarchy does it.” Moreover, we already see (perhaps as a mixture of theory and phenomenon) that goals are carried out with various levels of gain. So PCT has the problem of discerning a mechanism for adjusting gain anyway. I would like to see a more direct tackling of that issue, rather than importing concepts (and perhaps assumptions) from Reinforcement Learning.

EJ: That’s my take on it, at any rate. All the best.

Erling

Hi all,

Thought I’d pitch in as someone working on reorganisation of hierarchies and also have studied reinforcement learning.

Well done on your find there, Rupert! Digney’s work is hard to track down. His work past about 2001 is near impossible to find due to him working in the military…

Some observations, I’m afraid not in chronological order…

EJ: This is one of those extra intermediary concepts that takes on a driving force in RL. A more direct link that fills this role in PCT is the one listed above: reduction of error in attaining a goal. I don’t see that there is a need for “intrinsic desirability” of various states. What makes something desirable is attaching a goal to it. When the goal changes, it is no longer desirable.

BH: The purpose of the state-based approach is to allow it to solve problems that do not suit simple error reduction by a single controller. For example, reaching an exit point of a room where one must bypass a wall first, which involves going in the wrong direction. Simply trying to get as close to the exit as possible would fail, hence the weighting of states allows it to bypass this problem by learning that the gap(s) in the wall are key places to reach in the trajectory. Do I think this is a sensible approach? Not necessarily at all. A properly structured hierarchy of error-reducing nodes should solve the problem. Furthermore, as you stated, when the goal changes, it’s no longer desirable… and reinforcement learning collapses here. A robust hierarchy of error-reducing controllers does not suffer in the same way. However, engineers do not have the patience or skill in many cases to design appropriately robust hierarchies. Therefore, reinforcement learning.

EJ: I believe there is room for PCT to entertain the notion of a gain-adjusting system, which I connect with the emotion system. This would be driven by the rate of change in error, (does that make it a first derivative?) I am not sure how broadly error would be sampled for this. I don’t know whether the Amygdala is a possible site for such sampling. But it does seem that humans and seemingly other animals utilize broad hormonal effects to adjust the gain of their control systems. Because certain chemicals may lean the system in certain directions, this may be one place where the RL notion of “reward value” could map onto a PCT understanding of “error reduction.” However, I still don’t like the tendency of reward to become a “dormitive principle,” as Bruce Abbott (2017.09.14.2005 EDT) pointed out.

BH: I’ve been thinking about (and would definitely test if I had time in my PhD) the notion of stress being a motivator in changing the “gains” of a control system (as the system gets “stressed”, it is more likely to change gains, and more rapidly). So, I’m also someone who’s been thinking about emotions being a gain changer. It’s still a very open topic with lots of room to explore, however. “Reward value” mapping onto “error reduction” is very simple. To maximise reward is to minimise negative reward (or punishment). Many reinforcement learning experiments actually use negative reward, so it’s commonly used, and they aim to minimise that. So, in both cases, minimisation is occurring. It’s quite plausible that the “reward value” signal could be considered a perceptual reference to the system, and then this removes a key difference. Obviously, they have huge differences in how they handle the understanding of the world (perceptual signals versus state-based representation).

EJ: This doesn’t sound very “intrinsic” to me. The claim of Reinforcement Learning seemed to be that reward value is an intrinsic property of certain states. Whereas with PCT, the goal / the reference is the source of that desirability. Remove that goal, or choose a different sub-goal, and the supposed ‘intrinsic-ness’ evaporates. (Upon re-reading your statement, we may be saying essentially the same thing: Goals = Desirability, intrinsically, or we could say, by definition.)

The reward value of the states is not the source of the desirability in Reinforcement Learning; the reward function is. The reward function dictates the desirability of a state. Technically, this does mean the reward value of the state is the motivator (i.e. what is desirable) for the agent, but this doesn’t reflect current reinforcement learning work. There are a variety of paradigms of reinforcement learning, many of which have much better results than state-reward-value-driven approaches. In particular, some approaches aim to select the best policy which maximises reward, which dodges the whole explicit state-action declaration altogether. To attempt to analogise to PCT, this would be like an individual PCT node learning how to change the gain to best keep the difference between input and reference as low as possible. Q-learning, which is very popular, gives reward values to state-action pairs rather than to specific states. This is because, obviously, specific actions may have different values in different states, and also the other way around.
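
For readers who have not met it, this is roughly what tabular Q-learning looks like; a generic textbook sketch with invented parameter values, not code from Digney or anyone else in this thread, included just to show what it means to attach values to state-action pairs rather than to states.

```python
import random
from collections import defaultdict

# Textbook tabular Q-learning, to make concrete what it means to value
# state-action pairs rather than states.  The environment, states and actions
# are left abstract; all parameter values are illustrative.

Q = defaultdict(float)                  # Q[(state, action)] -> estimated long-run reward
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # step size, discount, exploration rate

def choose_action(state, actions):
    """Epsilon-greedy policy over the current state-action values."""
    if random.random() < EPSILON:
        return random.choice(actions)                       # explore
    return max(actions, key=lambda a: Q[(state, a)])        # exploit

def q_update(state, action, reward, next_state, next_actions):
    """One-step Q-learning backup toward reward plus discounted best next value."""
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```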

Reinforcement learning still suffers from many of the problems it had when it was initially proposed. The curse of dimensionality means that considering every action and state combination is expensive. Discretising the action and state space can be complicated or impossible. Digney’s attempt to make it hierarchical was fascinating, and it’s definitely the right way to think about things. One of the lead researchers at Google DeepMind was talking to me about how the “curriculum” of deep neural networks is their next hurdle to go over.

I find it amusing that the “bump it upstairs” argument is referenced here, as it genuinely is the answer. “Changing the goal” is nothing more than different high level nodes firing and referencing lower nodes instead of the higher level nodes that relate to the previous goal. A diverse hierarchy not only allows adaptive and varied behaviour, but robust approaches that can handle disruption. The problem is that in traditional approaches, deriving these hierarchies proved to be difficult and often didn’t have anything like the intended positives (see Subsumption Architecture). This is why approaches such as reinforcement learning and deep neural networks have gained traction, using high amounts of data to develop complex linear transforms that allow one to bypass the need for developmental learning of layers of behaviour.

To actually help address the original point, reinforcement learning could be used in a variety of ways. Would it be effective? Probably not. The biggest underlying conceptual difference that I can see is that RL assumes that behaviour is state or action driven, whereas PCT and other closed loop control theories put behaviour as an emergent property of the control system’s response to a problem over time. Trying to put action or state spaces into PCT anywhere will be problematic.

Could reinforcement learning be used to learn the values for gains? It could, but the results would be poor and it would be overkill. Could reinforcement learning combined with Deep Learning act as a reorganiser for a PCT hierarchy? Yes, but I think, as alluded to before, it should be clear what needs reorganising anyway if the rate of error is high. You don’t need RL or DNNs (Deep Neural Networks) to do that. Could reinforcement learning combined with deep neural networks act as a perception generator? That’s a much more plausible possibility, but I don’t know where you’d start. It could definitely learn to identify patterns in the world that relate to specific problems, and then a PCT hierarchy could incorporate it and minimise error. After all, if you have the right perceptions, HPCT should be more than sufficient. It’s where the perceptions come from that I think is the thing PCT doesn’t answer.

Hope this helps,

Ben

Hi Ben, I don’t know much about RL but that was a very helpful email for me! I think that most of psychology is plagued by the idea that learning is a fundamental feature of the living organism. It’s not. Evolution has done most of the hard work! But most psychologists seem wary of such an apparently nativist philosophy. Whereas PCT makes it clear that control is fundamental on both an ontological and evolutionary basis, discovers that control must be control of input to be suitably flexible, builds the functional architecture to achieve this, and only then starts thinking about how to fit learning into the system!

All the best

Warren


[From Rupert Young (2017.09.29 12.30)]

Here are some musings, rather than concrete views.

[Erling Jorgensen (2017.09.28 0739 EDT)]

EJ: This doesn't sound very "intrinsic" to me. The claim of Reinforcement Learning seemed to be that reward value is an intrinsic property of certain states. Whereas with PCT, the goal / the reference is the source of that desirability. Remove that goal, or choose a different sub-goal, and the supposed 'intrinsic-ness' evaporates. (Upon re-reading your statement, we may be saying essentially the same thing: Goals = Desirability, intrinsically, or we could say, by definition.)

Yes, I think we are saying the same thing, more or less.

EJ: You had asked about the overlap or differences of a mapping of Reinforcement Learning onto Perceptual Control Theory. At present in PCT we do not have a good switching mechanism to see the change of goals in action, other than 'Bump it Upstairs' with postulating a higher level reference standard. We also view a Program level perception as a network of contingencies that, once a higher level goal has been determined, can navigate among descending reference specifications, depending on the perceptual results achieved so far (i.e., those are the contingency nodes.)

Yes, it seems to me that RL is largely addressing what we would call the program level, as its environments consist not of continuous control systems but of discrete events, such as game moves. They are calling these actions, which we'd call sub-goals. So in PCT terms the output of one control system would not be to vary a dynamic reference signal for a lower level but to select between (switch) discrete events that would achieve its own goal.

I think this is like switching between the brake and the throttle in a car: you're not just varying a signal but selecting a different control system. In that case there would then be a dynamically varying perception being controlled, such as the angle of the foot. But in a game simulation such a control system would be represented by a binary reference signal and output. In computational schemes, such as RL, this is represented as a simple action rule.
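
A rough sketch of that kind of switching (illustrative only, not an agreed PCT model): a higher-level speed system hands binary references to two lower-level systems instead of varying a single continuous reference:

    # Higher level: decide which lower-level system should be active.
    def speed_controller(ref_speed, perceived_speed, deadband=1.0):
        error = ref_speed - perceived_speed
        throttle_ref = 1.0 if error > deadband else 0.0   # engage throttle system
        brake_ref    = 1.0 if error < -deadband else 0.0  # engage brake system
        return throttle_ref, brake_ref

    # Lower level: each selected system still controls a continuous perception,
    # such as the angle of the relevant pedal, toward the reference it was handed.
    def pedal_controller(reference, perceived_angle, gain=2.0):
        return gain * (reference - perceived_angle)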

There is a challenge here for PCT: how could these discrete-state, problem-solving type environments be represented and addressed by PCT, and how would learning take place? Currently PCT cannot even provide a solution for learning to play something as simple as tic-tac-toe.

EJ: It is possible that "the four main elements of a reinforcement learning system" that you outlined [in Rupert Young (2017.09.26 9.45)] could fill that gap. RL includes a mapping from goal-related environmental states, to reward values, to expected accumulation of reward values, to planning among action options. So there are inserted two potentially measurable functions – reward values and expected reward values – to guide selection of goals and sub-goals.

I'm wondering whether some of the differences are just a matter of terminology: reward = goal, action = sub-goal. Although we'd probably reject the concept of "a set of stimulus-response rules", we'd need to reframe the methodology in the PCT context.

EJ: My proposal and preference is for a gain-adjusting system (where a zero level of gain can suspend a goal).

This is what Bill had in his arm reorganisation in LCS3, but is it appropriate for the program-level environment? In this case, rather than varying a dynamic reference signal, do you not have to switch between control systems?

EJ: I believe that error-reduction in PCT already fills that intermediary role that RL assigns to rewards and the expected value of accumulating those rewards. But PCT still needs to have some mechanism to switch goals on and off.

RL is an error-reduction methodology; the error is defined relative to the reward function. A question for PCT is: how does the PCT error affect the output when that output is a set of discrete choices (game moves)? Is this gain adjustment?

EJ: It is not yet well specified to say, in effect, "the HPCT hierarchy does it." Moreover, we already see (perhaps as a mixture of theory and phenomenon) that goals are carried out with various levels of gain. So PCT has the problem of discerning a mechanism for adjusting gain anyway. I would like to see a more direct tackling of that issue, rather than importing concepts (and perhaps assumptions) from Reinforcement Learning.

I agree, but the more I look at RL the less stark are some of the differences.

Rupert

[From Rupert Young (2017.09.29 12.35)]

Very useful response Ben. My comments below.

BH: The purpose of the state based approach is to allow it to solve problems that do not complement simple error reduction of a single controller. For example, reaching an exit point of a room where one must first bypass a wall, which involves going in the wrong direction. Simply trying to get as close to the exit as possible would fail; hence the weighting of states allows it to bypass this problem by learning that the gap(s) in the wall are key places to reach in the trajectory. Do I think this is a sensible approach? Not necessarily at all. A properly structured hierarchy of error reducing nodes should solve the problem.

Do such examples actually exist yet? I've been thinking about the Mountain Car problem (https://en.wikipedia.org/wiki/Mountain_car_problem), which would seem a good and reasonably simple learning problem, of continuous variables, that PCT should be able to address, and is analogous to your problem above. Do we have a hierarchy of error reducing nodes that can solve that problem?
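
For reference, a quick sketch of the standard Mountain Car dynamics (as in the Wikipedia / Gym formulation), with a naive "always push toward the goal" rule; it shows why one level of simple error reduction fails here, since the car must first move away from the goal to build momentum:

    import math

    def step(pos, vel, action):                  # action in {-1, 0, +1}
        vel += 0.001 * action - 0.0025 * math.cos(3 * pos)
        vel = max(-0.07, min(0.07, vel))
        pos = max(-1.2, min(0.6, pos + vel))
        if pos == -1.2:                          # inelastic wall at the left edge
            vel = 0.0
        return pos, vel

    pos, vel, goal = -0.5, 0.0, 0.5
    for t in range(1000):
        action = 1 if pos < goal else 0          # naive: always reduce distance to goal
        pos, vel = step(pos, vel, action)
    print("reached goal:", pos >= goal)          # False; the car never gets there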

I guess this RL assumption comes out of the trad AI problem solving legacy which involved discrete state spaces. PCT still needs to provide ways of solving such cases, which would address the high-level reasoning (or program level) aspects of cognition. Perhaps it's a matter of re-framing state spaces and actions as perceptions and goals.

Yes, I think that although there is some work on learning gains (arm reorg in LCS3), what is lacking is how perceptual functions are learned. Though we should be able to take techniques from neural networks for this; autoencoders or convolutional nns.

On a slightly different matter I came across this paper which uses RL for adaptive control systems, rather than discrete state spaces, which is much closer to what PCT normally addresses: "Reinforcement learning and optimal adaptive control: An overview and implementation examples" (https://pdfs.semanticscholar.org/b4cf/9d3847979476af5b1020c61367fa03626d22.pdf). I'm still looking at it so not sure what it is doing yet. Are you familiar with this approach?
Rupert


[From Rick Marken (2017.09.29.0845)]

···

On Fri, Sep 29, 2017 at 2:30 AM, Warren Mansell wmansell@gmail.com wrote:

WM: … PCT makes it clear that control is fundamental on both an ontological and evolutionary basis, discovers that control must be control of input to be suitably flexible

RM: Control is only control of input. There is no such thing as control of output. Just as there is no such thing as reinforcement, which is why I find this continuing discussion of reinforcement learning rather puzzling. Engineers do often refer to the variable controlled by a control system as "output". But that is just a terminology thing based on the fact that they are analyzing the system from the outside. A control loop – a closed negative feedback loop – is always organized around the control of its perceptual input.

Best

Rick


On 29 Sep 2017, at 02:50, B Hawker bhawker1@sheffield.ac.uk wrote:

Hi all,

Thought I’d pitch in as someone working on reorganisation of hierarchies and also have studied reinforcement learning.

Well done on your find there, Rupert! Digney’s work is hard to track down. His work past about 2001 is near impossible to find due to him working in the military…

Some observations, I’m afraid not in chronological order…

EJ: This is one of those extra intermediary concepts that takes on a driving force in RL. A more direct link that fills this role in PCT is the one listed above: reduction of error in attaining a goal. I don't see that there is a need for "intrinsic desirability" of various states. What makes something desirable is attaching a goal to it. When the goal changes, it is no longer desirable.

BH: The purpose of the state based approach is to allow it to solve problems that do not complement simple error reduction of a single controller. For example, reaching an exit point of a room where one must first bypass a wall, which involves going in the wrong direction. Simply trying to get as close to the exit as possible would fail; hence the weighting of states allows it to bypass this problem by learning that the gap(s) in the wall are key places to reach in the trajectory. Do I think this is a sensible approach? Not necessarily at all. A properly structured hierarchy of error reducing nodes should solve the problem. Furthermore, as you stated, when the goal changes, it's no longer desirable... and reinforcement learning collapses here. A robust hierarchy of error reducing controllers does not suffer in the same way. However, engineers do not have the patience or skill in many cases to design appropriately robust hierarchies. Therefore, reinforcement learning.


EJ: I believe there is room for PCT to entertain the notion of a gain-adjusting system, which I connect with the emotion system. This would be driven by the rate of change in error (does that make it a first derivative?). I am not sure how broadly error would be sampled for this. I don't know whether the Amygdala is a possible site for such sampling. But it does seem that humans and seemingly other animals utilize broad hormonal effects to adjust the gain of their control systems. Because certain chemicals may lean the system in certain directions, this may be one place where the RL notion of "reward value" could map onto a PCT understanding of "error reduction." However, I still don't like the tendency of reward to become a "dormitive principle," as Bruce Abbott (2017.09.14.2005 EDT) pointed out.

BH: I've been thinking about (and would definitely test if I had time in my PhD) the notion of stress being a motivator in changing the "gains" of a control system (as the system gets "stressed", it is more likely to change gains, and to change them more rapidly). So, I'm also someone who's been thinking about emotions being a gain changer. It's still a very open topic with lots of room to explore, however. "Reward Value" mapping onto "error reduction" is very simple. To maximise reward is to minimise negative reward (or punishment). Many reinforcement learning experiments actually use negative reward, so it's commonly used, and they aim to minimise that. So, in both cases, minimisation is occurring. It's quite plausible that the "reward value" signal could be considered a perceptual reference to the system, and then this removes a key difference. Obviously, they have huge differences in how they handle the understanding of the world (perceptual signals versus state-based representation).
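
As a toy illustration of that mapping (the function names are just for this sketch), an RL-style reward can be defined as the negative magnitude of a PCT-style error, so that maximising cumulative reward and minimising error coincide:

    # Illustrative only: reward as negative error magnitude.
    def error(reference, perception):
        return reference - perception

    def reward(reference, perception):
        return -abs(error(reference, perception))   # punishment for being off target

    # An agent maximising the (discounted) sum of this reward is thereby pushed
    # to keep its perception near the reference, which is the control criterion.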

EJ: This doesn't sound very "intrinsic" to me. The claim of Reinforcement Learning seemed to be that reward value is an intrinsic property of certain states. Whereas with PCT, the goal / the reference is the source of that desirability. Remove that goal, or choose a different sub-goal, and the supposed 'intrinsic-ness' evaporates. (Upon re-reading your statement, we may be saying essentially the same thing: Goals = Desirability, intrinsically, or we could say, by definition.)

The reward value of the states is not the source of the desirability in Reinforcement Learning; the reward function is. The reward function dictates the desirability of a state. Technically, this does mean the reward value of the state is the motivator (i.e. what is desirable) for the agent, but this doesn't reflect current reinforcement learning work. There are a variety of paradigms of reinforcement learning, many of which have much better results than state-reward-value driven approaches. In particular, some approaches aim to select the best policy which maximises reward, which dodges the whole explicit state-action declaration altogether. To attempt to analogise to PCT, this would be like an individual PCT node learning how to change the gain to best keep the difference between input and reference as low as possible. Q-Learning, which is very popular, gives state-action pairs reward values rather than specific states. This is because, obviously, specific actions may have different values in different states, but also the other way around.
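
For concreteness, the standard tabular Q-learning update over state-action pairs looks roughly like this (a textbook sketch, not code from any particular paper):

    import random
    from collections import defaultdict

    Q = defaultdict(float)             # Q[(state, action)] -> estimated value
    ACTIONS = [0, 1, 2]
    alpha, gamma, epsilon = 0.1, 0.99, 0.1

    def choose_action(state):
        if random.random() < epsilon:
            return random.choice(ACTIONS)                    # explore
        return max(ACTIONS, key=lambda a: Q[(state, a)])     # exploit

    def update(state, action, rwd, next_state):
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (rwd + gamma * best_next - Q[(state, action)])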

Reinforcement learning still suffers from many of the problems it had when it was initially proposed. The curse of dimensionality means that considering every action and state combination is expensive. Discretising the action and state space can be complicated or impossible. Digney's attempt to make it hierarchical was fascinating, and it's definitely the right way to think about things. One of the lead researchers at Google DeepMind was talking to me about how the "curriculum" of deep neural networks is their next hurdle to go over.

I find it amusing that the “bump it upstairs” argument is referenced here, as it genuinely is the answer. “Changing the goal” is nothing more than different high level nodes firing and referencing lower nodes instead of the higher level nodes that relate to the previous goal. A diverse hierarchy not only allows adaptive and varied behaviour, but robust approaches that can handle disruption. The problem is that in traditional approaches, deriving these hierarchies proved to be difficult and often didn’t have anything like the intended positives (see Subsumption Architecture). This is why approaches such as reinforcement learning and deep neural networks have gained traction, using high amounts of data to develop complex linear transforms that allow one to bypass the need for developmental learning of layers of behaviour.

To actually help address the original point, reinforcement learning could be used in a variety of ways. Would it be effective? Probably not. The biggest underlying conceptual difference that I can see is that RL assumes that behaviour is state or action driven, whereas PCT and other closed loop control theories put behaviour as an emergent property of the control system's response to a problem over time. Trying to put action or state spaces into PCT anywhere will be problematic.

Could reinforcement learning be used to learn the value for gains? It could, but it would give poor results and be overkill. Could reinforcement learning combined with Deep Learning act as a reorganiser for a PCT hierarchy? Yes, but I think as alluded to before, it should be clear where reorganising is needed anyway if the rate of error is high. You don't need RL or DNN (Deep Neural Networks) to do that. Could Reinforcement Learning combined with Deep Neural Networks act as a perception generator? That's a much more plausible possibility, but I don't know where you'd start. It could definitely learn to identify patterns in the world that relate to specific problems, and then a PCT hierarchy could incorporate it and minimise error. After all, if you have the right perceptions, HPCT should be more than sufficient. It's where the perceptions come from that I think is the thing PCT doesn't answer.

Hope this helps,

Ben



Hi,

RY: Do such examples actually exist yet? I’ve been thinking about the Mountain Car problem (https://en.wikipedia.org/wiki/Mountain_car_problem) which would seem a good and reasonably simple learning problem, of continuous variables, that PCT should be able to address, and is analogous to your problem above. Do we have a hierarchy of error reducing nodes that can solve that problem?

Nope. Very complicated problem. Would take a long time to derive the hierarchy, it’s got a lot of complex perceptions there.

RY: I guess this RL assumption comes out of the trad AI problem solving legacy which involved discrete state spaces. PCT still needs to provide ways of solving such cases, which would address the high-level reasoning (or program level) aspects of cognition. Perhaps it's a matter of re-framing state spaces and actions as perceptions and goals.

It has one, which is higher levels. I get confused when people laugh at the “bump it upstairs” argument, as it genuinely is the answer. Of course a complex hierarchy is needed to solve complex problems… and it’s no wonder we can’t hand derive it ourselves, it’s obscenely complex! The cost of blasting it with big data is that one doesn’t get the robustness of having the complex layers of behaviour we have. See: Like, any robot, ever. So, sure, it’s a problem that other techniques to some extent have solved, but they haven’t remotely reached a point where their behaviour or learning is robust enough to be let loose.

RY: Yes, I think that although there is some work on learning gains (arm reorg in LCS3), what is lacking is how perceptual functions are learned. Though we should be able to take techniques from neural networks for this; autoencoders or convolutional nns.

RY: On a slightly different matter I came across this paper which uses RL for adaptive control systems, rather than discrete state spaces, which is much closer to what PCT normally addresses.
“Reinforcement learning and optimal adaptive control: An overview and implementation examples”
https://pdfs.semanticscholar.org/b4cf/9d3847979476af5b1020c61367fa03626d22.pdf

RY: I’m still looking at it so not sure what it is doing yet. Are you familiar with this approach?

I am familiar with actor-critic approaches, which is what this is employing through Q-Learning. Actor-Critic is quite nice, so I'll try to give a good high-level description of it.

To clarify, it is not using RL for adaptive control systems instead of discrete state spaces. The critic is using state and action spaces with a value function to advise the actor (which is a PD controller I think?). So, it still uses state action pairings and appropriate values of those to motivate changes in behaviour.

Formally, there are two learning functions in an Actor-Critic model: one learns to optimise the value function (the critic) and one to optimise the policy (the actor). The critic's value estimates are then used to update the policy, producing new behaviour.

Less formally, the critic is learning how to advise the actor on how to do its job. So, the critic may say “do it a little more like this” or “emphasise more on getting this right”, and the actor does so. The actor does a better job, but still has some flaws. So, the critic thinks internally “actually, telling them to do this didn’t work out. I need to rethink what’s worthwhile here…” and then says “Ah, try this instead!”. So, it’s a constant learning process of the critic learning how to criticise (as in, learning what is actually valuable and thus what to tell the actor) and then telling this to the actor who learns how to better solve the problem.
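
A minimal one-step actor-critic sketch, tabular with a softmax policy (the standard textbook construction rather than the exact algorithm in that paper):

    import math, random
    from collections import defaultdict

    V = defaultdict(float)        # critic: state -> value estimate
    theta = defaultdict(float)    # actor: (state, action) -> preference
    ACTIONS = [0, 1]
    alpha_v, alpha_p, gamma = 0.1, 0.05, 0.99

    def policy(state):
        prefs = [theta[(state, a)] for a in ACTIONS]
        m = max(prefs)
        exps = [math.exp(p - m) for p in prefs]
        return [e / sum(exps) for e in exps]        # softmax action probabilities

    def act(state):
        return random.choices(ACTIONS, weights=policy(state))[0]

    def learn(state, action, rwd, next_state):
        td_error = rwd + gamma * V[next_state] - V[state]   # the critic's "advice"
        V[state] += alpha_v * td_error                      # critic update
        probs = policy(state)
        for a in ACTIONS:                                   # actor (policy) update
            grad = (1.0 if a == action else 0.0) - probs[a]
            theta[(state, a)] += alpha_p * td_error * grad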

Actor-critic models are common alongside Q-learning, and somewhat analogous models, in which higher-level transforms learn how to influence lower-level reactive controllers, are also widespread. This is the posited behaviour of the cerebellum (I can throw you a load of work on this). The cerebellum is proposed to be a feed-forward controller on top of a standard closed-loop controller, which allows the closed-loop controller to handle reactive control and the feed-forward controller to minimise expected error (such as anticipating a response). So, models with a reactive controller and some other transform on top of it influencing the reactive layer are becoming very common.
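
A very rough sketch of that arrangement, in the spirit of feedback-error-learning accounts of the cerebellum (an illustration only, not any specific published model):

    # A reactive closed-loop controller plus an adaptive feedforward term.
    # The feedforward weight is trained using the feedback command as its error
    # signal, so over time the anticipatory term takes over work from the loop.

    class FeedbackPlusFeedforward:
        def __init__(self, gain=3.0, lr=0.001):
            self.gain = gain     # reactive feedback gain
            self.w = 0.0         # feedforward weight on the reference signal
            self.lr = lr         # small learning rate for the feedforward term

        def output(self, reference, perception):
            fb = self.gain * (reference - perception)   # reactive command
            ff = self.w * reference                     # anticipatory command
            self.w += self.lr * fb * reference          # train feedforward on fb
            return fb + ff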

Does that make sense? Sorry, felt a bit ramble-y.

Ben


Exactly
