Reinforcement Learning

[Ben Hawker 2017.10.02 15:00]

···
MT: PCT is also an adaptive system in that same sense. Organisms growing up in different environments develop different perceptual functions and connect them to different outputs to produce controllable perceptions that help maintain the intrinsic variables near their optima. A city dweller doesn’t easily perceive deer tracks in the bush or know how to get lunch when he can perceive the tracks; a hunter from the bush doesn’t easily perceive when it is safe to cross the street at a city intersection, or where to get lunch when he can cross the street safely.

BH: Yes, in theory. But are there working algorithms for deriving new perceptions and placing them in hierarchies of control? If so, let me know, as that’s my PhD! As far as I’m aware, that hasn’t been done yet. With the ability to identify new relevant perceptions and place them in the hierarchy, PCT can be considered an adaptive system. Without that, surely it is just reactive?

MT: PCT builds perceptual functions and connects up control units by reorganization rather than reinforcement. The difference between the two concepts is analogous to the difference between “Come here” and “Don’t go there”. Reinforcement says “what you did was good”, whereas reorganization says “What you are doing isn’t bad.” In other analogous mottos, Reinforcement is “Getting better every day”, whereas PCT is “If it ain’t broke, don’t fix it.” There’s no technical reason why both could not work together in building more effective systems, just as in the brain there seem to be mutually supportive systems of perception “It’s like this” and “it’s different from that”, or at higher levels “intuition” and “analysis”.

BH: That’s one of the nicest descriptions of the conceptual differences between reinforcement and reorganization that I’ve seen; I may well pinch that. Given the requirement for us as living agents to control effort, reorganization seems the more sensible approach. However, aren’t we still yet to find something that robustly reorganises the hierarchy? (Insert my PhD work here, but that’s a long story, so I’m sure I can cover that privately if wished!)

MT: Why is that a downside? It's built into the very heart of PCT. There’s no prohibition in PCT against one control system seeing a rock coming at your head and sending a reference signal to a lower-level set of systems to move your head to a position at which the rock is seen to be missing you and going past.

MT: I have no idea how complex, but I do think it needs prior experience in order to reorganize those perceptions and the corresponding controls. Before you have ever been hit by something you could have seen coming while you were balancing, you would not counter it in advance under either adaptive development system.

BH: Without producing new perceptions (which is something that isn’t covered in PCT currently, hence Rupert asking about RL, I assume), an agent wouldn’t be able to solve anticipatory problems. For example, if you were balancing and were warned that you would receive an impact one second later, a cognitive agent could prepare. In short, Perceptual Control Theory allows an agent to minimise the difference between perception and reference (the error)… but currently there is functionally no way for an agent to generate new perceptions from the information it is given. Current reorganization algorithms do handle changing the weights between PCT nodes and the internal gains, but not actually finding new perceptions. As part of my work I argue this belongs to the “generative” side of cognition, which is not something I strictly cover. But without the ability to generate new perceptions, PCT cannot minimise more than what it is programmed to (hence reactive).

BH: So, for example, an agent could learn a perceptual signal for a warning, and then learn that this relates to a temporal property of one second – a relative perception whose expectation rises after a second has elapsed. You could provide the first and the second, and our current reorganization methods may find it. But it cannot make the perception it needs.
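To make that concrete, here is a minimal Python sketch (my own toy illustration, assuming an E. coli-style random-walk scheme; not any published CSGnet implementation) of what the reorganization algorithms Ben describes can and cannot do: they retune the input weights and gain of an existing node, but the form of the perceptual function – a weighted sum of whatever sensors were wired in – is fixed.

```python
import numpy as np

class Reorganizer:
    """Toy E. coli-style reorganization: the input weights and gain of one
    existing control node drift in a random direction, and a new random
    direction is chosen ("tumble") whenever intrinsic error rises.
    It can only retune parameters it already has; it has no way to add a
    new sensor or a new kind of perceptual function."""

    def __init__(self, n_inputs, step=0.05, seed=0):
        self.rng = np.random.default_rng(seed)
        self.weights = self.rng.normal(size=n_inputs)      # perceptual input weights
        self.gain = 1.0                                     # output gain of the node
        self.direction = self.rng.normal(size=n_inputs + 1)
        self.step = step
        self.prev_error = float("inf")

    def perceive(self, sensors):
        # The perceptual function is fixed in form: a weighted sum of given sensors.
        return float(np.dot(self.weights, sensors))

    def update(self, intrinsic_error):
        if intrinsic_error > self.prev_error:               # things got worse: tumble
            self.direction = self.rng.normal(size=self.weights.size + 1)
        self.prev_error = intrinsic_error
        self.weights += self.step * self.direction[:-1]
        self.gain += self.step * self.direction[-1]
```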

MT: Why do you say that? In the book I am currently trying to write (working title “Powers of Perceptual Control” – pun intended), I have two whole chapters on how perceptions of different aspects of language would be expected to develop. Language is a tricky example, which is why I used it. I have also been drafting a response in some detail to your earlier comment “It’s where the perceptions come from that I think is the thing PCT doesn’t answer.”, a comment that greatly surprised me.

BH: Let’s take the previous example. Basically, my point boils down to this: whatever perceptions the system is given make or break its ability to function in a task. Moreover, they also decide how hard it would be to learn to control the problem, if learning is needed at all.

BH: Let’s take an agent that is trying to avoid pain in the classical eye-blink conditioning task. The agent is strapped to a machine which will administer an uncomfortable puff of air into the eye. However, the agent is able to shut its eyes. The agent is presented with a warning stimulus (a sound, perhaps?) which indicates a puff is incoming. Research shows, obviously, that we can learn to solve this task. What would a PCT agent need to solve it? A perception of whether the eyes are closed or open, connected to an actuator that changes the position of the eyelids, would allow higher levels to close or open the eyes. A perception of sound is needed to invoke some preparatory response. A perceptual representation of how long it will be before the puff is also needed, which would basically be our perception of “warning”. Higher-level references for minimising pain and keeping the eyes open (for survival reasons) would then be all that’s needed. So the problem is certainly solvable with PCT. But where do those perceptions come from in the first place? What method tracks them down, identifies which perception is relevant, and then adds it to the hierarchy in the right place? The problem is simple with the right perceptions, but the latter part is easier said than done. How does a cognitive agent find them autonomously?
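Purely for illustration, here is a rough Python sketch of the hierarchy Ben describes for the eye-blink task. Every function, threshold and constant below is a hypothetical hand-wired choice of mine, which is exactly the point at issue: the designer, not the agent, supplies the perceptions.

```python
def eyelid_loop(perceived_lid, lid_reference, gain=5.0):
    """Lowest level: proportional control of eyelid position (0 = open, 1 = closed).
    The output drives the eyelid actuator."""
    return gain * (lid_reference - perceived_lid)

def warning_perception(heard_tone, seconds_since_tone, puff_delay=1.0):
    """Hand-built 'warning' perception: how imminent the puff is, from 0 to 1.
    This is the sort of perception the engineer currently has to supply."""
    if not heard_tone:
        return 0.0
    return max(0.0, min(1.0, seconds_since_tone / puff_delay))

def pain_avoidance_loop(perceived_pain, warning, pain_reference=0.0):
    """Higher level: controls (expected) pain by sending a reference for eyelid
    position downward -- close the eyes as a puff becomes imminent, otherwise
    keep them open (lid reference 0)."""
    expected_pain = perceived_pain + warning      # crude hand-wired predictor
    error = expected_pain - pain_reference
    return max(0.0, min(1.0, error))              # becomes lid_reference below

# One tick of the hierarchy, 0.8 s after the warning tone:
lid_ref = pain_avoidance_loop(perceived_pain=0.0,
                              warning=warning_perception(True, 0.8))
actuator_drive = eyelid_loop(perceived_lid=0.1, lid_reference=lid_ref)
```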

MT: Exactly! That's a feature, not a bug. It enables ranges of behaviour that control ranges of perceptions, some of which serve to maintain the intrinsic variables in good condition, others of which don’t. The latter tend to get pruned, the former tend to stick around. But the word “choose” is unfortunate, as it suggests some directed control process, whereas in PCT it is the effects of interactions with the environment that “choose”, as part of the reorganization process. Or are you talking about designed robots only?

BH: See above. Part of the problem is that there is still no algorithm for generating these perceptions (from raw input to the system, or even as perceptual transformations of other perceptions) and then inserting them into the hierarchy. Am I missing something? Everyone in the PCT community seems to assume there’s a solid way of finding new perceptions and knowing where to add them to the hierarchy (or how to rearrange the hierarchy given them), but I am yet to see this…

MT: What engineer? The Earth Mother who directs evolution? It's true that humans will not find behaviours that require massive electrical impulses that kill dangerous predators, and that eels will not find much use for levers their muscles cannot move. Is that a hit against PCT? How would reinforcement handle those problems?

BH: The engineer who builds the PCT solution… as above, I am yet to find an algorithm that appropriately generates and applies perceptions that haven’t already been explicitly given to the system!

MT: True. The sensor systems must vary what they report to create any and all perceptions that might ever be controlled, and we can be killed by things our sensors do not report, such as X-ray or gamma radiation. Is there a difference between PCT and reinforcement learning in this respect?

BH: Conceptually, RL, in combination with some model base to work from (neural networks), uses big data to find more useful transformations of the data. This allows it to recognise patterns or shapes from raw input data alone, for example. The neural networks on the front effectively act as the pattern generator that produces more complex perceptions. Then, usually, some reactive controller is stuck on the end of that to minimise error on the problem. I assume this is why RY might be quite interested in this… perhaps using RL and DNNs to produce relevant perceptions for a PCT system could be quite effective.
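A sketch of that division of labour, with a stand-in "learned" feature extractor (here just a fixed random projection with a nonlinearity, standing in for a network that RL and deep learning would actually have to train) feeding an ordinary proportional loop; all names and numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 64))              # stand-in for a trained feature extractor;
                                          # in the proposal this is what RL + DNN learns

def learned_perceptions(raw_input):
    """Nonlinear transformation of raw input (e.g. a flattened sensor frame)
    into a few candidate 'perceptions'."""
    return np.tanh(W @ raw_input)

def control_step(perception, reference, gain=2.0):
    """A plain proportional loop then controls one of those learned perceptions."""
    return gain * (reference - perception)

raw = rng.normal(size=64)                 # raw input vector
p = learned_perceptions(raw)
output = control_step(p[0], reference=0.5)
```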

BH: True, we cannot perceive X-ray or gamma radiation… but we can learn about it, which effectively allows us to produce a new internal perceptual hierarchy about how to avoid it even though we can’t directly perceive it. It’s clear we as cognitive agents can learn new perceptions and highlight ones we need. Reorganization is lacking this (which is also part of my PhD).

MT: Reorganization doesn't actually stop, either. It just slows greatly when control is good and the intrinsic variables stay near their optimum values. In everyday terms (as I said above), “If it ain’t broke, don’t fix it.” Put another way, if you are feeling good about yourself and the way things are, you won’t be a revolutionary. But even then, through the small changes allowed by what (using a quantum-physics analogy) we might call “zero-point” reorganization, good things might get even better. If the small zero-point changes make things worse, the tendency is to return them to the way they were, in effect creating a stopping point.

BH: Which I think is a much better approach… agents clearly adopt an “if it ain’t broke, don’t fix it” model. This is not how reinforcement learning works. Furthermore, we’re constantly learning, and reinforcement learning is not temporally stable. That, to me, is one of the biggest conceptual criticisms of it as a realistic learning algorithm for agents. Reorganization is thought about in a temporally continuous manner, but needs additions to handle producing new perceptions. I will say, however, that it’s a seriously complicated problem, but one worth thinking about!

MT: Well, that's exactly what a living control system is supposed to do. It starts life with the systems evolution has provided (humans can’t walk at birth, newborn deer can). Check out the work of Franz Plooij, for example, for stages of human development interpreted as stages in building control hierarchies.

BH: Will take a look, very excited!

Ben

BH: PCT is indeed a reactive controller (which is the useful term I’ve found used) and RL is more of an adaptive function (it learns how to adapt to the environment, but isn’t built on learning the reactive behaviour to simply reduce error, for example). Obviously, there are downsides to both. RL requires a lot of training time to learn adaptive behaviour, and PCT isn’t able to do anticipatory control without extra perceptions and probably a variety of levels to the hierarchy.

BH: For example, anticipating an impact while balancing. Before it hits, one can lean forward to improve one’s balance post-impact. This is an example of something that is not innate to reactive controllers and is thus quite a difficult problem. While solvable under PCT, it definitely isn’t something easy to derive and requires complex perceptions.

BH: So to speak, yes. The behaviour in RL is encoded in the action-state pairs and their weightings, which take ages to compute. In PCT, the behaviour is in minimising the error between the perceptual inputs and the references. The complexity there comes in at what the perceptions are, which the system itself doesn’t handle.

BH: So, what PCT actually does is simpler and more elegant. However, what perceptions you choose for a system massively affect the behaviour!

BH: RL just crunches every possible interaction of action and state with big data, but for PCT, a lot of this has been pruned out simply by what perceptions are selected. PCT will not find behaviour that the engineer doesn’t effectively allow it to have, since the complexity can only be as much as the perceptions it’s given. A system could not consider or control a perception which it doesn’t have as an input.

BH: While RL can, it will take a long time and a lot of good data to crunch this… as well as stopping the learning at the right time, as often it doesn’t know when to stop.

BH: So, to avoid the problem of either the engineer having to hand-select perceptions and order them to produce hierarchies, or leaving big data to crunch it, it would be great if there were some way to have a control system learn how to hierarchically arrange control nodes such that it can learn how to control perceptions. (Hey, that’s my PhD!)

[Ben Hawker 2017.10.02 15:38]

···

BH: In short, I agree with you… Reinforcement learning is not the appropriate approach to control. Reinforcement learning just uses large amounts of data to generate the right perceptions, but the actual “learning” side of it is temporally out of touch. Basically, it’s just complicated maths.

BH: I am certain reinforcement learning has solved the compensatory tracking task, unless I misunderstand the task. The literature is huge… and, to be honest, it’s such a basic problem compared to what is being done now that it is quite difficult to find anything on it. I’m not in the lab so can’t access many papers off my home WiFi (blasted paywalls!), but something I did see recently was a reinforcement-learning-driven deep learning approach playing Atari: https://www.nature.com/nature/journal/v518/n7540/abs/nature14236.html

BH: This not only has to track the objects, which move in an albeit predictable but complicated way, but must then have the time-delayed shot it fires hit the specific space invader it wishes to kill. I can try to find a link to it actually doing compensatory reference tracking, but that will have to wait until I have the time, I’m afraid! I’m struggling to find the time to answer all the emails in this thread…

RM: You may be right. But it wasn’t apparent to me that that was the case. It looked to me like reinforcement learning (which I presume is what Q-learning is) was used to select actions. Producing particular actions is not the same as producing controlled results – results of action that are brought to and maintained in pre-selected states in the face of unpredictable (and usually undetectable) variations in disturbances.

RM: I would be convinced that reinforcement learning could be used to learn to control if I could see a reinforcement learning algorithm used to learn to do a simple control task: the compensatory tracking task: www.mindreadings.com/ControlDemo/BasicTrack.html. We know that the PCT reorganization algorithm can learn to control (see Demo 8 in the list of demos that Bruce Abbott just posted: https://sites.google.com/site/perceptualcontroldemos/home/available-demo-files).
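For readers who have not tried the task, a bare-bones version of compensatory tracking looks like the following sketch (my own toy loop, not Rick's demo or the reorganization algorithm): the system sees only the cursor-target separation and must keep it near zero against an unseen, slowly drifting disturbance. A learning scheme, whether reorganization or RL, would have to discover something functionally like the gain and integrating output below, given only that error.

```python
import numpy as np

def compensatory_tracking(steps=2000, dt=0.01, gain=50.0, seed=0):
    """Minimal compensatory tracking loop: cursor = output + disturbance,
    and the controller integrates gain * error, where error = target - cursor."""
    rng = np.random.default_rng(seed)
    disturbance, output = 0.0, 0.0
    errors = []
    for _ in range(steps):
        disturbance += dt * (5.0 * rng.normal() - 0.5 * disturbance)  # slow random drift
        cursor = output + disturbance
        error = 0.0 - cursor                  # target (reference) is zero
        output += dt * gain * error           # integrating output function
        errors.append(abs(error))
    return float(np.mean(errors))

print(compensatory_tracking())                # small mean error: the loop is controlling
```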

RM: The reason I think reinforcement learning can’t be used to learn control is because the idea of reinforcement learning is based on a misconception about the nature of behavior. The concept of reinforcement is based on the idea that behavior is emitted output; what are called actions, like pressing bars or pecking keys. Reinforcements are consequences of these actions that appear to increase their probability. PCT shows that behavior is a process of control where actions are the means of keeping variables in organism-defined reference states, protected from disturbances. So what have been seen as reinforcements turn out to be controlled variables. These “reinforcements” aren’t strengthening actions; they are being maintained in reference states by those actions (see Yin, H. (2013), Restoring purpose to behavior, in Baldassarre and Mirolli (eds.), Computational and Robotic Models of the Hierarchical Organization of Behavior, Springer-Verlag, Berlin Heidelberg; particularly Figure 5).

RM: Anyway, if you can produce a demonstration of reinforcement learning resulting in learning a control task I will withdraw my criticisms (and be all astonishment, as the “bad” characters in Jane Austen novels always say;-)

[From Erling Jorgensen (2017.10.02 1212 EDT)]

Bruce Abbott (2017.10.01.1645 EDT)

Bruce Abbott (2017.10.01.1050 EDT)

BA: I may already have a demo that “solves” this problem, although it is based on the inverted pendulum rather than the mountain car example. Starting with the pendulum hanging straight down, it brings the pendulum up to the inverted position. The solution involves a very minor change to the original inverted pendulum demo (as featured in LCS III).

BA: I’d be interested in hearing from anyone who has run this magical demo. Without looking at the source code, can you guess how self-inversion was achieved? The solution was so simple I was surprised when it actually worked!

Hi Bruce,

I have not yet looked at the demo or the source code, just being familiar with Bill P’s inverted pendulum demo. But based on your description that it begins with the pendulum hanging straight down, I would guess that the cart/pivot point would rapidly move in one lateral direction the X length of the pendulum, to begin to bring it up an arc of potential energy, then rapidly move 2X laterally in the opposite direction. Once the bob cleared the horizontal baseline, the cart would need to move in the original direction, to try to get under the now inverted pendulum.

Now I’ll go check it out! All the best.

Erling

···


[From Bruce Abbott (2017.10.02.1345 EDT)]

Erling Jorgensen (2017.10.02 1212 EDT) –

Bruce Abbott (2017.10.01.1645 EDT)

Bruce Abbott (2017.10.01.1050 EDT)

BA: I may already have a demo that “solves” this problem, although it is based on the inverted pendulum rather than the mountain car example. Starting with the pendulum hanging straight down, it brings the pendulum up to the inverted position. The solution involves a very minor change to the original inverted pendulum demo (as featured in LCS III).

BA: I’d be interested in hearing from anyone who has run this magical demo. Without looking at the source code, can you guess how self-inversion was achieved? The solution was so simple I was surprised when it actually worked!

EJ: Hi Bruce,

EJ: I have not yet looked at the demo or the source code, just being familiar with Bill P’s inverted pendulum demo. But based on your description that it begins with the pendulum hanging straight down, I would guess that the cart/pivot point would rapidly move in one lateral direction the X length of the pendulum, to begin to bring it up an arc of potential energy, then rapidly move 2X laterally in the opposite direction. Once the bob cleared the horizontal baseline, the cart would need to move in the original direction, to try to get under the now inverted pendulum.

BA: Yes, that describes what the system does; the question is, how does it do it?

Bruce

···


Here is where I see RL being most different from PCT, and Ben summed it up nicely:

BH: The biggest underlying conceptual difference that I can see is that RL assumes that behaviour is state or action driven, whereas PCT and other closed loop control theories put behaviour as an emergent property of the control system’s response to a problem over time. Trying to put action or state spaces into PCT anywhere will be problematic.

HB: In other words, RL, although it is closed-loop, is not a truly dynamic, real-time control system. Actions are selected from a finite set of possible states (which, as mentioned, is cognitively expensive – and for any decently high-level goal, prohibitively so in my opinion). Actions are then completed, and only upon completion is the value function updated. The model suffers from exactly the same limitations as Dickinson and Balleine’s Associative Cybernetic Model (also an action-selection model), and in my opinion that is the necessity of having separate, unique things called “actions”, the instructions for which presumably need to be stored somewhere.
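A tiny sketch of the "only upon completion" property Heather points to, in its simplest (every-visit Monte Carlo) form: nothing is learned until the whole episode of actions has finished, in contrast to a control loop that corrects on every tick. Constants and names are illustrative.

```python
def monte_carlo_update(values, episode, alpha=0.1, gamma=0.99):
    """Every-visit Monte Carlo value update: 'episode' is a completed list of
    (state, reward) pairs, and only after it ends are the value estimates moved
    toward the observed returns."""
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G                      # return from this state onward
        v = values.get(state, 0.0)
        values[state] = v + alpha * (G - v)
    return values

# values = monte_carlo_update({}, [("s0", 0.0), ("s1", 0.0), ("s2", -1.0)])
```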

···

On Thu, Sep 28, 2017 at 6:50 PM, B Hawker bhawker1@sheffield.ac.uk wrote:

Hi all,

Thought I’d pitch in as someone working on reorganisation of hierarchies and also have studied reinforcement learning.

Well done on your find there, Rupert! Digney’s work is hard to track down. His work past about 2001 is near impossible to find due to him working in the military…

Some observations, I’m afraid not in chronological order…

EJ: This is one of those extra intermediary concepts that takes on a driving force in RL. A more direct link that fills this role in PCT is the one listed above: reduction of error in attaining a goal. I don’t see that there is a need for “intrinsic desirability” of various states. What makes something desirable is attaching a goal to it. When the goal changes, it is no longer desirable.

BH: The purpose of the state-based approach is to allow it to solve problems that cannot be solved by simple error reduction with a single controller. For example, reaching an exit point of a room where one must first bypass a wall, which involves going in the wrong direction. Simply trying to get as close to the exit as possible would fail, hence the weighting of states allows it to bypass this problem by learning that the gap(s) in the wall are key places to reach in the trajectory. Do I think this is a sensible approach? Not necessarily at all. A properly structured hierarchy of error-reducing nodes should solve the problem. Furthermore, as you stated, when the goal changes, it’s no longer desirable… and reinforcement learning collapses here. A robust hierarchy of error-reducing controllers does not suffer in the same way. However, engineers in many cases do not have the patience or skill to design appropriately robust hierarchies. Therefore, reinforcement learning.
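The wall-and-exit example can be made concrete with a toy gridworld (my own construction, not from the thread): greedily reducing straight-line distance to the exit runs into the wall, whereas values computed over states route the agent "the wrong way" first, through the gap. This is the sense in which RL weights states rather than just minimising one distance.

```python
# Toy gridworld: 5x5 grid, exit at (0, 4), a wall down column 2 with a single
# gap at (4, 2). Greedily reducing straight-line distance from the start (0, 0)
# runs into the wall; values computed over states send the agent down to the
# gap and back up to the exit.
ROWS, COLS, EXIT = 5, 5, (0, 4)
WALL = {(r, 2) for r in range(4)}            # gap left open at (4, 2)

def neighbours(cell):
    r, c = cell
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < ROWS and 0 <= nc < COLS and (nr, nc) not in WALL:
            yield (nr, nc)

def value_iteration(gamma=0.9, sweeps=100):
    """Assign each free state a value: step cost -1 plus discounted best neighbour value."""
    V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS) if (r, c) not in WALL}
    for _ in range(sweeps):
        for s in V:
            if s != EXIT:
                V[s] = max(-1.0 + gamma * V[n] for n in neighbours(s))
    return V

V = value_iteration()
# Acting greedily on V from (0, 0) passes through the gap (4, 2) on the way to EXIT,
# even though that initially increases the straight-line distance to the exit.
```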

EJ: I believe there is room for PCT to entertain the notion of a gain-adjusting system, which I connect with the emotion system. This would be driven by the rate of change in error, (does that make it a first derivative?) I am not sure how broadly error would be sampled for this. I don’t know whether the Amygdala is a possible site for such sampling. But it does seem that humans and seemingly other animals utilize broad hormonal effects to adjust the gain of their control systems. Because certain chemicals may lean the system in certain directions, this may be one place where the RL notion of “reward value” could map onto a PCT understanding of “error reduction.” However, I still don’t like the tendency of reward to become a “dormitive principle,” as Bruce Abbott (2017.09.14.2005 EDT) pointed out.

BH: I’ve been thinking (and would definitely test, if I had time in my PhD) about the notion of stress being a motivator in changing the “gains” of a control system (as the system gets “stressed”, it is more likely to change gains, and more rapidly). So I’m also someone who’s been thinking about emotions being a gain changer. It’s still a very open topic with lots of room to explore, however. Mapping “reward value” onto “error reduction” is very simple. To maximise reward is to minimise negative reward (or punishment). Many reinforcement learning experiments actually use negative reward, so it’s commonly used, and they aim to minimise that. So, in both cases, minimisation is occurring. It’s quite plausible that the “reward value” signal could be considered a perceptual reference for the system, and then this removes a key difference. Obviously, they have huge differences in how they handle the understanding of the world (perceptual signals versus state-based representation).
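The numerical mapping Ben mentions really is a one-liner: a (negative) reward that is maximised exactly when the control error magnitude is minimised (illustrative sketch only).

```python
def reward_from_error(perception, reference):
    """Maximising this reward is the same as minimising the PCT error magnitude."""
    return -abs(reference - perception)
```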

EJ: This doesn’t sound very “intrinsic” to me. The claim of Reinforcement Learning seemed to be that reward value is an intrinsic property of certain states. Whereas with PCT, the goal / the reference is the source of that desirability. Remove that goal, or choose a different sub-goal, and the supposed ‘intrinsic-ness’ evaporates. (Upon re-reading your statement, we may be saying essentially the same thing: Goals = Desirability, intrinsically, or we could say, by definition.)

The reward value of the states is not the source of the desirability in Reinforcement Learning; the reward function is. The reward function dictates the desirability of a state. Technically, this does mean the reward value of the state is the motivator (i.e. what is desirable) for the agent, but this isn’t reflective of current reinforcement learning work. There are a variety of paradigms of reinforcement learning, many of which have much better results than state-reward-value-driven approaches. In particular, some approaches aim to select the best policy that maximises reward, which dodges the whole explicit state-action declaration altogether. To attempt an analogy to PCT, this would be like an individual PCT node learning how to change its gain to best keep the difference between input and reference as low as possible. Q-learning, which is very popular, gives reward values to state-action pairs rather than to specific states. This is because, obviously, specific actions may have different values in different states, and vice versa.
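For reference, the tabular Q-learning update mentioned above, in its standard textbook form (Q is a states-by-actions array; alpha and gamma are the usual step size and discount), together with the epsilon-greedy behaviour policy typically paired with it:

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Nudge the value of state-action pair (s, a) toward the reward received
    plus the best value obtainable from the next state."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

def epsilon_greedy(Q, s, epsilon, rng):
    """Behaviour policy: usually take the highest-valued action, sometimes explore."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))
```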

Reinforcement learning still suffers from many of the problems it had when it was initially proposed. The curse of dimensionality means that considering every action and state combination is expensive. Discretising the action and state space can be complicated or impossible. Digney’s attempt to make it hierarchical was fascinating, and it’s definitely the right way to think about things. One of the lead researchers at Google DeepMind was talking to me about how the “curriculum” of deep neural networks is their next hurdle to get over.

I find it amusing that the “bump it upstairs” argument is referenced here, as it genuinely is the answer. “Changing the goal” is nothing more than different high level nodes firing and referencing lower nodes instead of the higher level nodes that relate to the previous goal. A diverse hierarchy not only allows adaptive and varied behaviour, but robust approaches that can handle disruption. The problem is that in traditional approaches, deriving these hierarchies proved to be difficult and often didn’t have anything like the intended positives (see Subsumption Architecture). This is why approaches such as reinforcement learning and deep neural networks have gained traction, using high amounts of data to develop complex linear transforms that allow one to bypass the need for developmental learning of layers of behaviour.

To actually help address the original point, reinforcement learning could be used in a variety of ways. Would it be effective? Probably not. The biggest underlying conceptual difference that I can see is that RL assumes that behaviour is state or action driven, whereas PCT and other closed loop control theories put behaviour as an emergent property of the control system’s response to a problem over time. Trying to put action or state spaces into PCT anywhere will be problematic.

Could reinforcement learning be used to learn the values for gains? It could, but it would give poor results and be overkill. Could reinforcement learning combined with deep learning act as a reorganiser for a PCT hierarchy? Yes, but as alluded to before, it should be clear where reorganisation is needed anyway if the rate of error is high. You don’t need RL or DNNs (deep neural networks) to do that. Could reinforcement learning combined with deep neural networks act as a perception generator? That’s a much more plausible possibility, but I don’t know where you’d start. It could definitely learn to identify patterns in the world that relate to specific problems, and then a PCT hierarchy could incorporate them and minimise error. After all, if you have the right perceptions, HPCT should be more than sufficient. It’s where the perceptions come from that I think is the thing PCT doesn’t answer.

Hope this helps,

Ben

On 28 September 2017 at 13:33, Erling Jorgensen EJorgensen@riverbendcmhc.org wrote:

[Erling Jorgensen (2017.09.28 0739 EDT)]

Rupert Young (2017.09.28 10.30)

Hi Rupert,

Thanks for the reply. I was wondering about your reactions to what I wrote.

RY: Well, I think “intrinsic desirability” is related to PCT goals in that certain states represent sub-goals on the way to achieving a man [main?] goal, and some states will help attain that better than others. However, this may only be valid in well-defined, constrained environments such as games where you are dealing with discrete states rather than dynamic environments where the reference values continually change.

EJ: This doesn’t sound very “intrinsic” to me. The claim of Reinforcement Learning seemed to be that reward value is an intrinsic property of certain states. Whereas with PCT, the goal / the reference is the source of that desirability. Remove that goal, or choose a different sub-goal, and the supposed ‘intrinsic-ness’ evaporates. (Upon re-reading your statement, we may be saying essentially the same thing: Goals = Desirability, intrinsically, or we could say, by definition.)

EJ: You had asked about the overlap or differences of a mapping of Reinforcement Learning onto Perceptual Control Theory. At present in PCT we do not have a good switching mechanism to see the change of goals in action, other than ‘Bump it Upstairs’ with postulating a higher level reference standard. We also view a Program level perception as a network of contingencies that, once a higher level goal has been determined, can navigate among descending reference specifications, depending on the perceptual results achieved so far (i.e., those are the contingency nodes.)

EJ: It is possible that “the four main elements of a reinforcement learning system” that you outlined [in Rupert Young (2017.09.26 9.45)] could fill that gap. RL includes a mapping from goal-related environmental states, to reward values, to expected accumulation of reward values, to planning among action options. So there are inserted two potentially measurable functions – reward values and expected reward values – to guide selection of goals and sub-goals.

EJ: My proposal and preference is for a gain-adjusting system, (where a zero level of gain can suspend a goal). I believe that error-reduction in PCT already fills that intermediary role that RL assigns to rewards and the expected value of accumulating those rewards. But PCT still needs to have some mechanism to switch goals on and off. It is not yet very specified to say, in effect, “the HPCT hierarchy does it.” Moreover, we already see (perhaps as a mixture of theory and phenomenon) that goals are carried out with various levels of gain. So PCT has the problem of discerning a mechanism for adjusting gain anyway. I would like to see a more direct tackling of that issue, rather than importing concepts (and perhaps assumptions) from Reinforcement Learning.
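One possible reading of Erling's gain-adjusting proposal, sketched in code. This is my own guess at a minimal form (names and constants are assumptions, not a worked-out PCT extension): the loop gain is driven by the rate of change of error, and clamping it at zero effectively suspends the goal.

```python
def adjust_gain(gain, error, prev_error, dt, rate=0.5, gain_max=10.0):
    """Drive the loop gain with the rate of change of |error| (the first derivative
    Erling asks about): growing error raises gain, shrinking error lowers it.
    A gain clamped at zero effectively switches the goal off."""
    d_abs_error = (abs(error) - abs(prev_error)) / dt
    gain += rate * d_abs_error * dt
    return min(gain_max, max(0.0, gain))
```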

EJ: That’s my take on it, at any rate. All the best.

Erling



Heather C. Broccard-Bell, Ph.D.

Postdoctoral Fellow

University of California San Diego

Honey Bee Communication, Nieh Lab

Division of Biological Sciences

Section of Ecology, Behavior, and Evolution

Lecturer

University of San Diego; University of California San Diego

Departments of Psychological Sciences (Psychology); Departments of Biology

619.757.4694

[Ben Hawker 2017.10.03 00:33]

···

BH: If one assumes the inverted pendulum problem is analogous to the compensatory tracking task, then see the link below. Otherwise, do explain the difference and I’ll try to forage for something else!

RM: Anyway, if you can produce a demonstration of reinforcement learning resulting in learning a control task I will withdraw my criticisms (and be all astonishment, as the “bad” characters in Jane Austen novels always say;-)

http://ieeexplore.ieee.org/abstract/document/24809/?reload=true

[Martin Taylor 2017.10.02.23.03]

[Ben Hawker 2017.10.03 00:33]

Do you have a pdf you could send to the group? The link allows access only to the abstract.

Martin
···

BH: If one assumes the inverted pendulum problem is analogous
to the compensatory tracking task, then see the link below.
Otherwise, do explain the difference and I’ll try forage for
something else!

                  RM: Anyway, if you can produce a demonstration

of reinforcement learning resulting in learning a
control task I will withdraw my criticisms (and be
all astonishment, as the “bad” characters in Jane
Austen novels always say;-)

http://ieeexplore.ieee.org/abstract/document/24809/?reload=true

[From Rick Marken (2017.10.02.2240)]

···

On Mon, Oct 2, 2017 at 3:43 PM, Heather Bell heather.bell@uleth.ca wrote:

Hi Heather.

HB: In other words, RL, although it is closed-loop, is not a truly dynamic, real-time control system. Actions are selected from a finite set of possible states (which, as mentioned, is cognitively expensive – and for any decently high-level goal, prohibitively so in my opinion). Actions are then completed, and only upon completion is the value function updated. The model suffers from exactly the same limitations as Dickinson and Balleine’s Associative Cybernetic Model (also an action-selection model), and in my opinion that is the necessity of having separate, unique things called “actions”, the instructions for which presumably need to be stored somewhere.

RM: I think this is basically the right idea. Reinforcement learning is based on the mistaken idea that behavior is caused (or selected) output; PCT shows that behavior is actually controlled input. The appearance of consequences selecting (or strengthening) the behavior of a control system is an illusion. So reinforcement can’t possibly be the way control systems learn. In a control system, what is seen as reinforcement is just an aspect of an input variable that is being controlled – a controlled variable – such as grams of reinforcement/sec.

RM: Again, I recommend Yin, H. (2013), Restoring purpose to behavior, in G. Baldassarre and M. Mirolli (eds.), Computational and Robotic Models of the Hierarchical Organization of Behavior – particularly the discussion of the data in Figure 5 – to see that reinforcement doesn’t select action but, rather, is selected by action (controlled). My “Selection of consequences” demo (http://www.mindreadings.com/ControlDemo/Select.html) is relevant here as well. As is Powers’ lovely “Feedback model of behavior” paper, reprinted starting on p. 47 of LCS I.

RM: But I suppose this will be written off as Rick’s PCT and many people – even fans of PCT – will continue to chase the illusion that reinforcement learning is relevant to the development of controlling. C’est la vie.

Best

Rick


Richard S. Marken

"Perfection is achieved not when you have nothing more to add, but when you
have nothing left to take away.�
                --Antoine de Saint-Exupery


[From Rick Marken (2017.10.02.2250)]

···

Ben Hawker (2017.10.03 00:33)

RM: The inverted pendulum is, indeed, a control problem. And if they really have a reinforcement learning algorithm that develops the system that controls the pendulum, then I will have to accept the fact that reinforcement can be used to develop control systems and I’ll have to figure out why I (and Bill Powers) had it wrong. But I can’t do it without seeing the whole paper, so I agree with Martin that we need a copy of the paper to see what’s going on.

RM: But I still would prefer to see reinforcement learning develop a control system that does the compensatory tracking task. It’s a lot simpler (and, therefore, easier to understand) than the inverted pendulum. And if reinforcement learning can build a control system that can balance an inverted pendulum, developing a system that can control a cursor in a compensatory tracking task should be a piece of cake.

Best regards

Rick

BH: If one assumes the inverted pendulum problem is analogous to the compensatory tracking task, then see the link below. Otherwise, do explain the difference and I’ll try to forage for something else!

RM: Anyway, if you can produce a demonstration of reinforcement learning resulting in learning a control task I will withdraw my criticisms (and be all astonishment, as the “bad” characters in Jane Austen novels always say;-)

http://ieeexplore.ieee.org/abstract/document/24809/?reload=true

Richard S. Marken

"Perfection is achieved not when you have nothing more to add, but when you
have nothing left to take away.�
                --Antoine de Saint-Exupery

[Ben Hawker 2017.10.03 10:44]

RLNeuralNetPendulum.pdf (776 KB)

···
Do you have a pdf you could send to the group? The link allows access only to the abstract.

BH: Sorry about that! Find attached.

Just to reiterate what I said earlier, I think that the primitives of our perceptual references are innate. So RL tries to learn functions that are surely evolved and not learned? Therefore we need a theory of how innate PCT systems are evolved and we need a refinement of how reorganisation forms novel perceptual functions from innate ones?

Warren

···
Do you have a pdf you could send to the group? The link allows access only to the abstract.

BH: Sorry about that! Find attached.

[Martin Taylor.2017.10.03.10.39]

If I understand this paper correctly, the problem posed is one I have not seen addressed by the PCT community.

The cart and pendulum are the usual, but the controller cannot see the pendulum at all. The only perceptual connection between the controller and the pendulum is the bang when the pendulum falls (I guess there’s some kind of a wall affixed to the cart into which the falling pendulum bangs, because “failure” is defined as an angle greater than 12° from vertical). The outputs available are also limited to applying a fixed force to the cart either leftward or rightward. An unmoving cart is not allowed, so if the pendulum starts off vertical, the cart movement disturbs it. The controller can sense time elapsed, such as between the last move left or right and the moment of failure.

Even if the algorithm can lead to a sequence of actions that can keep an originally vertical inverted pendulum from falling, I cannot see how it could possibly act to counter a disturbance from a source such as a gusty wind if it cannot perceive anything of the pendulum other than that it has or has not fallen. On those grounds, I would treat it as a system that can learn a maze in space-time, but not a control system that could function in a dynamic universe.

Since, if I understand correctly, the system described in the paper does not control, I do not think it solves the problem posed by Ben, of developing appropriate perceptual control functions for doing what Bruce and Bill did for LCS III or that Bruce has now done in extending that demo.

Ben’s question can be asked in two distinct ways.

(1) In a Universe that contains only the inverted pendulum problem, how would perceptual functions be discovered, the control of which would solve the problem?

(2) In a Universe in which real organisms live, what perceptual control systems are likely to have been created by normal interactions with the environment that would be useful together in solving the inverted pendulum problem?

I think an answer to the second form of the question would be of more general use than an answer to the first, though I imagine that an answer to the first might well be possible, since the structure of the perceiving part of a control hierarchy is that of a neural network and each layer in the network is potentially available as a set of controllable perceptions using either the ability to push the cart in either direction or the ability to provide a reference value for a lower-level controlled perception.

Martin

···

On 2017/10/2 7:33 PM, B Hawker wrote:

[Ben Hawker 2017.10.03 00:33]

                  RM: Anyway, if you can produce a demonstration

of reinforcement learning resulting in learning a
control task I will withdraw my criticisms (and be
all astonishment, as the “bad” characters in Jane
Austen novels always say;-)

      BH: If one assumes the inverted pendulum problem is analogous

to the compensatory tracking task, then see the link below.
Otherwise, do explain the difference and I’ll try forage for
something else!

http://ieeexplore.ieee.org/abstract/document/24809/?reload=true

[Ben Hawker 2017.10.03 16:43]

···
If I understand this paper correctly, the problem posed is one I have not seen addressed by the PCT community.

The cart and pendulum are the usual, but the controller cannot see the pendulum at all. The only perceptual connection between the controller and the pendulum is the bang when the pendulum falls (I guess there’s some kind of a wall affixed to the cart into which the falling pendulum bangs, because “failure” is defined as an angle greater than 12° from vertical). The outputs available are also limited to applying a fixed force to the cart either leftward or rightward. An unmoving cart is not allowed, so if the pendulum starts off vertical, the cart movement disturbs it. The controller can sense time elapsed, such as between the last move left or right and the moment of failure.

Even if the algorithm can lead to a sequence of actions that can keep an originally vertical inverted pendulum from falling, I cannot see how it could possibly act to counter a disturbance from a source such as a gusty wind if it cannot perceive anything of the pendulum other than that it has or has not fallen. On those grounds, I would treat it as a system that can learn a maze in space-time, but not a control system that could function in a dynamic universe.

Since, if I understand correctly, the system described in the paper does not control, I do not think it solves the problem posed by Ben, of developing appropriate perceptual control functions for doing what Bruce and Bill did for LCS III or that Bruce has now done in extending that demo.

Ben’s question can be asked in two distinct ways.

(1) In a Universe that contains only the inverted pendulum problem, how would perceptual functions be discovered, the control of which would solve the problem?

(2) In a Universe in which real organisms live, what perceptual control systems are likely to have been created by normal interactions with the environment that would be useful together in solving the inverted pendulum problem?

I think an answer to the second form of the question would be of more general use than an answer to the first, though I imagine that an answer to the first might well be possible, since the structure of the perceiving part of a control hierarchy is that of a neural network and each layer in the network is potentially available as a set of controllable perceptions using either the ability to push the cart in either direction or the ability to provide a reference value for a lower-level controlled perception.

BH: I do not believe you’re correct. Quote from the paper:

“A traditional control approach when the
dynamic equations of motion are known is
to assume that the control force F, is a linear
function of the four state variables…”

The four state variables are presumably angle, angular velocity, cart position and cart velocity, which are the standard ones used in many solutions… including PCT. You can clearly see these in Figure 2 going into the state space quantization section of the diagram. This means that the “state” is made of those four input quantities, and the action is some function which manipulates those to produce an output force (for example, balancing four gains which attach to the four state values). Because it can read the angle and angular velocity, surely it can “see” the pendulum? Am I misunderstanding what you mean by “see” the pendulum?

As for the punishment signal, indeed it is a simple binary failure signal. I guess this is the equivalent of having a reference at the top of the hierarchy of “fallen or not fallen”. It has, however, been done in lots of ways, as they indicated in the paper. Approaches that set a reference state of 0,0,0,0 for the four parameters were mentioned in that paper. Mathematically, that’s identical to PCT as long as the top reference is 0, so that one may be worth a look. However, there are lots of solutions, particularly more recent ones that presumably solve far more complicated versions of the problem. A Google Scholar search for “"Inverted Pendulum" reinforcement learning” produces hundreds of results. I’m afraid I really don’t have time to provide a full literature survey on this, but I highly recommend examining the one with four 0s as the desired state (i.e. the reference).
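To make the “reference of 0,0,0,0” reading concrete, here is a minimal sketch in Python, assuming the four state variables above. With all references at zero it reduces to ordinary linear state feedback, which is the sense in which the two descriptions coincide; the gains and sign conventions are hypothetical placeholders, not values from Anderson’s paper or from any PCT demo.

    # Sketch only: four references fixed at zero, one gain per state variable.
    # Gains and sign conventions are illustrative and would need tuning for a
    # particular cart-pole simulation.
    references = [0.0, 0.0, 0.0, 0.0]   # angle, angular velocity, cart position, cart velocity
    gains      = [20.0, 3.0, 1.0, 1.5]  # hypothetical values

    def control_force(state):
        """state = [theta, theta_dot, x, x_dot]; returns the force applied to the cart."""
        errors = [r - s for r, s in zip(references, state)]
        return sum(g * e for g, e in zip(gains, errors))

    # With zero references the errors are just the negated state variables, so
    # this is the linear state feedback the quoted passage describes.
    print(control_force([0.05, 0.0, 0.0, 0.0]))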

[From Rick Marken (2017.10.03.1050)]

···

Ben Hawker (2017.10.03 10:44)

RM: Thanks for this. I haven’t had time to read this in detail but it looks to me like this may be a reinforcement learning model in name only. At first glance it looks like the reorganization model of PCT. This was suggested to me when I happened by chance to notice this statement:

Do you have a pdf you could send to the group? The link allows

access only to the abstract.

BH: Sorry about that! Find attached.

Initial values of the weights are zero, making the two actions equally probable. The action unit learns via a reinforcement learning method. It tries actions at random and makes incremental adjustments to its weights, and, thus, to its action probabilities, after receiving nonzero reinforcements.

RM: So parameters (weights) are being adjusted continuously based on the consequences of random variations in output (actions). The adjustment is based on whether there is a nonzero reinforcement, which sounds a lot like an error signal. The diagram shows that this error signal (called a Failure Signal) comes from the inverted pendulum system, which suggests that the effectiveness of controlling the pendulum with the current control parameters is being evaluated.

RM: If this “reinforcement learning” system is, indeed, functionally equivalent to the PCT reorganizing system then it is not a reinforcement system, in the sense that its actions are not selected by its consequences; it is actually a control system controlling the time between failure signals – the controlled variable. But I’ll look at it more carefully when I get a chance.
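For what it is worth, here is a minimal sketch of the functional reading RM gives above; it is not Anderson’s actual algorithm. Parameters are perturbed at random, and a change is kept only when the time between failure signals, treated as the controlled variable, does not get worse. The run_trial function is an assumed stub standing in for a cart-pole simulation.

    import random

    def reorganize(weights, run_trial, steps=2000, step_size=0.05):
        """E. coli-style reorganization sketch: run_trial(weights) is assumed
        to return the time until the next failure signal (longer is better)."""
        best_time = run_trial(weights)
        for _ in range(steps):
            candidate = [w + random.gauss(0.0, step_size) for w in weights]
            t = run_trial(candidate)
            if t >= best_time:   # "what you are doing isn't bad": keep the change
                weights, best_time = candidate, t
        return weights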

Best regards

Rick


Richard S. Marken

"Perfection is achieved not when you have nothing more to add, but when you
have nothing left to take away."
                --Antoine de Saint-Exupery

This is a great thread.

MT: “In a Universe in which real organisms live, what perceptual control systems are likely to have been created by normal interactions with the environment that would be useful together in solving the inverted pendulum problem?”

This is what I meant in part by existing perceptual functions (and control system architecture) being prior (pre-evolved) to the learning of a new problem for a real living organism…


[From Rupert Young (2017.10.04 9.10)]

[Ben Hawker 2017.10.02 15:00]

I'd say that, even without learning, a PCT system is inherently

adaptive, in that it adapts to what the world throws at it, in terms
of unknown disturbances. This is because it is more than just
reactive; it is a control system, controlling a variable. S-R
systems, on the other hand, such as Braitenberg’s vehicles, are
reactive, but are not control systems.

Rupert
···
                            MT: PCT is also an adaptive system in that same

sense. Organisms growing up in different environments
develop different perceptual functions and connect them
to different outputs to produce controllable perceptions
that help maintain the intrinsic variables near their
optima. a city dweller doesn’t easily perceive deer
tracks in the bush or know how to get lunch when he can
perceive the tracks, a hunter from the bush doesn’t
easily perceive when it is safe to cross the street at a
city intersection, or where to get lunch when he can
cross the street safely.

          BH:  Yes, in theory. But, are there working algorithms

for deriving new perceptions and placing them in
hierarchies of control? If so, let me know, as that’s my
PhD! As I’m aware, that hasn’t been done yet. With the
ability to identify new relevant perceptions and place
them in the hierarchy, it can be considered an adaptive
system. Without that, surely it is just reactive?

                        BH: PCT is indeed a reactive controller

(which is the useful term I’ve found used)
and RL is more of an adaptive function (it
learns how to adapt to the environment,

[From Bruce Abbott (2017.10.03.0935 EDT)]

Reinforcement learning algorithms, such as actor-critic, that were developed for real-time engineering applications appear to require that rather complex mathematical computations be carried out at high speed. Such algorithms are unlikely to be employed by the neuronal wetware of biological organisms, a point that Bill Powers often drove home. Organisms may employ relatively simple methods that, while not perfect, accomplish much the same end while usually not overtaxing the neural machinery. The reorganization process appears to be such a method, although, as I have noted, as currently envisioned it lacks a mechanism for limiting reorganization to those systems that need to be reorganized and to the “right” parts of those systems (input function, output function, system gain, etc.).

A biologically feasible reinforcement learning mechanism certainly will be more complex than ecoli-style reorganization. At a minimum it requires some means to associate the organism’s ongoing behavior with the so-called reinforcing consequence. The biological RL system must isolate which of its activities, if any, resulted in the occurrence of the reinforcing event, a problem known as the “assignment of credit” problem. Given that the effective act and the reinforcing event may be separated in time, such a system needs the ability to “look back” at least a few seconds – effectively to perform the equivalent of autocorrelation. This requires at least a short-term memory for recent experience; typically this involves a delay discounting function.
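As a rough illustration of the “look back” machinery BA describes, here is a sketch of an eligibility trace with delay discounting; the decay and learning-rate constants are arbitrary choices for illustration, not biological estimates.

    def update_credit(credit, eligibility, recent_activity, reinforcement,
                      decay=0.9, rate=0.1):
        """Sketch of credit assignment over a decaying short-term memory."""
        for act in list(eligibility):
            eligibility[act] *= decay    # older activity fades (delay discounting)
        for act in recent_activity:
            eligibility[act] = eligibility.get(act, 0.0) + 1.0
        if reinforcement != 0.0:         # spread credit over whatever is still eligible
            for act, e in eligibility.items():
                credit[act] = credit.get(act, 0.0) + rate * reinforcement * e
        return credit, eligibility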

I am not familiar with current machine reinforcement learning efforts, so perhaps they converge on a solution more rapidly than the method described in Anderson’s 1989 inverted pendulum paper that Ben Hawker kindly provided. That method required thousands of trials to approach asymptotic performance. Biological systems clearly operate much more efficiently in finding the right behavior, at least when delays and other possible sources of confusion are not too significant.

Bruce

[From Ben Hawker 2017.10.05 10:49]

···


BA: Reinforcement learning algorithms, such as actor-critic, that were developed for real-time engineering applications appear to require that rather complex mathematical computations be carried out at high speed. Such algorithms are unlikely to be employed by the neuronal wetware of biological organisms, a point that Bill Powers often drove home. Organisms may employ relatively simple methods that, while not perfect, accomplish much the same end while usually not overtaxing the neural machinery. The reorganization process appears to be such a method, although, as I have noted, as currently envisioned it lacks a mechanism for limiting reorganization to those systems that need to be reorganized and to the “right” parts of those systems (input function, output function, system gain, etc.).

BH: Agreed, yes. They produce a model of the behaviour of various parts of the brain (the basal ganglia), but this doesn’t mean they actually produce anything resembling that behaviour at a lower level, or during learning. So, I have my doubts about reinforcement learning as a function of how the brain works.


BA: A biologically feasible reinforcement learning mechanism certainly will be more complex than ecoli-style reorganization. At a minimum it requires some means to associate the organism’s ongoing behavior with the so-called reinforcing consequence. The biological RL system must isolate which of its activities, if any, resulted in the occurrence of the reinforcing event, a problem known as the “assignment of credit” problem. Given that the effective act and the reinforcing event may be separated in time, such a system needs the ability to “look back” at least a few seconds – effectively to perform the equivalent of autocorrelation. This requires at least a short-term memory for recent experience; typically this involves a delay discounting function.


BA: I am not familiar with current machine reinforcement learning efforts, so perhaps they converge on a solution more rapidly than the method described in Anderson’s 1989 inverted pendulum paper that Ben Hawker kindly provided. That method required thousands of trials to approach asymptotic performance. Biological systems clearly operate much more efficiently in finding the right behavior, at least when delays and other possible sources of confusion are not too significant.

BH: There are some that can learn in as little as 12 trials, but often the maths is obscenely complex and convoluted. A further criticism of reinforcement learning and neural networks is that the learning must be manually stopped once it has optimised, to avoid “overtraining”. Not very realistic or robust…


Bruce


[From Rupert Young (2017.10.05 22.00)]

One reason I am interested in this is that I think artificial PCT

systems will need to embody learning to progress beyond fairly basic
systems. This requires a formal definition of PCT learning, which,
unfortunately, is currently lacking. I think it will also be
important for PCT to be taken seriously by the mainstream.
Although it may be correct that learning methods, such as RL,
have substantial drawbacks, they do have a formal definition. They
also appear to be quite successful in some domains of constrained
states and discrete actions, such as games. It would be useful to understand why they are successful. It would
also be great if we could formulate a definition of PCT learning not
only for continuous variable control systems, but also for the above
game-type scenarios.
So, any thoughts on how PCT could be applied to learning, or even
just playing, games would be of great interest.
Rupert

···

On 02/10/2017 23:43, Heather Bell
wrote:

      Here is where I see RL being most different from PCT, and

Ben summed it up nicely:

BH: The biggest underlying
conceptual difference that I can see is that RL assumes that
behaviour is state or action driven, whereas PCT and other
closed loop control theories put behaviour as an emergent
property of the control system’s response to a problem over
time. Trying to put action or state spaces into PCT anywhere
will be problematic.

        HB:  In

other words, RL, although it is closed-loop, is not truly
a dynamic, real-time control system. Actions are
selected from a finite set of possible states (which, as
mentioned, is cognitively expensive – and for any decently
high-level goal, prohibitively so in my opinion). Actions
are then completed, and only upon completion is the value
function updated. The model suffers from exactly the same
limitations as Dickinson and Balleine’s Associative
Cybernetic Model (also an action-selection model), and in my
opinion, that is the necessity of having separate, unique
things called “actions”, the instructions for which
presumably need to be stored somewhere.

      On Thu, Sep 28, 2017 at 6:50 PM, B

Hawker bhawker1@sheffield.ac.uk
wrote:

Hi all,

            Thought I'd pitch in as someone working on

reorganisation of hierarchies and also have studied
reinforcement learning.

            Well done on your find there, Rupert! Digney's work

is hard to track down. His work past about 2001 is near
impossible to find due to him working in the military…

            Some observations, I'm afraid not in chronological

order…

                EJ:  This is one of

those extra intermediary concepts that takes on a
driving force in RL. A more direct link that fills
this role in PCT is the one listed above: reduction
of error in attaining a goal. I don’t see that
there is a need for “intrinsic desirability” of
various states. What makes something desirable is
attaching a goal to it. When the goal changes, it
is no longer desirable.

            BH: The purpose of the state based approach is to

allow it to solve problems that do not complement simple
error reduction of a single controller. For example,
reaching an exit point of a room where one must bypass a
wall first which involves going in the wrong direction.
Simply trying to get as close to the exit as possible
would fail, hence the weighting of states allows it to
bypass this problem by learning that the gap(s) in the
wall are key places to reach in the trajectory. Do I
think this is a sensible approach? Not necessarily at
all. A properly structured hierarchy of error reducing
nodes should solve the problem. Furthermore, as you
stated, when the goal changes, it’s no longer
desirable… and reinforcement learning collapses here.
A robust hierarchy of error reducing controllers does
not suffer in the same way. However, engineers do not
have the patience or skill in many cases to design
appropriately robust hierarchies. Therefore,
reinforcement learning.
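A toy sketch of the room-with-a-wall point above: greedily reducing the distance to the exit stalls against the wall, while value iteration over states learns that the gap is the place to reach first. The grid layout and discount factor are invented purely for illustration.

    # '#' = wall, 'E' = exit, '.' = free space
    grid = ["..#E",
            "..#.",
            "...."]
    rows, cols = len(grid), len(grid[0])
    states = [(r, c) for r in range(rows) for c in range(cols) if grid[r][c] != '#']
    exit_state = next(s for s in states if grid[s[0]][s[1]] == 'E')

    def neighbours(s):
        r, c = s
        steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
        return [(r + dr, c + dc) for dr, dc in steps if (r + dr, c + dc) in states]

    # Value iteration: each state's value backs up from its best neighbour.
    V = {s: 0.0 for s in states}
    for _ in range(50):
        for s in states:
            V[s] = 1.0 if s == exit_state else 0.9 * max(V[n] for n in neighbours(s))

    # Climbing V from (0, 0) routes the agent down and through the gap before
    # heading to E, even though part of that route moves it farther from the
    # exit in straight-line terms.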

                EJ:  I believe there

is room for PCT to entertain the notion of a
gain-adjusting system, which I connect with the
emotion system. This would be driven by the rate of
change in error, (does that make it a first
derivative?) I am not sure how broadly error would
be sampled for this. I don’t know whether the
Amygdala is a possible site for such sampling. But
it does seem that humans and seemingly other animals
utilize broad hormonal effects to adjust the gain of
their control systems. Because certain chemicals
may lean the system in certain directions, this may
be one place where the RL notion of “reward value”
could map onto a PCT understanding of “error
reduction.” However, I still don’t like the
tendency of reward to become a “dormitive
principle,” as Bruce Abbott (2017.09.14.2005 EDT)
pointed out.

              BH: I've been thinking

about (and would definitely test if I had time in my
PhD) the notion of stress being a motivator in
changing the “gains” of a control system (as the
system gets “stressed”, it is more likely to change
gains and also more rapidly). So, I’m also someone
who’s been thinking about emotions being a gain
changer. It’s still a very open topic with lots of
room to explore, however. “Reward Value” mapping onto
“error reduction” is very simple. To maximise reward
is to minimise negative reward (or punishment). Many
reinforcement learning experiments actually use
negative reward, so it’s commonly used, and they aim
to minimise that. So, in both cases, minimisation is
occurring. It’s quite plausible that the “reward
value” signal could be considered a perceptual
reference to the system and then this removes a key
difference. Obviously, they have huge differences in
how they handle the understanding of the world
(perceptual signals versus state based
representation).

                > EJ:  This doesn't

sound very “intrinsic” to me. The claim of
Reinforcement Learning seemed to be that reward
value is an intrinsic property of certain states.
Whereas with PCT, the goal / the reference is the
source of that desirability. Remove that goal, or
choose a different sub-goal, and the supposed
‘intrinsic-ness’ evaporates. (Upon re-reading your
statement, we may be saying essentially the same
thing: Goals = Desirability, intrinsically, or we
could say, by definition.)

              The reward value of the

states is not the source of the desirability
in Reinforcement Learning, the reward function is. The
reward function dictates the desirability of a state.
Technically, this does mean the reward value of the
state is the motivator (i.e. what is desirable) for
the agent, but this does not reflect current
reinforcement learning practice. There are a variety of
paradigms of reinforcement learning, many of which
have much better results than state reward value
driven approaches. In particular, some approaches aim
to select the best policy which maximises
reward, which dodges the whole explicit state action
declaration altogether. To attempt to analogise to
PCT, this would be like an individual PCT node
learning how to change the gain to best keep the
difference between input and reference as low as
possible. Q-Learning, which is very popular, gives
state-action pairs reward values rather than specific
states. This is because, obviously, specific actions
may have different values in different states but also
the other way around.
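For reference, a minimal sketch of the tabular Q-learning update described above, in which values attach to state-action pairs rather than to states; the step-size and discount parameters are illustrative.

    def q_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.95):
        """One tabular Q-learning step:
        Q(s, a) <- Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))."""
        best_next = max(Q.get((next_state, a), 0.0) for a in actions)
        old = Q.get((state, action), 0.0)
        Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
        return Q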

              Reinforcement learning

still suffers from many of the problems from when it
was initially proposed. The curse of dimensionality
means that considering every action and state
combination is expensive. Discretising the action and
state space can be complicated or impossible. Digney’s
attempt to make it hierarchical was fascinating, and
it’s definitely the right way to think about things.
One of the lead researchers at Google DeepMind was
talking to me about how the “curriculum” of deep
neural networks is their next hurdle to go over.

              I find it amusing that

the “bump it upstairs” argument is referenced here, as
it genuinely is the answer. “Changing the goal” is
nothing more than different high level nodes firing
and referencing lower nodes instead of the higher
level nodes that relate to the previous goal. A
diverse hierarchy not only allows adaptive and varied
behaviour, but robust approaches that can handle
disruption. The problem is that in traditional
approaches, deriving these hierarchies proved to be
difficult and often didn’t have anything like the
intended positives (see Subsumption Architecture).
This is why approaches such as reinforcement learning
and deep neural networks have gained traction, using
high amounts of data to develop complex linear
transforms that allow one to bypass the need for
developmental learning of layers of behaviour.

              To actually help address

the original point, reinforcement learning could be
used in a variety of ways. Would it be effective?
Probably not. The biggest underlying conceptual
difference that I can see is that RL assumes that
behaviour is state or action driven, whereas PCT and
other closed loop control theories put behaviour as an
emergent property of the control system’s response to
a problem over time. Trying to put action or state
spaces into PCT anywhere will be problematic.

              Could reinforcement

learning be used to learn the value for gains? It
could, but the results would be poor and it would be overkill.
Could reinforcement learning combined with Deep
Learning act as a reorganiser for a PCT hierarchy?
Yes, but as alluded to before, it should already be
clear where reorganisation is needed if the rate of
error is high. You don’t need RL or DNN (Deep Neural
Networks) to do that. Could Reinforcement Learning
combined with Deep Neural Networks act as a perception
generator? That’s a much more plausible possibility,
but I don’t know where you’d start. It could
definitely learn to identify patterns in the world
that relate to specific problems, and then a PCT
hierarchy could incorporate it and minimise error.
After all, if you have the right perceptions, HPCT
should be more than sufficient. It’s where the
perceptions come from that I think is the thing PCT
doesn’t answer.

Hope this helps,

Ben

                On 28 September 2017 at

13:33, Erling Jorgensen EJorgensen@riverbendcmhc.org
wrote:

[Erling Jorgensen (2017.09.28 0739 EDT)]

Rupert Young (2017.09.28 10.30)

Hi Rupert,

                        Thanks for the reply.  I was wondering

about your reactions to what I wrote.

                        >RY:  Well, I think "intrinsic

desirability" is related to PCT goals in
that certain states represent sub-goals on
the way to achieving a man [main?] goal, and
some states will help attain that better
than others. However, this may only be valid
in well-defined, constrained environments
such as games where you are dealing with
discrete states rather than dynamic
environments where the reference values
continually change.

                        EJ:  This doesn't sound very "intrinsic"

to me. The claim of Reinforcement
Learning seemed to be that reward value is
an intrinsic property of certain states.
Whereas with PCT, the goal / the reference
is the source of that desirability. Remove
that goal, or choose a different sub-goal,
and the supposed ‘intrinsic-ness’
evaporates. (Upon re-reading your
statement, we may be saying essentially the
same thing: Goals = Desirability,
intrinsically, or we could say, by
definition.)

                        EJ:  You had asked about the overlap or

differences of a mapping of Reinforcement
Learning onto Perceptual Control Theory. At
present in PCT we do not have a good
switching mechanism to see the change of
goals in action, other than ‘Bump it
Upstairs’ with postulating a higher level
reference standard. We also view a Program
level perception as a network of
contingencies that, once a higher level
goal has been determined, can navigate among
descending reference specifications,
depending on the perceptual results achieved
so far (i.e., those are the contingency
nodes.)

                        EJ:  It is possible that "the four main

elements of a reinforcement learning system"
that you outlined [in Rupert Young
(2017.09.26 9.45)] could fill that gap. RL
includes a mapping from goal-related
environmental states, to reward values, to
expected accumulation of reward values, to
planning among action options. So there are
inserted two potentially measurable
functions – reward values and expected
reward values – to guide selection of goals
and sub-goals.

                        EJ:  My proposal and preference is for a

gain-adjusting system, (where a zero level
of gain can suspend a goal). I believe that
error-reduction in PCT already fills that
intermediary role that RL assigns to rewards
and the expected value of accumulating those
rewards. But PCT still needs to have some
mechanism to switch goals on and off. It is
not yet very specified to say, in effect,
“the HPCT hierarchy does it.” Moreover, we
already see (perhaps as a mixture of theory
and phenomenon) that goals are carried out
with various levels of gain. So PCT has the
problem of discerning a mechanism for
adjusting gain anyway. I would like to see
a more direct tackling of that issue, rather
than importing concepts (and perhaps
assumptions) from Reinforcement Learning.

                        EJ:  That's my take on it, at any rate. 

All the best.

Erling




Heather C. Broccard-Bell, Ph.D.
Postdoctoral Fellow, University of California San Diego
Honey Bee Communication, Nieh Lab
Division of Biological Sciences, Section of Ecology, Behavior, and Evolution
Lecturer, University of San Diego; University of California San Diego
Departments of Psychological Sciences (Psychology); Departments of Biology
619.757.4694

[From Rick Marken (2017.10.07.1000)]

···

Rupert Young (2017.10.05 22.00)–

RY: One reason I am interested in this is that I think artificial PCT

systems will need to embody learning to progress beyond fairly basic
systems. This requires a formal definition of PCT learning, which,
unfortunately, is currently lacking. I think it will also be
important for PCT to be taken seriously by the mainstream.

RM: Reorganization is “learning” in PCT.

RY: Although it may be correct that learning methods, such as RL,

have substantial drawbacks, they do have a formal definition. They
also appear to be quite successful in some domains of constrained
states and discrete actions, such as games.

RM: But they are based on what PCT shows to be an incorrect understanding of how control works. And what is called RL in robotics is often not actually RL, as is the case in the pendulum learning paper (Learning to Control an Inverted Pendulum Using Neural Networks by Charles W. Anderson).

RM: I agree that it’s important to include a capability to learn (in terms of improving performance – control – or learning how to control something that the system was unable to control before). But I believe that a PCT-based approach to robotics should start by ruling out RL as an approach to learning and explain why: RL is based on an S-R concept of how behavior (control) works.


RY: It would be useful to understand why they are successful. It would

also be great if we could formulate a definition of PCT learning not
only for continuous variable control systems, but also for the above
game-type scenarios.

RM: Reorganization should work for all “scenarios”, which I take to mean it should work for learning to control all types of variables, including the higher level variables (like programs and principles) that are controlled in games.

RY: So, any thoughts on how PCT could be applied to learning, or even

just playing, games would be of great interest.

RM: I think the higher level perceptions to be controlled in the game – perceptions like “control of the center”, “maintain material advantage”, “protect queen and king” in a chess game – would be “givens”. I agree with previous discussions that suggested that the types of perceptions we control have developed through evolution; our brains have evolved the ability to perceive the world in terms of principles like “control of the center”, for example. I think it’s highly unlikely that perceptual functions that can perceive these complex variables could be developed within the time frame in which we typically learn things – hours, days or even weeks. So robot learning would consist mainly of reorganizing the output functions that provide the references to the systems that control lower level perceptions that are the means of controlling these higher level perceptions.
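Here is a sketch of the division of labour RM describes, assuming a two-level arrangement made up purely for illustration: the perceptual input functions are fixed “givens”, and what learning would adjust are the output weights that turn a higher-level error into reference values for the lower-level systems.

    def higher_level_output(higher_ref, higher_perception, output_weights):
        """Turn the higher-level error into references for the lower systems;
        these output weights are what reorganization would adjust."""
        error = higher_ref - higher_perception
        return [w * error for w in output_weights]

    def lower_level_outputs(lower_refs, lower_perceptions, lower_gains):
        """Each lower system acts in proportion to its own error; its
        perceptual input function is treated as a fixed given."""
        return [g * (r - p) for g, r, p in zip(lower_gains, lower_refs, lower_perceptions)]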

Best

Rick

Richard S. Marken

"Perfection is achieved not when you have nothing more to add, but when you
have nothing left to take away."
                --Antoine de Saint-Exupery
