Hi all,
Thought I’d pitch in as someone working on the reorganisation of hierarchies who has also studied reinforcement learning.
Well done on your find there, Rupert! Digney’s work is hard to track down. His work from about 2001 onwards is near impossible to find because he went to work for the military…
Some observations, I’m afraid not in chronological order…
EJ: This is one of those extra intermediary concepts that takes on a driving force in RL. A more direct link that fills this role in PCT is the one listed above: reduction of error in attaining a goal. I don’t see that there is a need for “intrinsic desirability” of various states. What makes something desirable is attaching a goal to it. When the goal changes, it is no longer desirable.
BH: The purpose of the state-based approach is to let it solve problems that don’t suit simple error reduction by a single controller. For example, reaching the exit of a room where one must first get past a wall, which involves going in the wrong direction. Simply trying to get as close to the exit as possible would fail; weighting the states lets the agent get around this by learning that the gap(s) in the wall are key places to reach along the trajectory (see the sketch below). Do I think this is a sensible approach? Not necessarily. A properly structured hierarchy of error-reducing nodes should solve the problem. Furthermore, as you stated, when the goal changes, the state is no longer desirable… and reinforcement learning collapses here. A robust hierarchy of error-reducing controllers does not suffer in the same way. However, engineers often lack the patience or skill to design appropriately robust hierarchies. Hence, reinforcement learning.
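To make that concrete, here is a minimal sketch in Python (my own toy construction, not Digney’s formulation): a 5x5 grid where the exit sits behind a wall with a single gap. Greedy distance-minimisation from the start jams against the wall, while plain value iteration over the states learns values that route the agent up through the gap first. The grid layout and constants are invented purely for illustration.

    # Toy 5x5 grid: exit at (4, 4), wall down column 2 with one gap at (0, 2).
    # Greedily minimising distance to the exit from (4, 0) hits the wall;
    # the learned state values send the agent "the wrong way" to the gap.
    GOAL = (4, 4)
    WALL = {(1, 2), (2, 2), (3, 2), (4, 2)}
    STATES = [(r, c) for r in range(5) for c in range(5) if (r, c) not in WALL]
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]

    def neighbours(s):
        for dr, dc in MOVES:
            n = (s[0] + dr, s[1] + dc)
            if n in STATES:
                yield n

    V = {s: 0.0 for s in STATES}
    for _ in range(50):                      # value iteration, reward -1 per step
        for s in STATES:
            if s != GOAL:
                V[s] = max(-1.0 + V[n] for n in neighbours(s))

    s, path = (4, 0), [(4, 0)]               # follow the learned state values
    while s != GOAL:
        s = max(neighbours(s), key=V.get)
        path.append(s)
    print(path)                              # climbs up to the gap at (0, 2) first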
EJ: I believe there is room for PCT to entertain the notion of a gain-adjusting system, which I connect with the emotion system. This would be driven by the rate of change in error (does that make it a first derivative?). I am not sure how broadly error would be sampled for this. I don’t know whether the Amygdala is a possible site for such sampling. But it does seem that humans and seemingly other animals utilize broad hormonal effects to adjust the gain of their control systems. Because certain chemicals may lean the system in certain directions, this may be one place where the RL notion of “reward value” could map onto a PCT understanding of “error reduction.” However, I still don’t like the tendency of reward to become a “dormitive principle,” as Bruce Abbott (2017.09.14.2005 EDT) pointed out.
BH: I’ve been thinking about (and would definitely test, had I time in my PhD) the notion of stress as a motivator for changing the “gains” of a control system: as the system gets “stressed”, it becomes more likely to change its gains, and to change them more rapidly. So I’m also someone who has been thinking of emotions as a gain changer, though it’s still a very open topic with lots of room to explore. Mapping “reward value” onto “error reduction” is very simple: to maximise reward is to minimise negative reward (or punishment). Many reinforcement learning experiments actually use negative reward, which they aim to minimise, so this framing is common; in both cases, minimisation is occurring (see the toy illustration below). It’s quite plausible that the “reward value” signal could be considered a perceptual reference for the system, which removes a key difference. Obviously, the two still differ hugely in how they represent the world (perceptual signals versus state-based representations).
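As a toy illustration of that mapping (my own, with made-up numbers): negate the magnitude of a control system’s error and you have an RL-style reward signal, so maximising the one is exactly minimising the other.

    # Treat the negated magnitude of a controller's error as a reward:
    # maximising this "reward" is the same optimisation as minimising error.
    def error(reference, perception):
        return reference - perception

    def reward(reference, perception):
        return -abs(error(reference, perception))   # a negative reward / punishment

    print(reward(10.0, 7.0))   # -3.0, rising toward 0 as the error shrinks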
EJ: This doesn’t sound very “intrinsic” to me. The claim of Reinforcement Learning seemed to be that reward value is an intrinsic property of certain states. Whereas with PCT, the goal / the reference is the source of that desirability. Remove that goal, or choose a different sub-goal, and the supposed ‘intrinsic-ness’ evaporates. (Upon re-reading your statement, we may be saying essentially the same thing: Goals = Desirability, intrinsically, or we could say, by definition.)
The reward value of a state is not the source of desirability in reinforcement learning; the reward function is. The reward function dictates the desirability of a state. Technically this does make the state’s reward value the motivator (i.e. what is desirable) for the agent, but that doesn’t reflect current reinforcement learning work. There are many paradigms of reinforcement learning, several of which get much better results than approaches driven by state reward values. In particular, policy-based approaches aim to select the policy that maximises reward, dodging explicit state-action declarations altogether. To attempt an analogy to PCT, this would be like an individual PCT node learning how to change its gain so as to keep the difference between input and reference as low as possible. Q-learning, which is very popular, assigns reward values to state-action pairs rather than to specific states, because a given action may obviously have different values in different states, and vice versa (see the sketch below).
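For the record, here is the standard tabular Q-learning update in miniature, just to show the values attaching to state-action pairs rather than states; the action set and constants are placeholders of my choosing.

    # Tabular Q-learning: note the table is keyed by (state, action) pairs.
    from collections import defaultdict
    import random

    Q = defaultdict(float)                 # Q[(state, action)] -> estimated return
    ACTIONS = ["left", "right"]            # placeholder action set
    alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration

    def choose(state):
        if random.random() < epsilon:      # occasionally explore
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q[(state, a)])

    def update(state, action, r, next_state):
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (r + gamma * best_next - Q[(state, action)])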
Reinforcement learning still suffers from many of the problems it had when it was initially proposed. The curse of dimensionality means that considering every state-action combination is expensive, and discretising the action and state spaces can be complicated or impossible (a back-of-the-envelope illustration follows). Digney’s attempt to make it hierarchical was fascinating, and it’s definitely the right way to think about things. One of the lead researchers at Google DeepMind was telling me that the “curriculum” for deep neural networks is their next hurdle to overcome.
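A back-of-the-envelope illustration of that curse, with invented numbers:

    # Discretising each of d continuous dimensions into k bins gives k**d states.
    k, d = 10, 6                 # e.g. a 6-joint arm, 10 bins per joint
    n_states = k ** d            # 1,000,000 states
    n_pairs = n_states * 3 ** d  # 3 discrete actions per joint: 729,000,000 pairs
    print(n_states, n_pairs)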
I find it amusing that the “bump it upstairs” argument is referenced here, as it genuinely is the answer. “Changing the goal” is nothing more than different high-level nodes firing and sending references to lower nodes, instead of the higher-level nodes that relate to the previous goal (a minimal sketch follows). A diverse hierarchy allows not only adaptive and varied behaviour but also robust approaches that can handle disruption. The problem is that, in traditional approaches, deriving these hierarchies proved difficult and often failed to deliver the intended benefits (see Subsumption Architecture). This is why approaches such as reinforcement learning and deep neural networks have gained traction: they use large amounts of data to develop complex non-linear transformations, bypassing the need for developmental learning of layers of behaviour.
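Here is a minimal sketch of what I mean, with an invented two-level structure and arbitrary gains: the same lower-level node serves whichever higher-level node is currently sending it a reference, so “changing the goal” touches nothing below.

    # One lower-level error-reducing node, re-used by whichever
    # higher-level node currently holds the goal.
    class Node:
        def __init__(self, gain):
            self.gain = gain
        def output(self, reference, perception):
            return self.gain * (reference - perception)

    lower = Node(gain=2.0)
    reach_exit = Node(gain=0.5)      # higher-level node for one goal
    return_home = Node(gain=0.5)     # higher-level node for another

    def step(active_higher, top_ref, high_percept, low_percept):
        low_ref = active_higher.output(top_ref, high_percept)  # descends as a reference
        return lower.output(low_ref, low_percept)

    # "Changing the goal" = routing through a different higher node;
    # the lower node itself is untouched.
    print(step(reach_exit, 1.0, 0.2, 0.1))
    print(step(return_home, 0.0, 0.2, 0.1))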
To actually address the original point: reinforcement learning could be used in a variety of ways. Would it be effective? Probably not. The biggest underlying conceptual difference I can see is that RL assumes behaviour is state- or action-driven, whereas PCT and other closed-loop control theories treat behaviour as an emergent property of the control system’s response to a problem over time (a caricature of the difference is sketched below). Trying to put action or state spaces anywhere into PCT will be problematic.
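A caricature of that difference, under the simplest assumption I could make (a plant that just integrates the controller’s output): there is no state-action table anywhere; the behaviour is simply the trajectory that falls out of repeatedly cancelling error.

    # No states, no action table: the controller only ever cancels error,
    # and the behaviour emerges from the loop acting over time.
    def control_loop(reference, gain=0.5, steps=20):
        perception = 0.0
        for _ in range(steps):
            error = reference - perception
            perception += gain * error      # the plant integrates the output
        return perception

    print(control_loop(10.0))               # converges on the reference, ~10.0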
Could reinforcement learning be used to learn values for gains? It could, but the results would be poor and it would be overkill. Could reinforcement learning combined with deep learning act as a reorganiser for a PCT hierarchy? Yes, but as alluded to above, it should already be clear what needs reorganising wherever the rate of error is high; you don’t need RL or DNNs (deep neural networks) for that (see the sketch below). Could reinforcement learning combined with deep neural networks act as a perception generator? That’s a much more plausible possibility, though I don’t know where you’d start. It could certainly learn to identify patterns in the world that relate to specific problems, which a PCT hierarchy could then incorporate and use to minimise error. After all, given the right perceptions, HPCT should be more than sufficient; where the perceptions come from is the thing I think PCT doesn’t answer.
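To make that point concrete, here is one sketch of the simpler alternative, loosely in the spirit of Powers’ e-coli reorganisation; the trigger condition and constants are entirely made up. The gain is randomly perturbed whenever squared error stops falling, and left alone while things improve.

    import random

    def reorganising_controller(reference, steps=500):
        perception, gain, tumble = 0.0, 0.0, 0.05   # zero gain: no control at first
        prev_sq = float("inf")
        for _ in range(steps):
            err = reference - perception
            perception += gain * err
            if err * err > 0.99 * prev_sq:           # barely improving: tumble the gain
                gain = min(1.0, max(0.0, gain + random.uniform(-tumble, tumble)))
            prev_sq = err * err
        return gain, perception

    print(reorganising_controller(5.0))   # gain drifts until error keeps falling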
Hope this helps,
Ben
On 28 September 2017 at 13:33, Erling Jorgensen EJorgensen@riverbendcmhc.org wrote:
[Erling Jorgensen (2017.09.28 0739 EDT)]
Rupert Young (2017.09.28 10.30)
Hi Rupert,
Thanks for the reply. I was wondering about your reactions to what I wrote.
RY: Well, I think “intrinsic desirability” is related to PCT goals in that certain states represent sub-goals on the way to achieving a main goal, and some states will help attain that better than others. However, this may only be valid in well-defined, constrained environments such as games where you are dealing with discrete states rather than dynamic environments where the reference values continually change.
EJ: This doesn’t sound very “intrinsic” to me. The claim of Reinforcement Learning seemed to be that reward value is an intrinsic property of certain states. Whereas with PCT, the goal / the reference is the source of that desirability. Remove that goal, or choose a different sub-goal, and the supposed ‘intrinsic-ness’ evaporates. (Upon re-reading your statement, we may be saying essentially the same thing: Goals = Desirability, intrinsically, or we could say, by definition.)
EJ: You had asked about the overlap or differences of a mapping of Reinforcement Learning onto Perceptual Control Theory. At present in PCT we do not have a good switching mechanism to see the change of goals in action, other than ‘Bump it Upstairs’ with postulating a higher level reference standard. We also view a Program level perception as a network of contingencies that, once a higher level goal has been determined, can navigate among descending reference specifications, depending on the perceptual results achieved so far (i.e., those are the contingency nodes.)
EJ: It is possible that “the four main elements of a reinforcement learning system” that you outlined [in Rupert Young (2017.09.26 9.45)] could fill that gap. RL includes a mapping from goal-related environmental states, to reward values, to expected accumulation of reward values, to planning among action options. So there are inserted two potentially measurable functions – reward values and expected reward values – to guide selection of goals and sub-goals.
EJ: My proposal and preference is for a gain-adjusting system, (where a zero level of gain can suspend a goal). I believe that error-reduction in PCT already fills that intermediary role that RL assigns to rewards and the expected value of accumulating those rewards. But PCT still needs to have some mechanism to switch goals on and off. It is not yet very specified to say, in effect, “the HPCT hierarchy does it.” Moreover, we already see (perhaps as a mixture of theory and phenomenon) that goals are carried out with various levels of gain. So PCT has the problem of discerning a mechanism for adjusting gain anyway. I would like to see a more direct tackling of that issue, rather than importing concepts (and perhaps assumptions) from Reinforcement Learning.
EJ: That’s my take on it, at any rate. All the best.
Erling