Reinforcement Learning

Fairly typical!

···

On 29 September 2017 at 12:34, Rupert Young rupert@perceptualrobots.com wrote:

[From Rupert Young (2017.09.29 12.35) ]

Very useful response Ben. My comments below.

On 29/09/2017 02:50, B Hawker wrote:

      BH: The purpose of the state-based approach is to allow it to solve problems that are not amenable to simple error reduction by a single controller. For example, reaching an exit point of a room where one must first bypass a wall, which involves going in the wrong direction. Simply trying to get as close to the exit as possible would fail, hence the weighting of states allows it to bypass this problem by learning that the gap(s) in the wall are key places to reach in the trajectory. Do I think this is a sensible approach? Not necessarily at all. A properly structured hierarchy of error-reducing nodes should solve the problem.

Do such examples actually exist yet? I've been thinking about the Mountain Car problem (https://en.wikipedia.org/wiki/Mountain_car_problem) which would seem a good and reasonably simple learning problem, of continuous variables, that PCT should be able to address, and is analogous to your problem above. Do we have a hierarchy of error-reducing nodes that can solve that problem?

      BH: To actually help address the original point, reinforcement learning could be used in a variety of ways. Would it be effective? Probably not. The biggest underlying conceptual difference that I can see is that RL assumes that behaviour is state or action driven, whereas PCT and other closed-loop control theories put behaviour as an emergent property of the control system's response to a problem over time. Trying to put action or state spaces into PCT anywhere will be problematic.

I guess this RL assumption comes out of the trad AI problem-solving legacy, which involved discrete state spaces. PCT still needs to provide ways of solving such cases, which would address the high-level reasoning (or program-level) aspects of cognition. Perhaps it's a matter of re-framing state spaces and actions as perceptions and goals.

      BH: Could reinforcement learning be used to learn the values for gains? It could, but the results would be poor and it would be overkill. Could reinforcement learning combined with Deep Learning act as a reorganiser for a PCT hierarchy? Yes, but I think, as alluded to before, it should be clear what needs reorganising anyway if the rate of error is high. You don't need RL or DNNs (Deep Neural Networks) to do that. Could Reinforcement Learning combined with Deep Neural Networks act as a perception generator? That's a much more plausible possibility, but I don't know where you'd start. It could definitely learn to identify patterns in the world that relate to specific problems, and then a PCT hierarchy could incorporate it and minimise error. After all, if you have the right perceptions, HPCT should be more than sufficient. It's where the perceptions come from that I think is the thing PCT doesn't answer.

Yes, I think that although there is some work on learning gains (arm reorg in LCS III), what is lacking is how perceptual functions are learned. Though we should be able to take techniques from neural networks for this; autoencoders or convolutional neural networks.

On a slightly different matter I came across this paper which uses RL for adaptive control systems, rather than discrete state spaces, which is much closer to what PCT normally addresses.

"Reinforcement learning and optimal adaptive control: An overview

and implementation examples"

https://pdfs.semanticscholar.org/b4cf/9d3847979476af5b1020c61367fa03626d22.pdf

I'm still looking at it so not sure what it is doing yet. Are you familiar with this approach?

Rupert

[From Rupert Young (2017.09.29 18.40)]

(Rick Marken (2017.09.29.0845)]

The purpose of the discussion is to clarify what is invalid about the RL approach. It is a very popular topic at the moment, and successful in some domains. So PCT should be able to answer what is wrong with it, and also define an alternative which is better.

As I summarised previously "RL is a method of learning by interacting with the environment, and evaluating the consequences of actions according to goals", which also describes PCT learning. So perhaps there is some overlap between the two methodologies.

Regards,

Rupert
···

RM: Control is only control of input. There is no such
thing as control of output. Just as there is no such thing
as reinforcement, which is why I find this continuing
discussion of reinforcement learning rather puzzling.

[From Rick Marken (2017.09.29.1545)]

···

Rupert Young (2017.09.29 18.40)

RY: The purpose of the discussion is to clarify what is invalid about the RL approach. It is a very popular topic at the moment, and successful in some domains. So PCT should be able to answer what is wrong with it, and also define an alternative which is better.

 RM: It seems to me that that could be done rather quickly. Reinforcement refers to strengthening; what is strengthened is a causal link; the probability that event A causes event B, where event B is some “action” or “behavior”. So reinforcement learning can be used to make some action more probable than others. It can’t be used to teach control since increasing the probability of an action will not result in the ability to vary actions appropriately to control the consequence of those actions in a disturbance prone world.

RY: As I summarised previously "RL is a method of learning by interacting with the environment, and evaluating the consequences of actions according to goals", which also describes PCT learning. So perhaps there is some overlap between the two methodologies.

RM: The consequences that are evaluated (better term is probably monitored) in PCT learning are the levels of error in the control systems that make up the control hierarchy. The PCT learning system doesn't know what consequences are being controlled by these control systems; it just knows whether the systems are maintaining the consequences in their reference states; if they are, error will be small, if not it will be large. It is the size of the error in control systems that drives learning. And what learning consists of is varying the parameters of the control – the gain of the output function and/or the nature of the perceptual function – until there is close to zero error in a control system once again.

RM: You might think reinforcement learning could be used to teach control systems how to control better by "strengthening" their gain. But increasing gain will not always improve control (a fast control system has to have lower gain than a slow one).

RM: The most important result of learning to control must be a stable control system; stability is achieved by proper adjustment of the parameters of control. I know, from building software control models, that stabilizing a control system is often no mean task. But I've started to build my own Lego robot (based on your help) and I'm finding that is true in spades for control systems implemented in hardware (where the feedback loop goes through the real world). This discussion has encouraged me to put a PCT learning (reorganization) system into my robot to see if the E. coli algorithm can produce a more stable robot than I can produce on my own. I'll report back if it works; you won't hear a peep from me if it doesn't ;-)
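For anyone curious what that E. coli scheme can look like in code, here is a minimal sketch (an illustration only, not the LCS III source): a single tracking loop whose output gain is nudged by a random amount; while the accumulated error keeps falling the same nudge is repeated, and when it rises the system "tumbles" to a new random nudge.

```python
import random

def ecoli_reorganize(reorg_steps=200, dt=0.01):
    """Minimal sketch of E. coli reorganization (illustrative, not the
    LCS III code): randomly perturb one control parameter -- here the
    output gain of a single tracking loop -- keep moving in the same
    direction while accumulated error falls, and tumble to a new random
    direction when it rises."""
    gain = 1.0
    delta = random.uniform(-0.5, 0.5)
    prev_total = float("inf")
    for _ in range(reorg_steps):
        # short episode of compensatory tracking against a random disturbance
        p, total = 0.0, 0.0
        for _ in range(500):
            d = random.gauss(0.0, 0.2)      # unpredictable disturbance
            e = 0.0 - p                     # reference is zero
            p += dt * (gain * e + d)        # output and disturbance both move the input
            total += e * e
        if total < prev_total:
            gain += delta                   # error fell: keep changing in this direction
        else:
            delta = random.uniform(-0.5, 0.5)
            gain += delta                   # error rose: "tumble" to a new random direction
        prev_total = total
    return gain

print(ecoli_reorganize())   # the gain typically drifts up until control is tight
```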

Best regards

Rick

Regards,

Rupert


Richard S. Marken

"Perfection is achieved not when you have nothing more to add, but when you
have nothing left to take away."
                --Antoine de Saint-Exupery

          RM: Control is only control of input. There is no such

thing as control of output. Just as there is no such thing
as reinforcement, which is why I find this continuing
discussion of reinforcement learning rather puzzling.

[From Rick Marken (2017.09.29.1545)]
RM: It seems to me that that could be done rather quickly. Reinforcement refers to strengthening; what is strengthened is a causal link; the probability that event A causes event B, where event B is some "action" or "behavior". So reinforcement learning can be used to make some action more probable than others. It can't be used to teach control since increasing the probability of an action will not result in the ability to vary actions appropriately to control the consequence of those actions in a disturbance prone world.

BH: Well, I'm afraid that isn't the case. The literature clearly shows that reinforcement learning can do such, because the encoding of the state encodes information which allows it to vary the choice in actions. Therefore, it is more than able to control by considering relevant features it needs to control in state-action pairings (i.e. Q-Learning).

RM: The consequences that are evaluated (better term is probably monitored) in PCT learning are the levels of error in the control systems that make up the control hierarchy. The PCT learning system doesn't know what consequences are being controlled by these control systems; it just knows whether the systems are maintaining the consequences in their reference states; if they are, error will be small, if not it will be large. It is the size of the error in control systems that drives learning. And what learning consists of is varying the parameters of the control -- the gain of the output function and/or the nature of the perceptual function -- until there is close to zero error in a control system once again.
RM: You might think reinforcement learning could be used to teach control systems how to control better by "strengthening" their gain. But increasing gain will not always improve control (a fast control system has to have lower gain than a slow one).

BH: I think you're confusing what reinforcement learning is, as a mathematical tool, with what reinforcement is as a concept. Reinforcement learning allows an agent to learn suitable actions (so, for example, a specific parameter set of a controller) in some particular state (what your input and reference are, or other details of the world) given some policy (we want to minimise the rate of error). Reinforcement learning doesn't just "strengthen" things, it's an action selection tool based on state information.
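For concreteness, here is a minimal tabular Q-learning sketch (illustrative only: the action set and parameter values are placeholders, not anything from a demo discussed in this thread). Nothing in it "strengthens" an action unconditionally; what changes is the value attached to an action in a given state, which is the distinction being drawn here.

```python
import random
from collections import defaultdict

Q = defaultdict(float)              # maps (state, action) pairs to estimated value
actions = [-1, 0, +1]               # placeholder action set
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def select_action(state):
    """Epsilon-greedy: usually pick the action with the highest value in
    this state, occasionally explore at random."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    """Standard one-step Q-learning update: nudge Q(s, a) toward the reward
    plus the discounted value of the best action in the next state."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```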

···

On 29 September 2017 at 23:46, Richard Marken rsmarken@gmail.com wrote:

[Martin Taylor 2017.09.19.23.03]

Not knowing much about it, what you say and what I read elsewhere about reinforcement learning strikes me as avoiding the specific thing that PCT does differently and well. Reinforcement could quite probably do the equivalent of adjusting the linkages among "action components" (in PCT, lower-level control systems) that PCT assigns to reorganization, and therefore might be able to build the structure. But it seems to depend on having essentially perfect knowledge of the environment state in order to produce the "right" action at any moment.

PCT, on the other hand, works on the basis that things are happening in the environment of which the system knows nothing in advance. When the environmental state changes, PCT just does more or less of whatever is working, at least at the level of continuous variables. There's no sense of having to search for the proper way to deal with the minimally changed circumstances as things evolve. Though choice is often required, it is among a low integer number of possibilities, such as "shall I walk, cycle, drive, or take the bus?", not a matter of finding a match to the current conditions among the large number of possible states that could result in pre-learned actions.

Maybe my understanding of reinforcement learning is all wrong, but the way I see it at the moment, the big difference is one of magnitude. Instead of an exponentially explosive number of state-action pairs that must continually be learned in advance and then re-coordinated as unpredicted things happen in the environment, PCT develops a repertoire of processes that know not much more than whether to increase or decrease their output at any moment. Rather than the exponentially large number of state-action pairs – which could, in principle but I think not in practice, do the same job – PCT is based on linearly additive numbers of possibilities for evoking the kinds of action that might be required for different kinds of unexpected variations in the environment.

I await correction.
Martin

···

On 2017/09/29 7:18 PM, B Hawker wrote:

        On 29 September 2017 at 23:46, Richard Marken rsmarken@gmail.com wrote:

[From Rick Marken (2017.09.29.1545)]

          RM: It seems to me that that could be done rather quickly. Reinforcement refers to strengthening; what is strengthened is a causal link; the probability that event A causes event B, where event B is some "action" or "behavior". So reinforcement learning can be used to make some action more probable than others. It can't be used to teach control since increasing the probability of an action will not result in the ability to vary actions appropriately to control the consequence of those actions in a disturbance prone world.

      BH: Well, I'm afraid that isn't the case. The literature clearly shows that reinforcement learning can do such, because the encoding of the state encodes information which allows it to vary the choice in actions. Therefore, it is more than able to control by considering relevant features it needs to control in state-action pairings (i.e. Q-Learning).

          RM: The consequences that are evaluated (better term is probably monitored) in PCT learning are the levels of error in the control systems that make up the control hierarchy. The PCT learning system doesn't know what consequences are being controlled by these control systems; it just knows whether the systems are maintaining the consequences in their reference states; if they are, error will be small, if not it will be large. It is the size of the error in control systems that drives learning. And what learning consists of is varying the parameters of the control – the gain of the output function and/or the nature of the perceptual function – until there is close to zero error in a control system once again.

          RM: You might think reinforcement learning could be used to teach control systems how to control better by "strengthening" their gain. But increasing gain will not always improve control (a fast control system has to have lower gain than a slow one).

      BH: I think you're confusing what reinforcement learning is, as a mathematical tool, with what reinforcement is as a concept. Reinforcement learning allows an agent to learn suitable actions (so, for example, a specific parameter set of a controller) in some particular state (what your input and reference are, or other details of the world) given some policy (we want to minimise the rate of error). Reinforcement learning doesn't just "strengthen" things, it's an action selection tool based on state information.

MT: Not knowing much about it, what you say and what I read elsewhere about reinforcement learning strikes me as avoiding the specific thing that PCT does differently and well. Reinforcement could quite probably do the equivalent of adjusting the linkages among "action components" (in PCT, lower-level control systems) that PCT assigns to reorganization, and therefore might be able to build the structure. But it seems to depend on having essentially perfect knowledge of the environment state in order to produce the "right" action at any moment.

BH: You're on the right lines here, but it doesn't need perfect knowledge. Recent algorithms are able to capably solve POMDPs (Partially Observable Markov Decision Processes). Furthermore, Deep Learning with Reinforcement Learning has been able to solve complex motor tasks which involve exploring unknown possibilities. Was it easy for them? Not at all, tons of training data required. But you're on the right lines that RL requires a lot of training time with good data before it can make good choices.

MT: PCT, on the other hand, works on the basis that things are happening in the environment of which the system knows nothing in advance. When the environmental state changes, PCT just does more or less of whatever is working, at least at the level of continuous variables. There's no sense of having to search for the proper way to deal with the minimally changed circumstances as things evolve, though choice is often required, it is among a low integer number of possibilities, such as "shall I walk, cycle, drive, or take the bus?", not to find a match to the current conditions among the large number of possible states that could result in pre-learned actions.

BH: PCT is indeed a reactive controller (which is the useful term I've found used) and RL is more of an adaptive function (it learns how to adapt to the environment, but isn't built on learning the reactive behaviour to simply reduce error, for example). Obviously, there are downsides to both. RL requires a lot of training time to learn adaptive behaviour, and PCT isn't able to do anticipatory control without extra perceptions and probably a variety of levels to the hierarchy. For example, anticipating an impact while balancing: before it hits, you can lean forward to improve your balance post-impact. This is an example of something that is not innate to reactive controllers and is thus quite a difficult problem. While solvable under PCT, it definitely isn't something easy to derive and requires complex perceptions.

MT: Maybe my understanding of reinforcement learning is all wrong, but the way I see it at the moment, the big difference is one of magnitude. Instead of an exponentially explosive number of state-action pairs that must continually be learned in advance and then re-coordinated as unpredicted things happen in the environment, PCT develops a repertoire of processes that know not much more than whether to increase or decrease their output at any moment. Rather than the exponentially large number of state-action pairs -- which could, in principle but I think not in practice, do the same job -- PCT is based on linearly additive numbers of possibilities for evoking the kinds of action that might be required for different kinds of unexpected variations in the environment.

BH: So to speak, yes. The behaviour in RL is encoded in the action-state pairs and their weightings which take ages to compute. In PCT, the behaviour is in minimising the error between the perceptual inputs and the references. The complexity there comes in at what the perceptions are, which the system itself doesn't handle. So, what PCT actually does is more simple and elegant. However, what perceptions you choose for a system massively affect the behaviour! RL just crunches every possible interaction of action and state with big data, but for PCT, a lot of this has been pruned out by simply what perceptions are selected. PCT will not find behaviour that the engineer doesn't effectively allow it to have since the complexity can only be as much as the perceptions it's given. A system could not consider or control a perception which it doesn't have as an input. While RL can, it will take a long time and a lot of good data to crunch this... as well as stopping the learning at the right time, as often it doesn't know when to stop.
BH: So, to avoid the problem of either the engineer having to hand select perceptions and order them to produce hierarchies or to leave big data to crunch it, it would be great if there were some way to have a control system learn how to hierarchically arrange control nodes such that it can learn how to control perceptions. (Hey, that's my PhD!)
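By contrast, a single PCT control unit really is this small. A minimal sketch (illustrative parameter values only), in which all of the "behaviour" is the loop driving its perception toward the reference against a disturbance:

```python
import math

def pct_unit(reference=1.0, gain=50.0, dt=0.01, steps=1000):
    """One elementary control unit: perceive, compare with the reference,
    integrate the error into output; the environment adds a disturbance."""
    perception, output = 0.0, 0.0
    for t in range(steps):
        disturbance = 0.5 * math.sin(0.01 * t)   # slowly varying, unknown to the unit
        error = reference - perception
        output += gain * error * dt              # integrating output function
        perception = output + disturbance        # environment: output plus disturbance
    return perception

print(pct_unit())   # ends close to the reference in spite of the disturbance
```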

MT: I await correction.

BH: Not much correction at all! Just some clarification and putting in terms that help describe the phenomena you put forward from other fields ;)

[From Rupert Young (2017.09.30 14.00)]

(Rick Marken (2017.09.29.1545)]

That sounds fine for an individual continuous control system, but how about when it requires changing (switching) control systems? For example, to control car speed we have to learn to switch between control systems for brake and throttle. We learn this pretty quickly so it seems unlikely to be part of the same process (of varying the parameters of control), so what is involved in learning in this case?

Regards,

Rupert
···

RM: The consequences that are evaluated (better term is probably monitored) in PCT learning are the levels of error in the control systems that make up the control hierarchy. The PCT learning system doesn't know what consequences are being controlled by these control systems; it just knows whether the systems are maintaining the consequences in their reference states; if they are, error will be small, if not it will be large. It is the size of the error in control systems that drives learning. And what learning consists of is varying the parameters of the control – the gain of the output function and/or the nature of the perceptual function – until there is close to zero error in a control system once again.

[From Rupert Young (2017.09.30 14.15) ]

RY: Do such examples actually exist yet? I've been thinking about the Mountain Car problem (https://en.wikipedia.org/wiki/Mountain_car_problem) which would seem a good and reasonably simple learning problem, of continuous variables, that PCT should be able to address, and is analogous to your problem above. Do we have a hierarchy of error reducing nodes that can solve that problem?

      Nope. Very complicated problem. Would take a long time to derive the hierarchy, it's got a lot of complex perceptions there.

Is it that bad? Learning aside, I would have thought a hierarchy could be manually constructed to control such variables as position, speed and something that represents the oscillation of the car up each side of the valley.

RY: On a slightly different matter I came across this paper which uses RL for adaptive control systems, rather than discrete state spaces, which is much closer to what PCT normally addresses.

"Reinforcement learning and optimal adaptive control: An overview and implementation examples"

https://pdfs.semanticscholar.org/b4cf/9d3847979476af5b1020c61367fa03626d22.pdf



RY: I'm still looking at it so not sure what it is doing yet. Are you familiar with this approach?

      To clarify, it is not using RL for adaptive control systems instead of discrete state spaces. The critic is using state and action spaces with a value function to advise the actor (which is a PD controller I think?). So, it still uses state action pairings and appropriate values of those to motivate changes in behaviour.

Does that make sense? Sorry, felt a bit ramble-y.

Erm, what do the states actually represent? And what is meant by actions in this case of an adaptive controller, are they changes to the parameters on control (e.g. PD gains)?

Regards,

Rupert
···

On 29/09/2017 17:10, B Hawker wrote:

[Martin Taylor 2017.09.30.11.26]

PCT is also an adaptive system in that same sense. Organisms growing up in different environments develop different perceptual functions and connect them to different outputs to produce controllable perceptions that help maintain the intrinsic variables near their optima. A city dweller doesn't easily perceive deer tracks in the bush or know how to get lunch when he can perceive the tracks; a hunter from the bush doesn't easily perceive when it is safe to cross the street at a city intersection, or where to get lunch when he can cross the street safely.
PCT builds perceptual functions and connects up control units by
reorganization rather than reinforcement. The difference between the
two concepts is analogous to the difference between “Come here” and
“Don’t go there”. Reinforcement says “what you did was good”,
whereas reorganization says “What you are doing isn’t bad.” In other
analogous mottos, Reinforcement is “Getting better every day”,
whereas PCT is “If it ain’t broke, don’t fix it.” There’s no
technical reason why both could not work together in building more
effective systems, just as in the brain there seem to be mutually
supportive systems of perception “It’s like this” and “it’s
different from that”, or at higher levels “intuition” and
“analysis”.
Why is that a downside? It's built into the very heart of PCT. There's no prohibition in PCT against one control system seeing a rock coming at your head and sending a reference signal to a lower level set of systems to move your head to a position at which the rock is seen to be missing you and going past.

I have no idea how complex, but I do think it needs prior experience in order to reorganize those perceptions and the corresponding controls. Before you have ever been hit by something you could have seen coming while you were balancing, you would not counter it in advance under either adaptive development system.
Why do you say that? In the book I am currently trying to write
(working title “Powers of Perceptual Control” – pun intended), I
have two whole chapters on how perceptions of different aspects of
language would be expected to develop. Language is a tricky example,
which is why I used it. I have been drafting also a response in some
detail to your earlier comment “”, a comment
that greatly surprised me.
Exactly! That’s a feature, not a bug. It enables ranges of behaviour
that control ranges of perceptions, some of which serve to maintain
the intrinsic variables in good condition, others of which don’t.
The latter tend to get pruned, the former tend to stick around. But
the word “choose” is unfortunate, as it suggests some directed
control process, whereas in PCT it is the effects of interactions
with the environment that “choose”, as part of the reorganization
process. Or are you talking about designed robots only?
What engineer? The Earth Mother who directs evolution? It’s true
that humans will not find behaviours that require massive electrical
impulses that kill dangerous predators, and that eels will not find
much use for levers their muscles cannot move. Is that a hit against
PCT? How would reinforcement handle those problems?
True. The sensor systems must vary what they report to create any
and all perceptions that might ever be controlled, and we can be
killed by things our sensors do not report, such as X-ray or gamma
radiation. Is there a difference between PCT and reinforcement
learning in this respect?
Reorganization doesn’t actually stop, either. It just slows greatly
when control is good and the intrinsic variables stay near their
optimum values. In everyday terms (as I said above) “If it ain’t
broke, don’t fix it.” Put another way, if you are feeling good about
yourself and the way things are, you won’t be a revolutionary. But
even then, small changes allowed by what (using a quantum physics
analogy) we might call “zero-point” reorganization, good things
might get even better. If the small zero-point changes make things
worse, the tendency is to return them to the way they were, in
effect creating a stopping point.
Well, that's exactly what a living control system is supposed to do. It starts life with the systems evolution has provided (humans can't walk at birth, newborn deer can). Check out the work of Franz Plooij, for example, for stages of human development interpreted as stages in building control hierarchies.

Thanks for that. I was not at all sure I had the right end of the stick.
Martin

···
    On 2017/09/30 6:57 AM, B Hawker wrote:

  It really would be nice if you would start your messages with a time-stamp ID. The above line is unlikely to be the same on all the computers receiving your message. On mine, it's my local time. An Australian would have a different local time, so if I want to refer to this message rather than one of yours yesterday, I have no way to indicate which messages I am commenting on. In this case, I just use the word "earlier", below. But someone interested in putting the quote in context would have to search the body of all your earlier messages to find it.

            MT: Not knowing much about it, what you say and what I read elsewhere about reinforcement learning strikes me as avoiding the specific thing that PCT does differently and well. Reinforcement could quite probably do the equivalent of adjusting the linkages among "action components" (in PCT, lower-level control systems) that PCT assigns to reorganization, and therefore might be able to build the structure. But it seems to depend on having essentially perfect knowledge of the environment state in order to produce the "right" action at any moment.

          BH: You're on the right lines here, but it doesn't need perfect knowledge. Recent algorithms are able to capably solve POMDPs (Partially Observable Markov Decision Processes). Furthermore, Deep Learning with Reinforcement Learning has been able to solve complex motor tasks which involve exploring unknown possibilities. Was it easy for them? Not at all, tons of training data required. But you're on the right lines that RL requires a lot of training time with good data before it can make good choices.

            MT: PCT, on the other hand, works on the basis that things are happening in the environment of which the system knows nothing in advance. When the environmental state changes, PCT just does more or less of whatever is working, at least at the level of continuous variables. There's no sense of having to search for the proper way to deal with the minimally changed circumstances as things evolve, though choice is often required, it is among a low integer number of possibilities, such as "shall I walk, cycle, drive, or take the bus?", not to find a match to the current conditions among the large number of possible states that could result in pre-learned actions.

          BH: PCT is indeed a reactive controller (which is the useful term I've found used) and RL is more of an adaptive function (it learns how to adapt to the environment, but isn't built on learning the reactive behaviour to simply reduce error, for example). Obviously, there are downsides to both. RL requires a lot of training time to learn adaptive behaviour, and PCT isn't able to do anticipatory control without extra perceptions and probably a variety of levels to the hierarchy. For example, anticipating an impact while balancing: before it hits, you can lean forward to improve your balance post-impact. This is an example of something that is not innate to reactive controllers and is thus quite a difficult problem. While solvable under PCT, it definitely isn't something easy to derive and requires complex perceptions.

            MT: Maybe my understanding of reinforcement learning is all wrong, but the way I see it at the moment, the big difference is one of magnitude. Instead of an exponentially explosive number of state-action pairs that must continually be learned in advance and then re-coordinated as unpredicted things happen in the environment, PCT develops a repertoire of processes that know not much more than whether to increase or decrease their output at any moment. Rather than the exponentially large number of state-action pairs – which could, in principle but I think not in practice, do the same job – PCT is based on linearly additive numbers of possibilities for evoking the kinds of action that might be required for different kinds of unexpected variations in the environment.

          BH: So to speak, yes. The behaviour in RL is encoded in the action-state pairs and their weightings which take ages to compute. In PCT, the behaviour is in minimising the error between the perceptual inputs and the references. The complexity there comes in at what the perceptions are, which the system itself doesn't handle.

          "It's where the perceptions come from that I think is the thing PCT doesn't answer."

          So, what PCT actually does is more simple and elegant. However, what perceptions you choose for a system massively affect the behaviour! RL just crunches every possible interaction of action and state with big data, but for PCT, a lot of this has been pruned out by simply what perceptions are selected. PCT will not find behaviour that the engineer doesn't effectively allow it to have since the complexity can only be as much as the perceptions it's given. A system could not consider or control a perception which it doesn't have as an input. While RL can, it will take a long time and a lot of good data to crunch this... as well as stopping the learning at the right time, as often it doesn't know when to stop.

          BH: So, to avoid the problem of either the engineer having to hand select perceptions and order them to produce hierarchies or to leave big data to crunch it, it would be great if there were some way to have a control system learn how to hierarchically arrange control nodes such that it can learn how to control perceptions. (Hey, that's my PhD!)

            MT: I await correction.

          BH: Not much correction at all! Just some clarification and putting in terms that help describe the phenomena you put forward from other fields :wink:

[Martin Taylor 2017.09.30.12.32]

[From Rupert Young (2017.09.30 14.00)]

(Rick Marken (2017.09.29.1545)]

  That sounds fine for an individual continuous control system, but how about when it requires changing (switching) control systems? For example, to control car speed we have to learn to switch between control systems for brake and throttle. We learn this pretty quickly so it seems unlikely to be part of the same process (of varying the parameters of control), so what is involved in learning in this case?

Rupert, what you say may be true of linear systems, but non-linear systems with feedback can have abrupt changes of effect with continuous changes of parameters. Technically, they show "catastrophe", like this fold catastrophe (which illustrates perception, but the fold idea is the same for output).

![cusp A-H2.jpg|712x413](upload://tMIca5ACWkEJ78oMjBCWyDDz0v4.jpeg)
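For reference, the standard normal form of the fold (saddle-node) makes the point explicit: a smooth change in one parameter produces an abrupt qualitative change in behaviour.

```latex
\dot{x} = a + x^{2}, \qquad x^{*} = \pm\sqrt{-a} \quad \text{(real only for } a \le 0\text{)}
```

As a increases smoothly through zero the two equilibria collide and vanish, so the behaviour of the system jumps even though no parameter jumped.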

The other possibility is the Powers idea that references are the outputs of associative memories addressed by the current outputs from higher-level control units (or, I would suggest, by a vector of current error values). That, too, could change abruptly with a continuous change of perceptual values.

Martin
···

RM: The consequences that are evaluated (better term is probably monitored) in PCT learning are the levels of error in the control systems that make up the control hierarchy. The PCT learning system doesn't know what consequences are being controlled by these control systems; it just knows whether the systems are maintaining the consequences in their reference states; if they are, error will be small, if not it will be large. It is the size of the error in control systems that drives learning. And what learning consists of is varying the parameters of the control – the gain of the output function and/or the nature of the perceptual function – until there is close to zero error in a control system once again.

[From Rupert Young (2017.09.30 20.10)]

(Martin Taylor 2017.09.30.12.32]

[From Rupert Young (2017.09.30 14.00)]

    That sounds fine for an individual continuous control system, but how about when it requires changing (switching) control systems? For example, to control car speed we have to learn to switch between control systems for brake and throttle. We learn this pretty quickly so it seems unlikely to be part of the same process (of varying the parameters of control), so what is involved in learning in this case?

  Rupert, what you say may be true of linear systems, but non-linear systems with feedback can have abrupt changes of effect with continuous changes of parameters. Technically, they show "catastrophe", like this fold catastrophe (which illustrates perception, but the fold idea is the same for output).

  ![cusp A-H2.jpg|712x413](upload://tMIca5ACWkEJ78oMjBCWyDDz0v4.jpeg)
Yes, that's a good point. Is this case showing that the output of a perceptual function, of two linear inputs, is non-linear? Would the output case be applicable to the brake/throttle example?

  The other possibility is the Powers idea that references are the outputs of associative memories addressed by the current outputs from higher-level control units (or, I would suggest, by a vector of current error values). That, too, could change abruptly with a continuous change of perceptual values.

I'd been considering this, but didn't see how a weight-adjusting reorganisation process would account for that. It would seem to me that memorising requires an instant change in the state of the control system (or perhaps locking in the current state), as opposed to gradual changes in parameters (gain, e.g. as in arm reorg in LCS III). For example, with my tea example, the first time you drink tea you may add sugar, bit by bit, repeatedly tasting, to control your desired perception of sweetness. It would be laborious, and impractical, if you had to repeat this process every time you drank tea, so you remember your perception of adding three spoonfuls, say. Next time you drink tea you control the desired sweetness by adding three spoonfuls of sugar without having to taste it.

This doesn't seem like (the same) reorganisation to me, but instant storage of a perception value, when the error is zero, to be later used as a reference. Any thoughts on this?
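A toy sketch of that idea (entirely hypothetical; the class, the zero-error test and the values are just illustrative):

```python
class MemoryReference:
    """Store the current perception the first time error reaches (near) zero,
    and replay it later as a lower-level reference -- the "three spoonfuls"
    example above."""

    def __init__(self):
        self.stored = None                      # remembered perception value

    def observe(self, perception, error, tol=1e-3):
        if self.stored is None and abs(error) < tol:
            self.stored = perception            # instant storage at zero error

    def reference(self, default=0.0):
        return self.stored if self.stored is not None else default

mem = MemoryReference()
mem.observe(perception=3.0, error=0.0)          # tasting: three spoonfuls felt right
print(mem.reference())                          # next cup: reference = 3.0 spoonfuls
```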

Rupert

Hi folks, I can see how this mountain car problem would be a great one to turn PCT to and contrast with RL and publish. I have not solved it of course but it seems simpler than it looks. It certainly reminds me of the inverted pendulum problem. It seems to me that controlling the perception of accelerating downwards (getting to increasingly high positions from which one would freefall faster and faster) should be part of the solution.

All the best,

Warren
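In that spirit, here is a rough sketch of what a single PCT-style loop for the Mountain Car could look like. It assumes the standard discrete dynamics from the Sutton and Barto / OpenAI Gym formulation, and the controlled perception (an approximate mechanical energy), the reference value and the gain are illustrative choices rather than a worked-out hierarchy.

```python
import math

def mountain_car_pct(gain=2000.0, ref_energy=0.002, max_steps=2000):
    """Single-loop sketch: the perception is an approximate mechanical energy;
    the output is an effort proportional to the energy error, applied in the
    direction of current motion so that positive effort pumps up the
    oscillation (the "freefall faster and faster" idea)."""
    x, v = -0.5, 0.0                                  # start near the valley bottom
    for step in range(max_steps):
        # perception: kinetic energy plus a potential consistent with the
        # -0.0025*cos(3x) gravity term in the standard dynamics
        p = 0.5 * v * v + (0.0025 / 3.0) * math.sin(3.0 * x)
        error = ref_energy - p
        effort = max(-1.0, min(1.0, gain * error))    # bounded output
        action = effort if v >= 0.0 else -effort      # thrust with the motion
        # standard mountain-car update
        v += 0.001 * action - 0.0025 * math.cos(3.0 * x)
        v = max(-0.07, min(0.07, v))
        x += v
        if x < -1.2:
            x, v = -1.2, 0.0                          # left wall
        if x >= 0.5:                                  # reached the flag
            return step
    return None

print(mountain_car_pct())   # typically reaches the flag within a few hundred steps
```

A proper PCT treatment would presumably put an energy-like perception under higher-level systems and let reorganization find the parameters, but even this toy loop suggests the problem need not require a large state-action table.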

···

On 30 Sep 2017, at 14:14, Rupert Young rupert@perceptualrobots.com wrote:

[From Rupert Young (2017.09.30 14.15) ]

On 29/09/2017 17:10, B Hawker wrote:

RY: Do such examples actually exist yet? I've been thinking about the Mountain Car problem (https://en.wikipedia.org/wiki/Mountain_car_problem) which would seem a good and reasonably simple learning problem, of continuous variables, that PCT should be able to address, and is analogous to your problem above. Do we have a hierarchy of error reducing nodes that can solve that problem?

      Nope. Very complicated problem. Would take a long time to derive the hierarchy, it's got a lot of complex perceptions there.

Is it that bad? Learning aside, I would have thought a hierarchy could be manually constructed to control such variables as position, speed and something that represents the oscillation of the car up each side of the valley.

RY: On a slightly different matter I came across this paper which uses RL for adaptive control systems, rather than discrete state spaces, which is much closer to what PCT normally addresses.

"Reinforcement learning and optimal adaptive control: An overview and implementation examples"

https://pdfs.semanticscholar.org/b4cf/9d3847979476af5b1020c61367fa03626d22.pdf



RY: I'm still looking at it so not sure what it is doing yet. Are you familiar with this approach?

      To clarify, it is not using RL for adaptive control systems instead of discrete state spaces. The critic is using state and action spaces with a value function to advise the actor (which is a PD controller I think?). So, it still uses state action pairings and appropriate values of those to motivate changes in behaviour.

Does that make sense? Sorry, felt a bit ramble-y.

Erm, what do the states actually represent? And what is meant by actions in this case of an adaptive controller, are they changes to the parameters on control (e.g. PD gains)?

Regards,

Rupert

[From Bruce Abbott (2017.10.01.1050 EDT)]

I may already have a demo that “solves” this problem, although it is based on the inverted pendulum rather than the mountain car example. Starting with the pendulum hanging straight down, it brings the pendulum up to the inverted position. The solution involves a very minor change to the original inverted pendulum demo (as featured in LCS III).

I placed “solves” in quotes because the inverted pendulum demo does not learn to do its thing; the solution is provided by the designer in the form of hierarchical control systems. But the self-inverting version demonstrates what the end configuration of a solution found via reorganization or RL might look like.

The self-inverting pendulum demo can be downloaded at

https://sites.google.com/site/perceptualcontroldemos/home/available-demo-files

The relevant file is InvtPend2.zip. The file includes Delphi source code and the executable (.exe). The demo runs on PCs and should run on Macs with a PC emulator.

Bruce

···

From: Warren Mansell [mailto:wmansell@gmail.com]
Sent: Sunday, October 1, 2017 5:15 AM
To: csgnet@lists.illinois.edu
Subject: Re: Reinforcement Learning

Hi folks, I can see how this mountain car problem would be a great one to turn PCT to and contrast with RL and publish. I have not solved it of course but it seems simpler than it looks. It certainly reminds me of the inverted pendulum problem. It seems to me that controlling the perception of accelerating downwards (getting to increasingly high positions from which one would freefall faster and faster) should be part of the solution.

All the best,

Warren

On 30 Sep 2017, at 14:14, Rupert Young rupert@perceptualrobots.com wrote:

[From Rupert Young (2017.09.30 14.15) ]

On 29/09/2017 17:10, B Hawker wrote:

RY: Do such examples actually exist yet? I’ve been thinking about the Mountain Car problem (https://en.wikipedia.org/wiki/Mountain_car_problem) which would seem a good and reasonably simple learning problem, of continuous variables, that PCT should be able to address, and is analogous to your problem above. Do we have a hierarchy of error reducing nodes that can solve that problem?

Nope. Very complicated problem. Would take a long time to derive the hierarchy, it’s got a lot of complex perceptions there.

Is it that bad? Learning aside, I would have thought a hierarchy could be manually constructed to control such variables as position, speed and something that represents the oscillation of the car up each side of the valley.

RY: On a slightly different matter I came across this paper which uses RL for adaptive control systems, rather than discrete state spaces, which is much closer to what PCT normally addresses.
“Reinforcement learning and optimal adaptive control: An overview and implementation examples”
https://pdfs.semanticscholar.org/b4cf/9d3847979476af5b1020c61367fa03626d22.pdf

RY: I’m still looking at it so not sure what it is doing yet. Are you familiar with this approach?

To clarify, it is not using RL for adaptive control systems instead of discrete state spaces. The critic is using state and action spaces with a value function to advise the actor (which is a PD controller I think?). So, it still uses state action pairings and appropriate values of those to motivate changes in behaviour.

Does that make sense? Sorry, felt a bit ramble-y.

Erm, what do the states actually represent? And what is meant by actions in this case of an adaptive controller, are they changes to the parameters on control (e.g. PD gains)?

Regards,
Rupert

[From Rick Marken (2017.10.01.1250)]

···

On Fri, Sep 29, 2017 at 4:18 PM, B Hawker bhawker1@sheffield.ac.uk wrote:

RM: You may be right. But it wasn’t apparent to me that that was the case. It looked to me like reinforcement learning (which I presume is what Q-learning is) was used to select actions. Producing particular actions is not the same as producing controlled results – results of action that are brought to and maintained in pre-selected states in the face of unpredictable (and usually undetectable) variations in disturbances.

RM: I would be convinced that reinforcement learning could be used to learn to control if I could see a reinforcement learning algorithm used to learn to do a simple control task: the compensatory tracking task: www.mindreadings.com/ControlDemo/BasicTrack.html. We know that the PCT reorganization algorithm can learn to control (see Demo 8 in the list of demos that Bruce Abbott just posted: https://sites.google.com/site/perceptualcontroldemos/home/available-demo-files).

RM: The reason I think reinforcement learning can't be used to learn control is because the idea of reinforcement learning is based on a misconception about the nature of behavior. The concept of reinforcement is based on the idea that behavior is emitted output; what are called actions, like pressing bars or pecking keys. Reinforcements are consequences of these actions that appear to increase their probability. PCT shows that behavior is a process of control where actions are the means of keeping variables in organism-defined reference states, protected from disturbances. So what have been seen as reinforcements turn out to be controlled variables. These "reinforcements" aren't strengthening actions; they are being maintained in reference states by those actions (see Yin, H. (2013) Restoring purpose to behavior, in Baldassarre and Mirolli (eds.), Computational and Robotic Models of the Hierarchical Organization of Behavior, Springer-Verlag, Berlin Heidelberg; particularly Figure 5).

RM: Anyway, if you can produce a demonstration of reinforcement learning resulting in learning a control task I will withdraw my criticisms (and be all astonishment, as the “bad” characters in Jane Austen novels always say;-)

Best

Rick

BH: Well, I’m afraid that isn’t the case. The literature clearly shows that reinforcement learning can do such, because the encoding of the state encodes information which allows it to vary the choice in actions. Therefore, it is more than able to control by considering relevant features it needs to control in state-action pairings (i.e. Q-Learning).

 RM: It seems to me that that could be done rather quickly. Reinforcement refers to strengthening; what is strengthened is a causal link; the probability that event A causes event B, where event B is some “action” or “behavior”. So reinforcement learning can be used to make some action more probable than others. It can’t be used to teach control since increasing the probability of an action will not result in the ability to vary actions appropriately to control the consequence of those actions in a disturbance prone world.

BH: I think you're confusing what reinforcement learning is, as a mathematical tool, with what reinforcement is as a concept. Reinforcement learning allows an agent to learn suitable actions (so, for example, a specific parameter set of a controller) in some particular state (what your input and reference are, or other details of the world) given some policy (we want to minimise the rate of error). Reinforcement learning doesn't just "strengthen" things, it's an action selection tool based on state information.

RM: The consequences that are evaluated (better term is probably monitored) in PCT learning are the levels of error in the control systems that make up the control hierarchy. The PCT learning system doesn't know what consequences are being controlled by these control systems; it just knows whether the systems are maintaining the consequences in their reference states; if they are, error will be small, if not it will be large. It is the size of the error in control systems that drives learning. And what learning consists of is varying the parameters of the control – the gain of the output function and/or the nature of the perceptual function – until there is close to zero error in a control system once again.

RM: You might think reinforcement learning could be used to teach control systems how to control better by "strengthening" their gain. But increasing gain will not always improve control (a fast control system has to have lower gain than a slow one).

Richard S. Marken

"Perfection is achieved not when you have nothing more to add, but when you
have nothing left to take away."
                --Antoine de Saint-Exupery

[From Rick Marken (2017.10.01.1255)]

···

Bruce Abbott (2017.10.01.1050 EDT)

I may already have a demo that "solves" this problem, although it is based on the inverted pendulum rather than the mountain car example. Starting with the pendulum hanging straight down, it brings the pendulum up to the inverted position. The solution involves a very minor change to the original inverted pendulum demo (as featured in LCS III).

I placed "solves" in quotes because the inverted pendulum demo does not learn to do its thing; the solution is provided by the designer in the form of hierarchical control systems. But the self-inverting version demonstrates what the end configuration of a solution found via reorganization or RL might look like.

RM: This sounds great but could you explain it a little better. Did you add a reorganization algorithm to the code? If so, where is it in the code? What does "self inverting" mean? Where is that in the code?

Best

Rick

The self-inverting pendulum demo can be downloaded at

https://sites.google.com/site/perceptualcontroldemos/home/available-demo-files

The relevant file is InvtPend2.zip. The file includes Delphi source code and the executable (.exe). The demo runs on PCs and should run on Macs with a PC emulator.

Bruce

From: Warren Mansell [mailto:wmansell@gmail.com]
Sent: Sunday, October 1, 2017 5:15 AM
To: csgnet@lists.illinois.edu
Subject: Re: Reinforcement Learning

Hi folks, I can see how this mountain car problem would be a great one to turn PCT to and contrast with RL and publish. I have not solved it of course but it seems simpler than it looks. It certainly reminds me of the inverted pendulum problem. It seems to me that controlling the perception of accelerating downwards (getting to increasingly high positions from which one would freefall faster and faster) should be part of the solution.

All the best,

Warren

On 30 Sep 2017, at 14:14, Rupert Young rupert@perceptualrobots.com wrote:

[From Rupert Young (2017.09.30 14.15)]

On 29/09/2017 17:10, B Hawker wrote:

RY: Do such examples actually exist yet? I've been thinking about the Mountain Car problem (https://en.wikipedia.org/wiki/Mountain_car_problem) which would seem a good and reasonably simple learning problem, of continuous variables, that PCT should be able to address, and is analogous to your problem above. Do we have a hierarchy of error reducing nodes that can solve that problem?

Nope. Very complicated problem. Would take a long time to derive the hierarchy, it's got a lot of complex perceptions there.

Is it that bad? Learning aside, I would have thought a hierarchy could be manually constructed to control such variables as position, speed and something that represents the oscillation of the car up each side of the valley.

RY: On a slightly different matter I came across this paper which uses RL for adaptive control systems, rather than discrete state spaces, which is much closer to what PCT normally addresses.
"Reinforcement learning and optimal adaptive control: An overview and implementation examples"
https://pdfs.semanticscholar.org/b4cf/9d3847979476af5b1020c61367fa03626d22.pdf

RY: I'm still looking at it so not sure what it is doing yet. Are you familiar with this approach?

To clarify, it is not using RL for adaptive control systems instead of discrete state spaces. The critic is using state and action spaces with a value function to advise the actor (which is a PD controller I think?). So, it still uses state action pairings and appropriate values of those to motivate changes in behaviour.

Does that make sense? Sorry, felt a bit ramble-y.

Erm, what do the states actually represent? And what is meant by actions in this case of an adaptive controller, are they changes to the parameters on control (e.g. PD gains)?

Regards,
Rupert

Richard S. Marken

"Perfection is achieved not when you have nothing more to add, but when you
have nothing left to take away."
                --Antoine de Saint-Exupery

[From Bruce Abbott (2017.10.01.1645 EDT)]

Rick Marken (2017.10.01.1255) –

Bruce Abbott (2017.10.01.1050 EDT)

I may already have a demo that "solves" this problem, although it is based on the inverted pendulum rather than the mountain car example. Starting with the pendulum hanging straight down, it brings the pendulum up to the inverted position. The solution involves a very minor change to the original inverted pendulum demo (as featured in LCS III).

I placed "solves" in quotes because the inverted pendulum demo does not learn to do its thing; the solution is provided by the designer in the form of hierarchical control systems. But the self-inverting version demonstrates what the end configuration of a solution found via reorganization or RL might look like.

RM: This sounds great but could you explain it a little better. Did you add a reorganization algorithm to the code? If so, where is it in the code? What does “self inverting” mean? Where is that in the code?

BA: No learning is involved, neither reorganization nor RL. I simply altered the existing code from the original demo a very slight bit to make the pendulum self-inverting.

BA: If you download and run the program, you will see what self-inverting means. Try it!

BA: I'd be interested in hearing from anyone who has run this magical demo. Without looking at the source code, can you guess how self-inversion was achieved? The solution was so simple I was surprised when it actually worked!

Bruce

···

The self-inverting pendulum demo can be downloaded at

https://sites.google.com/site/perceptualcontroldemos/home/available-demo-files

The relevant file is InvtPend2.zip. The file includes Delphi source code and the executable (.exe). The demo runs on PCs and should run on Macs with a PC emulator.

Bruce

From: Warren Mansell [mailto:wmansell@gmail.com]
Sent: Sunday, October 1, 2017 5:15 AM
To: csgnet@lists.illinois.edu
Subject: Re: Reinforcement Learning

Hi folks, I can see how this mountain car problem would be a great one to turn PCT to and contrast with RL and publish. I have not solved it of course but it seems simpler than it looks. It certainly reminds me of the inverted pendulum problem. It seems to me that controlling the perception of accelerating downwards (getting to increasingly high positions from which one would freefall faster and faster) should be part of the solution.

All the best,

Warren

On 30 Sep 2017, at 14:14, Rupert Young rupert@perceptualrobots.com wrote:

[From Rupert Young (2017.09.30 14.15) ]

On 29/09/2017 17:10, B Hawker wrote:

RY: Do such examples actually exist yet? I’ve been thinking about the Mountain Car problem (https://en.wikipedia.org/wiki/Mountain_car_problem) which would seem a good and reasonably simple learning problem, of continuous variables, that PCT should be able to address, and is analogous to your problem above. Do we have a hierarchy of error reducing nodes that can solve that problem?

Nope. Very complicated problem. Would take a long time to derive the hierarchy, it’s got a lot of complex perceptions there.

Is it that bad? Learning aside, I would have thought a hierarchy could be manually constructed to control such variables as position, speed and something that represents the oscillation of the car up each side of the valley.

RY: On a slightly different matter I came across this paper which uses RL for adaptive control systems, rather than discrete state spaces, which is much closer to what PCT normally addresses.
“Reinforcement learning and optimal adaptive control: An overview and implementation examples”
https://pdfs.semanticscholar.org/b4cf/9d3847979476af5b1020c61367fa03626d22.pdf

RY: I’m still looking at it so not sure what it is doing yet. Are you familiar with this approach?

To clarify, it is not using RL for adaptive control systems instead of discrete state spaces. The critic is using state and action spaces with a value function to advise the actor (which is a PD controller, I think?). So, it still uses state-action pairings, and the values assigned to those pairings, to motivate changes in behaviour.

Does that make sense? Sorry, felt a bit ramble-y.

Erm, what do the states actually represent? And what is meant by actions in this case of an adaptive controller? Are they changes to the parameters of the controller (e.g. PD gains)?

Regards,
Rupert

Richard S. Marken

"Perfection is achieved not when you have nothing more to add, but when you
have nothing left to take away.�
–Antoine de Saint-Exupery

[From Rupert Young (2017.10.02 12.15)]

[Bruce Abbott (2017.10.01.1645 EDT)]

BA: No learning is involved, neither reorganization nor RL. I simply altered the existing code from the original demo very slightly to make the pendulum self-inverting.

BA: If you download and run the program, you will see what self-inverting means. Try it!

BA: I’d be interested in hearing from anyone who has run this magical demo. Without looking at the source code, can you guess how self-inversion was achieved? The solution was so simple I was surprised when it actually worked!

Awesome demo Bruce, I love it!

The trick is very neat, and interesting (though I had to look at the code). Is 0.8 the ground level?

Do you think it is valid, in terms of PCT, as it is introducing a perception by the back door? I think that perception could be taken out into its own control system, which would then switch between two lower systems, in a similar way to the brake/throttle scenario?

Rupert

[From Rupert Young (2017.10.02 13.45)]

[Rick Marken (2017.10.01.1250)]

Isn't this a matter of terminology? What they are calling selecting “actions”, we might call selecting sub-goals (to bring about a resultant higher goal). It seems to me that in RL they are selecting “actions” in order to produce results, which they describe (or measure) in terms of reward.

In tic-tac-toe, if there are two X's in a line and you have an option of a number of different moves, you select the one that gives you the result/goal of three X’s in a line. This applies to both PCT and RL. There may be other reasons to discount RL mechanisms, but there seem to be similarities with PCT at this level, in that it is the results that are being “controlled” by said selection.
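As a toy illustration of the two framings above (the function names and the hand-coded reward are mine, purely for illustration, not anything from RL practice or from a PCT model):

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def completes_three_xs(board, move):
    # Would placing 'X' at `move` produce three X's in a line?
    trial = board[:]
    trial[move] = 'X'
    return any(all(trial[i] == 'X' for i in line) for line in LINES)

def goal_style_choice(board):
    # "Controlled result" framing: pick the move whose perceived outcome
    # matches the reference "three X's in a line" (zero error).
    for move in (i for i, c in enumerate(board) if c is None):
        if completes_three_xs(board, move):
            return move
    return None

def reward_style_choice(board, reward_fn):
    # "Reward" framing: pick the action with the highest (hand-coded) reward.
    empties = [i for i, c in enumerate(board) if c is None]
    return max(empties, key=lambda m: reward_fn(board, m))

board = ['X', 'X', None, 'O', 'O', None, None, None, None]
reward = lambda b, m: 1.0 if completes_three_xs(b, m) else 0.0
print(goal_style_choice(board), reward_style_choice(board, reward))  # both pick cell 2

Both framings select the same move; the difference is whether the selection is described as achieving a referenced result or as maximising a reward signal.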

Do you have any thoughts on how playing tic-tac-toe would be described in a PCT context? I think it would be very useful to do this, as it would give some insight into program-level perception control, which is what has largely concerned traditional AI (including RL).

Regards,

Rupert
···

RM: You may be right. But it wasn’t apparent to me that that was the case. It looked to me like reinforcement learning (which I presume is what Q-learning is) was used to select actions. Producing particular actions is not the same as producing controlled results – results of action that are brought to and maintained in pre-selected states in the face of unpredictable (and usually undetectable) variations in disturbances.
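A small numerical sketch of the distinction (my own toy example, not taken from any of the demos): a closed-loop system keeps its controlled variable near a pre-selected state despite unpredictable disturbances, whereas a system that simply emits a fixed action, however well learned, does not:

import random

random.seed(1)
reference = 10.0     # pre-selected state for the controlled variable
controlled = 10.0    # closed-loop system: acts on its perceived error
blind = 10.0         # open-loop system: emitted its action once, then stopped sensing
for _ in range(500):
    d = random.uniform(-1.0, 1.0)                  # unpredictable disturbance
    controlled += 0.5 * (reference - controlled) + d
    blind += d                                     # disturbances accumulate uncorrected

print(f"closed loop: {controlled:.1f}   open loop: {blind:.1f}")

The closed-loop value stays near 10 because each disturbance is opposed as soon as it shows up in the perception; the open-loop value random-walks away.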

[From Bruce Abbott (2017.10.02.0945 EDT)]

Rupert Young (2017.10.02 12.15) –

[Bruce Abbott (2017.10.01.1645 EDT)]

BA: No learning is involved, neither reorganization nor RL. I simply altered the existing code from the original demo very slightly to make the pendulum self-inverting.

BA: If you download and run the program, you will see what self-inverting means. Try it!

BA: I’d be interested in hearing from anyone who has run this magical demo. Without looking at the source code, can you guess how self-inversion was achieved? The solution was so simple I was surprised when it actually worked!

RY: Awesome demo Bruce, I love it!

BA: Thanks!

RY: The trick is very neat, and interesting (though I had to look at the code). Is 0.8 the ground level?

BA: No, ground level is zero. The pendulum is one meter long, so when hanging straight down, the bob is at Y = -1.0, and +1.0 when standing straight up.

RY: Do you think it is valid, in terms of PCT, as it is introducing a perception by the back door?

BA: Not as currently implemented. But I don’t want to discuss this issue too much at this point, as I want to give others a chance at guessing the solution.

RY: I think that perception could be taken out into its own control system, which would then switch between two lower systems, in a similar way to the brake/throttle scenario?

BA: Yes, certainly, but at the cost of increased complexity, of course. Would we really need two nearly identical lower systems that differ only in you-know-what? Or could the higher system manipulate the lower system’s you-know-what directly? What perception would the higher system be controlling for?

Bruce

[From Ben Hawker (2017.10.02 14:34) ]

···

On 30 September 2017 at 14:14, Rupert Young rupert@perceptualrobots.com wrote

RY: Do such examples actually exist yet? I’ve been thinking about the Mountain Car problem (https://en.wikipedia.org/wiki/Mountain_car_problem) which would seem a good and reasonably simple learning problem, of continuous variables, that PCT should be able to address, and is analogous to your problem above. Do we have a hierarchy of error reducing nodes that can solve that problem?

BH: Nope. Very complicated problem. It would take a long time to derive the hierarchy; it’s got a lot of complex perceptions there.

RY: Is it that bad? Learning aside, I would have thought a hierarchy could be manually constructed to control such variables as position, speed and something that represents the oscillation of the car up each side of the valley.

BH: I can see how position would be set by a speed controller, but then surely you’d need multiple nodes to handle the oscillatory effect you put forward? I guess what you’re basically putting forward is a perception of moving as far as possible in either direction (or a setup to increase momentum, where increasing momentum is the goal). Either way, “learning aside” will not impress anyone. The impressive part is that RL learns this and doesn’t need large amounts of complicated theory and tuning. The perceptions to solve it wouldn’t be too outrageous. But having a system learn which perceptions to use, in what order, and then balance them is a mammoth task. Perhaps I’m wrong; feel free to prove it! I’d like to be proved wrong on that.
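For comparison, here is a minimal, self-contained sketch of the kind of tabular Q-learning usually applied to the mountain car. The dynamics follow the standard Sutton and Barto formulation; the discretisation and hyperparameters are illustrative guesses rather than a tuned solution:

import math, random

random.seed(0)
N_POS, N_VEL = 24, 24
ACTIONS = (-1, 0, 1)                    # push left, coast, push right
Q = [[[0.0] * 3 for _ in range(N_VEL)] for _ in range(N_POS)]

def discretise(pos, vel):
    # Map the continuous state onto a coarse grid of (position, velocity) bins.
    i = int((pos + 1.2) / 1.8 * (N_POS - 1))
    j = int((vel + 0.07) / 0.14 * (N_VEL - 1))
    return i, j

def step(pos, vel, a):
    vel += 0.001 * a - 0.0025 * math.cos(3 * pos)
    vel = max(-0.07, min(0.07, vel))
    pos = max(-1.2, min(0.6, pos + vel))
    if pos == -1.2:
        vel = 0.0
    return pos, vel, pos >= 0.5

alpha, gamma, eps = 0.1, 0.99, 0.1
for episode in range(500):
    pos, vel = random.uniform(-0.6, -0.4), 0.0
    for _ in range(1000):
        i, j = discretise(pos, vel)
        a = (random.randrange(3) if random.random() < eps
             else max(range(3), key=lambda k: Q[i][j][k]))
        pos, vel, done = step(pos, vel, ACTIONS[a])
        i2, j2 = discretise(pos, vel)
        target = -1.0 + (0.0 if done else gamma * max(Q[i2][j2]))   # -1 per step until the goal
        Q[i][j][a] += alpha * (target - Q[i][j][a])
        if done:
            break

# Greedy rollout after training (may or may not reach the goal with these settings).
pos, vel = -0.5, 0.0
for t in range(1000):
    i, j = discretise(pos, vel)
    pos, vel, done = step(pos, vel, ACTIONS[max(range(3), key=lambda k: Q[i][j][k])])
    if done:
        break
print(f"greedy rollout ends at x = {pos:.2f} after {t} steps")

Nothing here is tuned or structured by hand beyond the grid and the reward of -1 per step, which is Ben's point: the learner only ever sees states, actions and that reward.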

BH: To clarify, it is not using RL for adaptive control systems instead of discrete state spaces. The critic is using state and action spaces with a value function to advise the actor (which is a PD controller, I think?). So, it still uses state-action pairings, and the values assigned to those pairings, to motivate changes in behaviour.

Does that make sense? Sorry, felt a bit ramble-y.

RY: Erm, what do the states actually represent? And what is meant by actions in this case of an adaptive controller? Are they changes to the parameters of the controller (e.g. PD gains)?

BH: The maths in this paper is mildly diabolical, and the descriptions of many of the variables are scattered to the four winds, so I am not 100% sure. U1 and U2 seem to be what the controller is optimising; as there are two parameters, these are presumably the P and D gains? A bit of a leap, but it seems logical. The X series seem to be the control inputs, and the e series the errors. So, the states are made up of Q-values for every q*(x,u,d). Note that d here doesn’t mean the D gain in PD; d is merely a discount factor. So, each state comprises an X (the control input), a U (the gains) and a D (some discount factor). The control input is presumably the position and velocity of the two joints.
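To make that reading concrete, here is a toy sketch of the interpretation described above (mine, not the algorithm from the cited paper): a tabular critic scores gain-adjustment "actions" against tracking cost on a made-up first-order plant, so the "state" is the current gain setting and the "actor" is just a plain P controller:

import random

random.seed(0)
GAINS = [0.1 * k for k in range(1, 11)]     # candidate P gains, 0.1 .. 1.0
ACTIONS = (-1, 0, 1)                        # lower / keep / raise the gain
Q = [[0.0] * 3 for _ in GAINS]              # critic: value of each (gain, adjustment) pair

def episode_cost(kp, steps=50):
    # Track a unit reference with a made-up discrete first-order plant.
    y, cost = 0.0, 0.0
    for _ in range(steps):
        u = kp * (1.0 - y)                  # the "actor": a plain P controller
        y += 0.2 * (u - 0.1 * y)            # toy plant dynamics
        cost += (1.0 - y) ** 2
    return cost

alpha, gamma, eps, g = 0.2, 0.9, 0.2, 4     # start at gain index 4 (kp = 0.5)
for _ in range(500):
    a = (random.randrange(3) if random.random() < eps
         else max(range(3), key=lambda k: Q[g][k]))
    g2 = min(max(g + ACTIONS[a], 0), len(GAINS) - 1)
    reward = -episode_cost(GAINS[g2])       # better tracking gives a less negative reward
    Q[g][a] += alpha * (reward + gamma * max(Q[g2]) - Q[g][a])
    g = g2

print(f"gain the critic has drifted towards: {GAINS[g]:.1f}")

Whether the paper's x, u and d map onto plant state, gains and discount factor exactly as above is a guess; the sketch is only meant to show what "state-action pairings advising a controller" can mean in this setting.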