Negative feedback too simple (was Re: W. Ross Ashby: Digital Archive)

[Martin Taylor 2014.01.28.11.27]

Tim Coomb 2014.01.28.16.04
Thanks Martin. As someone only recently starting to understand PCT, could you explain what you mean by the term 'perceptual function' that you use in the third paragraph? If we take figure 2-3 on page 28 of LCS III, are you just referring to the input function, or do you mean something wider?

Yes, it is what is called the "Input Function" in that figure.

I prefer the name "Perceptual function" or "Perceptual Input Function" (sometimes I just use "PIF") for two reasons. The first reason is that it produces a perceptual signal. I suspect Bill used "Input Function" to emphasize that control systems control their inputs and not their outputs.

The second reason is that, though it is not shown in the figure, the control loop has two "input functions", not just the PIF. Since the reference value is influenced by the scalar output values of many higher-level control units, there has also to be some kind of "Reference Input Function" (RIF) that determines how the single scalar value of the reference signal depends on those many higher-level output values.

Some demos of multilevel systems use simple summation as the RIF; the reference value is the sum of the outputs from the higher-level control units. A more sophisticated kind of summation uses different weights, including possibly negative weights, for the different higher level outputs. Those weights are changed by reorganization if the simulation includes reorganization. But in Fig 15-3 of B:CP, Bill shows a more subtle kind of RIF, an associative memory. This memory provides as a reference value a remembered perceptual value. The various contributing outputs from the higher-level control units provide an addressing vector for this memory, so that the control unit has a reference to produce whatever perceptual value was stored at that address at some past time. It provides a reference value for a perception in a particular historical context.
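To make the distinction concrete, here is a minimal sketch in Python of those two kinds of RIF. It is illustration only, not anything from Powers' demos: the function and class names, the quantization scheme, and the parameter values are all invented for the example.

    # Two possible forms of Reference Input Function (RIF), sketched for
    # illustration only; names and values are not from any published demo.
    import numpy as np

    def rif_weighted_sum(higher_outputs, weights):
        # Reference value as a weighted (possibly negatively weighted) sum of
        # the scalar outputs of the contributing higher-level control units.
        return float(np.dot(weights, higher_outputs))

    class AssociativeMemoryRIF:
        # Reference value recalled from memory, addressed by the pattern of
        # higher-level outputs -- a crude stand-in for Fig 15-3 of B:CP.
        def __init__(self, resolution=0.5):
            self.store = {}              # address tuple -> remembered perceptual value
            self.resolution = resolution

        def _address(self, higher_outputs):
            # Quantize the output vector so that nearby contexts share an address.
            return tuple(int(round(v / self.resolution)) for v in higher_outputs)

        def remember(self, higher_outputs, perceptual_value):
            self.store[self._address(higher_outputs)] = perceptual_value

        def recall(self, higher_outputs, default=0.0):
            return self.store.get(self._address(higher_outputs), default)

In a fuller model, the weights of the first form would be what reorganization adjusts, and the second form would simply return whatever perceptual value was stored for a similar pattern of higher-level outputs, giving a reference for a perception in a particular historical context.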

The primary difference between a PIF and an RIF is that the PIF exists as part of the control loop for a single elementary control unit (ECU), whereas the RIF is outside the loop for its specific ECU, though it is part of many higher-level control loops.

Just as there are lots of possible forms of PIF, there are lots of possible forms of RIF, and in both cases, the form may depend on the level in the hierarchy under consideration. So far as I know, there has been little or no research on this question.

Anyway, the fact that there are two equally important input functions for each ECU is the main reason why I treat the word "perceptual" rather than "input" as the primary modifier of "function" when we are dealing with inputs. Not everyone does the same, but I think it reduces possible ambiguity when we treat structures of ECUs rather than individual ones.

Martin

[Philip 2014.01.29.08.00]

Let me see yours and Mr Kennaway's work concerning information theory. Mine will soon follow (I've been working on it every day).

The Riemann zeta function will get me a Fields Medal, but reinventing anti-gravity will get me to other galaxies (I seriously don't want to give you an exaggerated picture of my accomplishments, but my work will warrant a Nobel Prize in every field simultaneously, plus the Fields Medal). I think after I finish my work, the world will be a different place. I don't think the Nobel Prize will exist anymore, simply because there will be no more occasion to warrant it. This is the end of the game. Final level, final boss. You either beat the game and it's done, or you lose and start from level 1.

The door to the skylight of the basement floor.

Consciousness is the controlled perception of the purpose (or reference values).

I don’t see the resemblance in meaning. One of them is an object of perception, the other is an operation on the perception.

Word salad. I don't understand these figures of speech. Do you think I'm saying random words? Do you want me to grill you a word steak instead? Is there something peculiar about word salads? Maybe some sort of mixing of ingredients that makes them no longer make sense?

Since we're talking about word choice here,

Maybe if I just removed the words "controlled perception", it would make a more intuitive sentence:

What if I said,

consciousness is the stability of the purpose of behavior,

then perhaps we’ll not think in terms of controlled perceptions of anything and switch our focus from control->stability.

What’s the difference between these two?

I think stability is merely a coincidence whereas control is purpose?
Toroidal flow through certain geometries is stable, but do we consider tornadoes to be control systems? Do we consider controlled perceptions? No, we think in terms of thermodynamic stability.

My definition makes you think about multiplexing across control systems that would otherwise be in conflict. Yes, very good. We do the hard thing and assume they always will be, via an application of Murphy's law: anything that can go wrong will go wrong. You are right, this is something very different and very obscure, and we will make it very normal and very well understood. We are not going to shift from one controlled variable to another. We are going to stick with our purpose and make sure the universe doesn't screw with it by bringing the entire environment into our model. Obviously this is way beyond what PCT is about.

You said
if you want to create a structure in which one control system’s controlled perception is the reference value of another, that creates no problem for me. However, I think it would create a problem for the behaving structure of control systems.

I think you’re actually completely right about this and it does potentially invalidate my definition of consciousness. But what I just wrote above I think should give us a new perspective.

By the way: let's not focus too much on consciousness and reorganization in microbial life forms. I think it's very important to consider eukaryotic biologies, with both epigenetics and advanced extra-cellular signaling, because my original consciousness-PCT developments were geared towards explaining the extra-hard questions pertaining to cellular communication and differentiation. Prokaryotic life forms generally have no advanced form of communication or differentiation. Given that our language and cognition are the most essential aspects of our consciousness, I think we need to be much braver in choosing the scope of what we apply PCT to in order to fully understand it. PCT is the theory of everything. As soon as you decide that it isn't, it no longer is.


[From Rupert Young (2014.01.28 21.30 UT)]

PJY: The Riemann zeta function will get me a fields metal, but reinventing anti-gravity will get me to other galaxies (I seriously don't want to give you an exaggerated picture of my accomplishments, but my work will warrant a Nobel prize in every field simultaneously, plus the fields medal). I think after I finish my work, the world will be a different place. I don't think the nobel prize will exist anymore simply because there will be no more occasion to warrant it.

Great! Can you sort out perpetual motion while you're at it? Should be a cinch for you.

Regards,
Rupert


On Monday, January 27, 2014, Martin Taylor <mmt-csg@mmtaylor.net> wrote:

[Martin Taylor 2014.01.27.17.58]

[Philip 2014.01.28.1.20]

As an aside: It surprises me how much information was known during the time Bill invented his theory, and it also surprises me how little cross-pollination has occurred between Bill's ideas and some of the seriously groundbreaking mathematical treatises which were available at the time. If I were to suggest the first point of unification of these theories, I would direct you towards the mathematical theories of communication as developed by Shannon in the '50s. Truly excellent work. The tipping point which allowed me to eventually arrive at my definition of consciousness a few months later was Shannon's definition of the information-channel capacity. I will try to show how you could seamlessly blend perceptual control theory with information theory. If this has already been done, please let me know so I can take a look at the work. But I don't think anybody has tried as of yet.

How long have you been reading CSGnet? If it is more than a few months, you may have seen some of my attempts at tutorials on the application of information theory (Shannon, not later variants) to control. It's simply not true that nobody has tried it yet. Richard Kennaway says that he has looked into it, but (a few years ago) had not come up with formal solutions in information-theoretic terms.

[MT] Please give us links to notes, publications, drafts, or other material that describes this strengthened mathematical basis of PCT. At the moment, the best mathematical analyses we know are those provided by Richard Kennaway. It would be nice to supplement these with analyses done from a different viewpoint. And it would be even nicer to have a mathematical foundation for consciousness.

I can provide you with strong mathematical proofs as soon as I complete my required study of the entire field of mathematics. I am moving very fast, and will finish soon (but math books are very hard to read, and we're talking about pure math here - we're also talking about pure autodidacticism, since my old professors don't seem to care enough about PCT to get involved with my work). But there's good news: I have already arrived at the basic solution strategy for every single currently unsolved math problem in existence (including the difficult ones which have eluded us since the time of the ancients, as well as the modern-day problems of quantum mechanics).

I hadn't realized that we had the greatest mathematician of all time among us. This is indeed a great honour. Resolving the Riemann zeta problem will at least get you a Fields Medal. Congratulations in advance.

Consciousness is the controlled perception of the purpose of behavior (i.e. the reference value)
[MT] ??? At a loss to even guess at what you mean by this ???

This made me laugh. I always wondered how the PCT crowd would react when I tried to give them my exact definition of consciousness.

When will you do that? The words quoted above say nothing. Perhaps they mean something to you, but they don't to me. "The controlled perception of the purpose of behaviour" uses words that have specific referents in PCT, but that cannot sensibly be put together in that way. It makes as much sense to me as "The door to the skylight of the basement floor".

This definition is something which I developed in order to explain how it would be possible to control the purpose of your behavior.

But what does that mean? "To control the purpose of your behaviour" is word salad. I've tried to figure out what it could possibly mean, but the nearest I can come to it is time multiplexing across control systems that would otherwise be in conflict. You have to shift from controlling one variable to controlling another because you have only one pair of hands and can't be in two places at once. But that seems too obvious and mundane to require months of deep thought, so you must mean something different and obscure.

Now, I notice my definition has provoked genuine surprise and uncertainty.

No. Bewilderment and a request for translation or elaboration so that it might come to mean something -- anything. If it is translated into an intelligible form, then it might provoke surprise and/or uncertainty. Or it might not.

I thought this definition of consciousness would be self-explanatory and would immediately make sense, because I noticed that such a combination of phrasing had never even been considered by Powers, even though it is generally of the same form as the basic mantra of PCT, and so I thought it would have an intuitive appeal.

Yes, you can make parallel mantras. PCT has two mantras "All behaviour is the control of perception" and "You can't tell what someone is doing by looking at what they are doing". Yours seems to parallel neither of these, and I don't see the metaphoric parallels either, so it remains mysterious.

According to PCT, all behavior is the control of a perceptual variable.

So far, so good.

In essence, the first thing I have done is to equate consciousness with behavior (referring to the general term "consciousness" as "conscious behavior").

What justifies this equivalence? It seems to me to be arbitrary and indefensible. Arbitrary because it is by fiat, and indefensible because it would require every action we produce to be present in consciousness, which is clearly not the case.

Then I equated the purpose of behavior with the reference variable, just as I had seen done before.

That's fair enough by itself. The current value of the reference variable is often described as the current purpose of that particular elementary control unit.

I then had to contend with the issue of how one could go about controlling this variable.

That would be a definite departure from Bill's HPCT, which is fine. Bill never expected HPCT to be the full story. Personally, I have preferred to see how far we can go without deviating from HPCT, though I think of many ways in which it will probably need to be changed to explain all that we and other living things do. So if you want to create a structure in which one control system's controlled perception is the reference value of another, that creates no problem for me. However, I think it would create a problem for the behaving structure of control systems.

Consider the example Bill uses, of steering a car. At one level the driver is controlling a perception of where in the lane the car is at the moment. If it is further left than its reference value, that control system produces output that influences the reference value for the angle of the steering wheel. What would happen if there was another control system that actually perceived the reference value for the steering wheel angle and controlled its perception of that reference value? The control system that is controlling the perception of the position of the car in its lane can't control that reference value, because it is already controlling the lane position perception, so it must be quite another control system controlling the wheel-angle reference value. The control system that controls the perception of where the car is in its lane could no longer influence the angle of the steering wheel because it is being controlled by the other control system, and the car would probably go off the road.

In HPCT, no reference value is controlled, and I find it hard to envisage a viable construction in which it would be useful for reference values to be controlled. But you have envisaged such a structure. Maybe it would be helpful if you were to show a sketch of how such a structure would work without getting itself tied in knots. Better yet, make a simulation in some platform-independent code, to show that it would actually work.
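For what it's worth, a toy simulation of the normal (HPCT) version of the steering example -- a lane-position unit sending a reference to a wheel-angle unit, with nothing controlling any reference value -- can be written in a few lines. All the gains, the time step, and the simplified car "physics" below are my own arbitrary choices, not values from any of Bill's demos.

    # Toy two-level hierarchy: the lane-position unit's output is the
    # reference for the wheel-angle unit. Parameter values are illustrative.
    dt = 0.01
    lane_ref = 0.0        # desired lateral position in the lane
    lane_pos = 1.0        # start 1 m away from the reference position
    wheel_angle = 0.0
    wheel_out = 0.0

    for step in range(3000):                  # 30 seconds of simulated time
        crosswind = 0.3                       # constant disturbance on lane position

        # Higher level: perceive lane position, output a wheel-angle reference.
        lane_error = lane_ref - lane_pos
        wheel_ref = 20.0 * lane_error         # proportional output function

        # Lower level: control perceived wheel angle toward that reference.
        wheel_error = wheel_ref - wheel_angle
        wheel_out += 50.0 * wheel_error * dt  # integrating output function
        wheel_angle = wheel_out               # actuator treated as ideal

        # Environment: wheel angle steers the car, wind pushes it sideways.
        lane_pos += (0.5 * wheel_angle + crosswind) * dt

    print(round(lane_pos, 3))   # ends near lane_ref (about 0.03 m residual error)

If a third system were given control of wheel_ref itself, the lane-position unit would lose its means of influencing lane_pos, which is the knot described in the paragraph above.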

PCT essentially points out that the most any control system can do to control some variable is to control a perception of that variable.

Not "the most any control system can do" -- it's what a control system does, and the only thing it does.

This is absolutely basic. So I figured the next question should be, what is it which allows you to actually have a perception of something.

In the eye, rods and cones, then retinal ganglia and other neural structures that produce signals in the optic nerve, and so on and so forth. At every stage, the patterns derived (at least in part) from the patterns of photons impinging on the retina are "perceptions" in PCT. Some of these are controlled. Most are not.

This is a tough question, but I learned it becomes easier to ask the more specific question, what is it which allows you to have a controlled perception of something. That's when I incorporated information theory and Shannon's information channel capacity into the consideration, and the solution eventually became apparent once we merge the concepts of perception and communication (I will provide you with this explanation soon).

The sooner the better. At the moment, my information analysis is not in closed form. Yes, channel capacity is important. So is the disturbance information rate, the reference information rate, the loop delay, the information loss rate over storage time (e.g. the leak -- "slowing"-- rate in the Powers leaky integrator output stage), and a few other small details. In open form, one can at least say that the quality of control is limited by the disturbance information rate and the loop transport lag, but that's a very loose kind of bound. It would be nice to have a formulation in which all the relevant aspects are taken into account, even for one elementary control loop.
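For readers who have not met it, the "slowing" leak mentioned here refers to the leaky-integrator output stage Powers used in his simulations. A common discrete form of such an output function looks something like the sketch below; the particular numbers and names are mine, chosen only to show where the leak enters.

    def leaky_integrator_output(output, error, gain=100.0, slowing=0.05):
        # Move the output a fraction 'slowing' of the way from its current
        # value toward gain*error each time step. When error is zero the
        # stored output decays by a factor (1 - slowing) per step -- the leak
        # that loses information over storage time.
        return output + slowing * (gain * error - output)

    out = 10.0
    for _ in range(20):
        out = leaky_integrator_output(out, error=0.0)
    print(round(out, 3))    # about 3.585: the stored value has leaked away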

[MT] According to PCT, there is no control of output, and hence no control of reference values at the level below. Output is what it needs to be to bring and keep the controlled variable (the perception) near its reference value, a value supplied by the output of a higher system unless the control unit of concern is at the top of the hierarchy -- in which case the reference value is taken to be a fixed zero because there is no corresponding input signal.

Basically, this is the model of behavior without consciousness involved. In More Mind Readings, Rick explicitly points out that control systems can be modeled without any incorporation of consciousness into the system. This is true. But remember, consciousness has never been properly defined.

Quite true. And unless you can make your definition both intelligible and conforming to experience, it remains true so far as I am concerned.

[MT] "Awareness" seems to entail being conscious of something, for sure, but I don't remember Powers ever defining "consciousness". Earlier you said it had been defined as the process of reorganization, which doesn't seem much like the definition here.
[Me] Powers defined consciousness as the perception and awareness of the perception.
[MT] Could you provide an article and page reference for this?
In the 1973 edition of B:CP, this definition is given as the working definition of consciousness on p. 200:
Consciousness consists of perceptions (presence of neural currents in a perceptual pathway) and awareness (reception by the reorganizing system of duplicates of those signals).
Now here's what I say: what if we imagine turning the reference variable into the variable being controlled by imagining a separate perceptual control loop (completely internal to the organism) which receives duplicates of REFERENCE signals as the perceptual signal and then controls these with a new set of references? Wouldn't this give you a picture of consciousness as the voice in your head saying "now I see I'm doing this one thing, and now I'm going to try to do this other thing..."? It also gives you a reason for "why" the conscious experience occurs. This is one way to look at it. But now that I think more about it, based on what I was rereading today from Wiener's Cybernetics, this explanation can also be summed up by considering what Wiener refers to as "the best characteristic frequency of filtration". I will look into this.

I don't think you have the mechanism quite right, if what you want to achieve is time multiplexing. What you have to do is influence the loop gains in such a way that the control unit with the purpose you want to give up gets a low or zero loop gain and the control unit with the purpose you want to start pursuing begins to have an appreciable (negative) loop gain. For dynamical reasons, the switch can't be instant, but it can be fast relative to the time scale of our shifts of consciousness. And for what it's worth, I proposed long ago on CSGnet that it is this shifting of control that is a large part of consciousness (another part of my proposal is related to states of poor control). But this is quite different again from your two previous definitions of consciousness.
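A bare-bones sketch of that gain-switching idea (entirely my own toy construction, not code from Powers or anyone else): two control units with incompatible references act on the same environmental variable, and their loop gains are cross-faded rather than switched instantly.

    # Time-multiplexing two otherwise conflicting control units by modulating
    # their loop gains; all values are illustrative.
    import math

    dt = 0.01
    qi = 0.0                   # the one environmental variable both units affect
    out = [0.0, 0.0]           # integrator outputs of the two units
    refs = [1.0, -1.0]         # incompatible references: conflict at full gain

    for step in range(2000):
        t = step * dt
        g0 = 0.5 * (1 + math.cos(2 * math.pi * t / 10.0))  # cross-fade, period 10 s
        gains = [20.0 * g0, 20.0 * (1 - g0)]

        for i in range(2):
            error = refs[i] - qi
            out[i] += gains[i] * error * dt   # a unit with ~zero gain stops acting

        qi = out[0] + out[1]   # both outputs feed the same variable

        if step % 500 == 0:
            print(f"t={t:4.1f}  qi={qi:6.2f}")
    # qi follows whichever reference currently has the high-gain unit behind it.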

...Powers' model is totally linear.

No it isn't. Powers has often used linear structures in his simulations, but he has also used nonlinear ones. Powers and Marken have at least one paper (apparently unpublished, but I'm sure Rick would send it to you) about logarithmic perceptual functions and the behavioral illusion. Furthermore, since the HPCT perceptual structure is exactly a multi-level perceptron, it cannot work if the perceptual functions are linear. Linear loops are, however, easier to analyze, and quite often a linear approximation works quite well, at least for small excursions. So the units are quite often described in linear terms without any caveats, which makes it easy to assume that linearity is an element of the theory.

[MT] Two questions: (1) how do you demonstrate that you are right? and (2) If you are right, what does it have to do with PCT? The point of question 2 is not to limit the scope of PCT, but to ask whether you have a modelable way of integrating consciousness into the PCT structure.

Take a person, stick him in the MRI. Run the experiment I'm developing.

When you describe it, we will know whether you have answered the question.

...This stuff [MT: esoteric mathematics] is important because if we are ever to differentiate between the terms "sensation" and "perception", we need to understand more about the mathematics of the Fourier and Laplace Transform.

Which both apply to linear systems. The form of the underlying maths, though, is exactly what the perceptron structure does. All the transforms do is rotate the data from one basis space into another. Doing that with nonlinear scaling allows the multiple levels of perception to deal with different kinds of structures in the environment. So yes, understanding the maths of the Fourier transform helps in understanding how that happens. Different rotations, such as short-time Fourier transforms, do different things. But it is indeed good to understand that they are all squashed rotations.
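A small illustration of the "squashed rotation" point, in code of my own (not anyone's model of perception): the discrete Fourier transform is a unitary change of basis, and a perceptron layer is the same kind of linear map followed by a pointwise nonlinearity.

    # "Squashed rotation": a unitary change of basis followed by a pointwise
    # nonlinearity -- structurally what one layer of a perceptron computes.
    import numpy as np

    x = np.random.randn(8)                   # an arbitrary input pattern

    F = np.fft.fft(np.eye(8)) / np.sqrt(8)   # unitary DFT matrix (a "rotation")
    rotated = F @ x                          # pure change of basis, nothing lost
    squashed = np.tanh(rotated.real)         # nonlinear scaling of each coordinate
                                             # (real part only, to keep the toy simple)

    # A perceptron layer has the same shape, with a learned W in place of F:
    W = np.random.randn(8, 8)
    layer_output = np.tanh(W @ x)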

[MT] I'm not clear what you mean by "reference perception". In the HPCT structure, each individual elementary control unit has a "perception" that is in the feedback loop and a "reference" that is not. When you use the words as though "reference" were a kind of "perception" you are outside the structure of PCT defined by Powers. The word pair may mean something in some other theory, but it means nothing to me.

-_- i am not impressed by this discussion


But yes


[Martin Taylor 2014.01.28.23.05]

Rupert, have you considered the possibility that Philip is running a Test for the Controlled Variable?
Martin


[Martin Taylor 2014.01.28.23.59]

I can't speak for Kennaway, but as a sample of my approach, I can refer you to two of several messages from about a year ago. In case you haven't been keeping your own archive of CSGnet messages for that long, I copy the main body of the messages here, despite their length. The first was after quite a long thread on the topic, rife with misunderstandings about the nature of information as Shannon described it. The second was intended as part of a tutorial series on information theory, in part to dispel these misunderstandings. I think these two should be enough to be getting on with.
Martin

[Attachments: AliceScales.jpg, ExponentialTrajectories.jpg, ExponentialTrajectories2.jpg, Re Negative feedback too simpl3.jpg, Re Negative feedback too simpl4.jpg, HistoryInformation.jpg]

···

[Martin Taylor 2012.12.08.11.32]

[From Bruce Abbott (2012.12.07.1830 EST)]

Rick Marken (2012.12.07.1150) --

One of us is confused here – and I don't think it's me! (nor Martin)

Martin Taylor (2012.12.06.23.52)

MT: ... Somehow, the information from the disturbance appears at the output, and the only path through which this can happen is by way of the internal circuitry of the control system.

RM: You are assuming that the only way for information about the disturbance to appear at the output is for this information to have gone through the organism.

How else could it get there? Magic?

This is just a fancy way of saying that the causal (S-R) model of behavior must be true.

Not so. It's a circular loop of causality, but each element within the loop transmits information only one way, from CV to perceptual signal to error signal to output and back to CV. And because of loop delay, these effects do not propagate instantly and simultaneously around the loop.

Let's see if we can take a different tack to make it more intuitively easy to see how information from the disturbance gets to the output even though the input quantity – the controlled environmental variable – is influenced by both the disturbance and the output of the control system. I'll borrow from the usual way differential calculus is introduced, by treating a smooth waveform as though it was created by making a number of step changes, and reducing the time between changes so they become indefinitely small. But first, I want to illustrate how control can be seen as measurement.

To see why control can be considered to be measurement, think of this example. Alice wants to know the weight of a stone she has picked up. She has a balance scale and a set of weights weighing 2, 1, 1/2, 1/4 … kilos. She puts the stone on one pan, and the scale tilts down to the side on which she put the stone, so she puts a weight on the other pan. The tilt stays the same, so she adds another weight and the scale tilts the other way. Alice removes the last weight she added and adds the next lighter one. She keeps adding and removing weights until the scale stays level or she runs out of ever smaller weights.

What is Alice doing? Alice is performing the actions of the output function of a control loop, looking at the error that is shown by the tilt of the scale, and altering her output (the weights on the other pan) until the error is zero. The output, which is the sum of the weights in the other pan, now is a measure of the weight of Alice's stone in exactly the same way that the output of any control system measures the disturbance to its controlled environmental variable.

Of course, there need be no human Alice in this story. The perceptual function signals only the direction in which the scales tilt, so the error is only a binary value, which could be called "1" or "0", "left" or "right", "too heavy" or "too light", or any other contrasting labels. I will call the values "left" and "right" according to which side of the scale is heavier. Likewise, there is no need for Alice to provide the output function. It could be a mechanical device that is provided with the weights that have values in powers of two times 1 kilo, with 2 kilos the heaviest. We can assume that the scale would break if the stone was over 4 kilos!

The output device would load and unload these weights onto and off the scale pan according to the following algorithm. The stone is on the left pan.

1. Add to the right-hand pan the heaviest weight not yet tried (initially, since none have been tried, that means the 2 kilo weight).

2. If all the available weights have been tried, stop. Else...

3a. If the error is "left", go to step 1 (there is not enough weight in the right-hand pan).

3b. If the error is "right", remove the lightest weight in the pan, add the heaviest weight not yet tried, and go back to step 2.

  At the end of this process, the balance is as close to level as the machine

(or Alice) can make it using the available weights. Anyone who
wanted to know the weight of the stone could simply read out the
weights in the pan as a binary number of kilos, with the most
significant digit corresponding to the 2 kilo weight. Those
weights are the current output value of the control system,
which, without Alice, is a perfectly standard control loop.
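
  For anyone who wants to play with the idea, here is a minimal Python
sketch of that weight-loading procedure (the function name, the chosen
weight set, and the test value are my own illustrative choices, not
anything taken from an existing demo):

    def weigh(stone_kg, weights=(2, 1, 0.5, 0.25, 0.125, 0.0625)):
        # Successive-approximation "control loop" for the balance scale.
        # The error signal is only the direction of tilt: "left" if the
        # stone side is heavier, "right" if the weight pan is heavier.
        pan = []                      # weights currently in the right-hand pan
        for w in weights:             # the heaviest weight not yet tried
            pan.append(w)
            if sum(pan) > stone_kg:   # error is "right": too much weight
                pan.pop()             # remove the weight just added
            # error is "left": keep the weight and try the next lighter one
        return pan

    pan = weigh(2.7)
    print(pan, "->", sum(pan), "kg")  # [2, 0.5, 0.125, 0.0625] -> 2.6875 kg

The weights left in the pan are the binary read-out described above:
the presence or absence of the 2, 1, 1/2, ... kilo weights.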

  One could continue this example using a continuously varying

weight instead of the stone, but I think it is better to talk
about the general case of an idealized control system with no
noise and no transport lag. Transport lag introduces an
independent limit on the quality of control, a limit that has
already been mentioned in this complex of threads. It also
introduces some complication in the other computations,
complication that is unnecessary in this illustrative discussion.

  We consider the "basic-standard" control system that has a pure

integrator as its output function.

  First, let's treat the effect of one step change in the

disturbance value. We assume that the error is initially zero, and
at time t0 the disturbance suddenly goes from whatever value it
may have had previously to d0. We treat this single step, and then
consider the effects of a series of steps. We start by showing how
the integrator rate parameter controls how fast information about
the disturbance arrives at the output – how fast the control
system measures the disturbance value.

  Let us trace the signal values at various places in the loop as

time passes after the step at time t0. At time t0+e (e means
epsilon, an indefinitely small amount), the error value is -d0 and
the output value qo is still whatever it was before the step (here
assumed to be zero), so it has an unchanged influence
on the value of the input quantity qi. As time passes, the output
value changes exponentially until after infinite time it reaches
-d0. Assuming qo was initially zero, then at time t:

  qo(t) = -d0*(1-e^(-G*(t-t0))).

  At the same time, the input value qi approaches zero by the same

exponential decay:

  qi(t) = d0*e^(-G*(t-t0))
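
  To see the two exponentials numerically, here is a small simulation
sketch of the basic-standard loop (my own illustrative code, using
Euler integration, a unity-gain environmental feedback path, a zero
reference value, and an arbitrary G of 5 per second):

    import math

    G, dt = 5.0, 0.001       # integration rate (1/s) and time step (s)
    d0, r = 1.0, 0.0         # disturbance step value and reference value
    qo = 0.0                 # output quantity, assumed zero before the step

    for step in range(1, 2001):             # two seconds of simulated time
        t = step * dt
        qi = d0 + qo                        # input quantity = disturbance + output
        e = r - qi                          # perception = qi, so error = r - qi
        qo += G * e * dt                    # pure integrator output function
        if step % 500 == 0:
            predicted = -d0 * (1 - math.exp(-G * t))
            print(f"t={t:.1f}s  qo={qo:+.4f}  predicted={predicted:+.4f}")

The simulated qo tracks the formula above, and qi = d0 + qo shows the
matching exponential decay toward zero.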

  Exponential changes in value are quite interesting in information

theory, as they often indicate a constant bit rate of information
transfer. To see this, you must understand that although the
absolute uncertainty about the value of a continuous variable
either before or after a measurement or observation is always
infinity, the information gained through the measurement or
observation is finite. If the range of uncertainty about the value
(such as standard deviation of measurement) is r0 before the
measure and r1 after the measure, the information gained is
log2(r0/r1) bits.
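
  As a made-up numerical illustration of that formula: if an observation
shrinks the standard deviation of your estimate from 8 units to 1 unit,
you have gained log2(8/1) = 3 bits.

    import math

    r0, r1 = 8.0, 1.0            # range of uncertainty before and after
    print(math.log2(r0 / r1))    # 3.0 bits: each halving of the range is one bit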

  Measurements and observations are never infinitely precise. When

someone says their perception is subjectively noise-free and
precise, they nevertheless would be unable to tell by eye if an
object changed location by a nanometer or declined in brightness
by one part in a billion. There is always some limit to the
resolution of any observation, so the issue of there being an
infinite uncertainty about the value of a continuous variable is
never a problem. In any case in which a problem of infinite
uncertainty might seem to be presenting itself, the solution is
usually to look at the problem from a different angle, but if that
is not possible it is always possible to presume a resolution
limit, a limit that might be several orders of magnitude more
precise than any physically realizable limit, but a limit
nevertheless.

  (If you want to get a more accurate statement, and see the

derivation of all this, I suggest you get Shannon’s book from
Amazon at the link I provided earlier.)
In the above example, qo could have started at any value whatever,
and would still have approached -d0 exponentially with the same
time constant. The output value qo has the character of a
continuous measurement of the value of the disturbance, and this
measurement gains precision as the remaining range of uncertainty
shrinks in proportion to e^(-G*(t-t0)). (The units of G are
1/seconds).
What is this rate of gain in precision, in bits/sec? One bit of
information has been gained when the uncertainty about the value
of d0 has been halved. In other words, the output has gained one
bit of information about the disturbance by the time tx such that
e^(-G*(tx-t0)) = 0.5
Taking logarithms (base 2) of both sides, we have
log2(e)*(-G*(tx-t0)) = -1
tx-t0 = 1/(G*log2(e)) = 1/(G*1.443…) seconds
That is the time it takes for qo to halve its distance to its
final value -d0 no matter what its starting value might have been,
which means it has gained one bit of information about the value
of the disturbance. The bit rate is therefore G*1.443… bits/sec.
That is the rate at which the output quantity qo gains information
about the disturbance, and also the rate at which the input
quantity loses information about the disturbance. The input
quantity must lose information about the disturbance, because it
always approaches the reference value, no matter what the value of
the disturbance step. I suppose you could look upon G*1.443… as
the bit rate of the loop considered as a communication channel,
but it is probably not helpful to do so, as the communication is
from the input quantity back to itself. Without the complete loop,
qo does not act as a measure of d0, but continues increasing
negatively and linearly without limit at a rate G*d0. It is better
just to think of G*1.443… as the rate at which information from
the disturbance appears at the output quantity.
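A quick numerical check of that time-to-one-bit result, with an
arbitrary illustrative value of G:

    import math

    G = 5.0                                    # integrator rate parameter, 1/s
    T = G * math.log2(math.e)                  # information rate, about 7.21 bits/s
    t_one_bit = 1.0 / T                        # about 0.139 s per bit
    print(T, t_one_bit)
    print(math.exp(-G * t_one_bit))            # 0.5: the remaining distance has halved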
As a further matter of clarification, I suppose we ought to
mention the behavioural illusion here. If the environmental
feedback path is not a simple connector with a gain of unity, the
output quantity differs from the disturbance value. The mismatch
does not affect the information relationship between the
disturbance and the output quantity, since the environmental
feedback path provides a single-valued environmental feedback
function. It is always possible to compute the influence of the
output on the input quantity if you know the output value q0. This
means that q0 and the influence of the output on the input
quantity are informationally identical, whatever their functional
relationship, providing the environmental feedback function does
not change. If it does change, the effect is similar to adding a
noise source to a communication channel, a situation that was
Shannon’s primary concern and provides no particular complication
to the analysis. (Or in the case of Marken’s demo that involves
switching the sense of the environmental feedback path, the
communication analogy is the band-switching secret communication
procedure patented by the movie star Hedy Lamarr; in both cases,
the switching must be separately detected or otherwise known).
As a specific example, if the environmental feedback path has a
gain of 2, the output quantity will converge to a value of d/2.
The convergence to d/2 is at the same exponential rate as would
have been the case for an environmental feedback path gain of 1.0,
providing an information gain rate that is determined only by the
integration rate parameter G. Information from the disturbance
still arrives at the output quantity (and is lost from the input
quantity) at a rate G*1.443…
From now on, we will give the label “T” to the rate at which the
output gains information about the disturbance: T = G*1.443… bits/sec.
The rate of information gain about the disturbance value at the
output quantity is independent of the disturbance waveform and of
the environmental feedback function. It is a property of the
internal circuitry of the control system. Accordingly, we can make
some estimates about the achievable precision of control for
disturbances that are not simple step functions. First, however,
we should look at the effect of a second and subsequent step
change in the disturbance value.
The disturbance value changed abruptly to d0 at time t0. Now at
time t1 the disturbance value jumps to d1. The input quantity adds
d1-d0 to whatever value it had arrived at by t1. The error value
jumps by the same amount, but the output value qo does not change
immediately, because of the integrator output function. It starts
changing exponentially toward (the negative of) the new value of
the disturbance.
At t1+e, the output quantity has a value that is unknown, because
ever since t0 it has been in the process of approaching -d0
exponentially from some unknown starting point. The uncertainty of
its current value is T*(t1-t0) bits less than it had been at time
t0, because it was performing a process functionally equivalent to
measuring the value d0. Now it has to start forgetting the value
d0 and start measuring the value d1. Less colloquially, the mutual
information between the output quantity and the disturbance value
is instantly decreased by the uncertainty of the change in
disturbance value, while the mutual information between the
disturbance value and the input value is increased by the same
amount.
By how much are the mutual information values changed by the new
step? Since the absolute uncertainty of a continuous variable is
infinite, can that change be measured? Shannon shows again that it
can. The amount is log2(r_new/r_old) where r_new is the new range
of uncertainty (such as standard deviation) and r_old is the old
range of uncertainty. The new range of uncertainty depends on the
probability distribution of values that can be taken by the
disturbance at that time, which may be unknown. It is not a new
unknown, however, since the original disturbance step came from
the same distribution, and the original value of the output
quantity probably came from the same distribution (modified, of
course, by the inverse of the environmental feedback function) if
the system had been controlling prior to the first step that we
analyzed.
After the second step change in the disturbance value, the output
quantity is gaining information, still about the disturbance
value, but not about d0. Now it is gaining information about d1
and losing information about d0. At the same time, the input
quantity is losing information about d1, as well as about any
remanent information about d0 that might still have been there at
time t1. All this is happening at a rate of T = G*1.443…
bits/sec.
The effects of third and subsequent steps in the value of the
disturbance are the same. The output continues to gain information
about the most recent value of the disturbance at T bits/sec while
losing information about prior values at the same rate, and while
the input also loses information about the new and old values at T
bits/sec. Making the steps smaller and closer in time, we can
approximate a smooth waveform with any desired degree of
precision. No matter what the disturbance waveform, the output is
still getting information about its value at T bits/sec, and
losing information about its history at the same rate.
So long as the disturbance variation generates uncertainty at less than T
bits/sec, control is possible, but if the disturbance variation generates
it faster than T bits/sec, control is lost. The spike in the input
value at a step change in the disturbance is an example of that:
control is completely lost at the moment of the step, and
gradually regained in the period when the disturbance remains
steady, generating no new uncertainty.

This is getting rather too long for a single message, so I will
stop there. I hope it explains just how information gets from the
disturbance to the output, and how fast it does so in the case of
the ideal noiseless control system with no lag and a pure
integrator output function.
Martin

http://www.amazon.com/Mathematical-Theory-Communication-Claude-Shannon/dp/0252725484/ref=sr_1_1?s=books&ie=UTF8&qid=1354736909&sr=1-1&keywords=Shannon%2C+C.+E.

      [Martin Taylor

2013.01.01.12.40]

        As you can see by the date stamp, I started this on New

Year’s Day, expecting it to be quite short and to be
finished that day. I was wrong on both counts.

        This message is a follow-up to [Martin Taylor

2012.12.08.11.32] and the precursor of what I hope will be
messages more suited to Richard Kennaway’s wish for a
mathematical discussion of information in control. It is
intended as part 1 of 2. Part 1 deals with the concepts and
algebra of uncertainty, information, and mutual information.
Part 2 applies the concepts to control.

        Since writing [Martin Taylor 2012.12.08.11.32], which showed

the close link between control and measurement, and followed
the information “flow” around the loop for a particular
control system, I have been trying to write a tutorial
message that would offer a mathematically accurate
description of the general case while being intuitively
clear about what is going on. Each of four rewrites from
scratch has become long, full of equations, and hardly
intuitive to anyone not well versed in information theory.
So now I am trying to separate the objectives. This message
attempts to make information-theoretic concepts more
intuitive with few equations beyond the basic ones that
provide definitions. It is nevertheless very long, but I
have avoided most equations in an attempt to make the
principles intelligible. I hope it can be easily followed,
and that it is worth reading carefully.

        My hope is that introducing these concepts with no reference

to control will make it easier to follow the arguments in
later messages that will apply the concepts and the few
equations to control.

        This tutorial message has several sections, as follows:
        Basic background
           concepts
           definitions
              uncertainty
                  comments on "uncertainty"
              information
                  Information as change in uncertainty
                  mutual information
        Time and rate
           Uncertainty
           Information
           Information and history
        Some worked out example cases
           1. A sequence like "ababababa..."
           2. A sequence like "abacacabacababacab...."
           3. A sequence in which neighbour is likely to follow neighbour
           4. A perfect adder to which the successive input values are independent
           5. A perfect adder to which the successive input values are not independent
           6. Same as (5) except that there is a gap in the history – the most recent N input samples are unknown.

        [Because I don't know whether subscripts or Greek will come

out in everyone’s mail reader, I use “_i” for subscript “i”,
and Sum_i for “Sum over i”. “p(v)” means “the probability of
v being the case”, so p(x_i) means “the probability that X is in
state i” and p_x_i(v) means “the probability that v is the
case, given that X is in state i”, which can also be written
p(v|x_i).]

        ------------Basic equations and concepts: definitions-------

        1. "Probability" and "Uncertainty"

        In information theory analysis, "uncertainty" has a precise

meaning, which overlaps the everyday meaning in much the
same way as the meaning of “perception” in PCT overlaps the
everyday meaning of that word. In everyday language, one is
uncertain about something if one has some idea about its
possible states – for example the weather might be sunny,
cloudy, rainy, snowy, hot, … – but does not know in which
of these states the weather is, was, or will be now or at
some other time of interest. One may be uncertain about
which Pope Gregory instituted the Gregorian calendar, while
feeling it more likely that it was Gregory VII or VIII than
that it was Gregory I or II. One may be uncertain whether
tomorrow’s Dow will be up or down, while believing strongly
that one of these is more probable than that the market will
be closed because of some disaster.

        If one knows nothing about the possibilities for the thing

in question (for example, “how high is the gelmorphyry of
the seventh frachillistator at the winter solstice?”) one
seldom thinks of oneself as “uncertain” about it. More
probably, one thinks of oneself as “ignorant” about it.
However, “ignorance” can be considered as one extreme of
uncertainty, with “certainty” as the other extreme.
“Ignorance” could be seen as “infinite uncertainty”, while
“certainty” is “zero uncertainty”. Neither extreme condition
is exactly true of anything we normally encounter, though
sometimes we can come pretty close.

        Usually, when one is uncertain about something, one has an

idea which states are more likely to be the case and which
are unlikely. For example, next July 1 in Berlin the weather
is unlikely to be snowy, and likely to be hot, and it is
fairly likely to be wet. It is rather unlikely that the Pope
is secretly married. It is very likely that the Apollo
program landed men on the moon and brought them home. One
assigns relative probabilities to the different possible
states. These probabilities may not be numerically exact, or
consciously considered as numbers, but if you were asked if
you would take a 5:1 bet on it snowing in Berlin next July 1
(you pay $5 if it does snow and receive $1 if it doesn’t
snow), you would probably take the bet. If you would take
the bet, your probability for it to snow is below 0.2.

        Whether "uncertainty" is subjective or objective depends on

how you treat probability. If you think of probability as a
limit of the proportion of times one thing happens out of
the number of times it might have happened, then for you,
uncertainty is likely to be an objective measure. I do not
accept that approach to probability, for many reasons on
which I will not waste space in this message. For me,
“probability” is a subjective measure, and is always
conditional on some other fact, so that there really is no
absolute probability of X being in state “i”.

        Typically, for the kind of things of interest here, when the

frequentist “objective” version is applicable, the two
versions of probability converge to the same value or very
close to the same value. Indeed, subjective probability may
be quite strongly influenced by how often we experience one
outcome or another in what we take to be “the same”
conditions. In other cases, such as coin tossing, we may
never have actually counted how many heads or tails have
come up in our experience, but the shape of the coin and the
mechanics of tossing make it unlikely that either side will
predominate, so our subjective probability is likely to be
close to 0.5 for both “heads” and “tails”. As always, there
are conditionals: the coin is fair, the coin tosser is not
skilled in making it fall the way she wants, it does not
land and stay on edge, and so forth.

        In order to understand the mathematics of information and

uncertainty, there is no need to go into the deep
philosophical issues that underlie the concept of
probability. All that is necessary is that however
probabilities are determined, if you have a set of mutually
exclusive states of something, and those states together
cover all the possible states, their probabilities must sum
to 1.0 exactly. Sometimes inclusivity is achieved by
including a state labelled “something else” or
“miscellaneous”, but this can be accommodated. If the states
form a continuum, as they would if you were to measure the
height of a person, probabilities are replaced by
probability density and the sum by an integral. The integral
over all possibilities must still sum to 1.0 exactly.

        -------definitions-----

        First, a couple of notations for the equations I do have to

use:

        I use underbar to mean subscript, as in x_i for x subscript

i.

        p(x_i) by itself means "Given what we know of X, this is the

probability for us that X is (was, will be) in state i" or,
if you are a frequentist, “Over the long term, the
proportion of the times that X will be (was) in state i is
p(x_i)”.

        I mostly use capital letters (X) to represent systems of

possible states and the corresponding lower-case letters to
represent individual states (x for a particular but
unspecified state, x_i for the i’th state).

        p(x_i|A) means "If A is true, then this is our probability

that X is in state i" or “Over the long term, when A is
true, the proportion of the time that X will be in state i
is p(x_i)”. In typeset text, sometimes this is notated in a
more awkward way, using A as a subscript, which I would have
to write as p_A(x_i).

        I will not make any significant distinction between the

subjective and objective approaches to probability, but will
often use one or the other form of wording interchangeably
(such as “if we know the state of X, then …”, which you
can equally read as “If X is in a specific state, then
…”). Depending on which approach you like, “probability”,
and hence “uncertainty”, is either subjective or objective.
It is the reader’s choice.

        I use three main capital letter symbols apart from the Xs

and Ys that represent different systems. U( ) means
“Uncertainty”, I( ) means Information, and M( : ) means
mutual information between whatever is symbolized on the two
sides of the colon mark. The parentheses should make it
clear when, say, “U” means a system U with states u_i and
when it means “Uncertainty”.

        -----definition of "uncertainty"----

        If there is a system X, which could be a variable, a

collection of possibilities, or the like, that can be in
distinguishable states x_i, with probabilities p(x_i), the
uncertainty of X is defined by two formulae,
one applicable to discrete collections or variables

        U(X) = - Sum_i (p(x_i) log(p(x_i)))

        the other for the case in which "i" is a continuous

variable. If “i” is continuous, p(x_i) becomes a probability
density and the sum becomes an integral

        U(X) = - Integral_i (p(x_i) log(p(x_i)) di).

        That's it. No more, no less. Those two equations define

“uncertainty” in the discrete and the continuous case.
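
        For readers who prefer to see the discrete definition as code, here
is a minimal sketch (my own) of the formula:

    import math

    def uncertainty(probs):
        # U(X) = -Sum_i p(x_i)*log2(p(x_i)), in bits.  The probabilities
        # must sum to 1.0; terms with p = 0 contribute nothing, since
        # p*log(p) tends to zero as p tends to zero.
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(uncertainty([0.5, 0.5]))      # 1.0 bit  (a fair coin)
    print(uncertainty([1.0, 0.0]))      # 0.0 bits (certainty)
    print(uncertainty([0.25] * 4))      # 2.0 bits (four equally likely states)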

        (Aside: Von Neumann persuaded Shannon to stop calling U(X)

“uncertainty” and to call it “Entropy” because the equations
are the same. This renaming was unfortunate, as it has
caused confusion between the concepts of uncertainty and
entropy ever since.)

        If X is a continuous variable, the measure of uncertainty

depends on the choice of unit for “i”. If, for example, X
has a Gaussian distribution with standard deviation s, and
we write H for log(sqrt(2*PI*e)), the uncertainty of X is
H + log(s). The standard deviation is measured in whatever
units are convenient, so the absolute measure of uncertainty
is undetermined. But whatever the unit of measure, if the
standard deviation is doubled, the uncertainty is increased
by 1 bit. Since we usually are concerned not with the
absolute uncertainty of a system, but with changes in
uncertainty as a consequence of some observation or event,
the calculations give well-defined results regardless of the
unit of measure. When the absolute uncertainty is of
interest, one convenient unit of measure is the resolution
limit of the measuring or observing instrument, since this
determines how much information can be obtained by
measurement or observation.
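
        A small numerical sketch (mine) of that last point, using the
Gaussian formula with logarithms to base 2:

    import math

    def gaussian_uncertainty_bits(s):
        # H + log(s) with H = log(sqrt(2*PI*e)); the absolute value
        # depends on the unit in which the standard deviation s is measured
        return math.log2(math.sqrt(2 * math.pi * math.e)) + math.log2(s)

    print(gaussian_uncertainty_bits(1.0))                                  # about 2.05 bits
    print(gaussian_uncertainty_bits(2.0) - gaussian_uncertainty_bits(1.0)) # exactly 1.0 bit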

          There are a few other technical issues with the uncertainty
    

of continuous variables, but calculations using uncertainty
work out pretty much the same whether the variable is
discrete or continuous. Unless it matters, in what follows I
will not distinguish between discrete and continuous
variables.

        --------comments on "uncertainty"-----

        Why is -Sum(p*log(p)) a natural measure of "uncertainty"?

Shannon, in his 1948 monograph that introduced information
theory to the wider world, proved that this is the only
function that satisfies three natural criteria:

        (1) When the probabilities vary continuously, the

uncertainty should also vary continuously.
(2) The uncertainty should be a maximum when all the
possibilities are equally probable.
(3) If the set is partitioned into two subsets, the
uncertainty of the whole should be the uncertainty about
which subset applies plus the weighted sum of the
uncertainties within the two parts, the weights being the total
probabilities of the individual parts. Saying this in the
form of an equation, if X has possibilities x_1, x_2, …
x_n and they are collected into two sets X1 and X2, then
U(X) = U(X1 versus X2) + p(X1)*U(X1) + p(X2)*U(X2),
where p(Xk) is the sum of the probabilities of the
possibilities collected into set Xk, U(X1 versus X2) is the
uncertainty of the two-way choice between the subsets, and
U(Xk) is computed from the probabilities renormalized within
set Xk.

        One important consequence of the defining equation U(X) =

-Sum_i(p(x_i)*log(p(x_i))) is that the uncertainty U(X) is
never negative: it ranges between zero (when one of the
p(x_i) is 1.0 and the rest are zero) and log(N) (when there
are N equally probable states of X). The logarithm can be to
any base, but base 2 is conventionally used. When base 2 is
used, the uncertainty is measured in “bits”. The reason for
using base 2 is convenience. If you can determine the actual
state through a series of yes-no questions to which every
answer is equally likely to be yes or no, the uncertainty in
bits is given by the number of questions.

        I will use the term "global uncertainty of X" for the case

when nothing is assumed or known about X other than its
long-term probability distribution over its possible states.
This is what is meant by “U(X)” in what follows, unless some
other conditional is specified in words.

        In certain contexts, uncertainty is a conserved quantity.

For example, if X and Y are independent, the joint
uncertainty of X and Y, symbolised as U(X,Y), is

        U(X,Y) = U(X) + U(Y)

        where U(X,Y) is the uncertainty obtained from the defining

equation if each possible combination of X state and Y state
(x_i, y_j) is taken to be a state of the combined system.
The concept can be extended to any number of independent
variables:

        U(X, Y, Z, W....) = U(X) + U(Y) + U(Z) + U(W) +...
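
        A quick check of that additivity for two independent systems,
reusing the defining formula (illustrative code only):

    import math

    def U(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    px = [0.5, 0.5]                          # X: a fair coin, 1 bit
    py = [0.25] * 4                          # Y: four equally likely states, 2 bits
    pxy = [p * q for p in px for q in py]    # joint distribution if X and Y are independent

    print(U(px) + U(py))                     # 3.0 bits
    print(U(pxy))                            # 3.0 bits: U(X,Y) = U(X) + U(Y)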

        Using the subjective approach to probability, "independent"

means “independent given what is so far known”. It is not
clear how the term can legitimately be used in a frequentist
sense, because any finite sample will show some apparent
relationships among the systems, even if an infinite set of
hypothetical trials would theoretically reduce the apparent
relationships to zero. A frequentist might say that there is
no mechanism that connects the two systems, which means they
must be independent. However, to take that into account is
to concede the subjectivist position, since the frequentist
is saying that as far as he knows , there is
no mechanism. Even if there was a mechanism, to consider it
would lead the frequentist into the subjectivist heresy.

        Uncertainty is always "about" something. It involves the

probabilities of the different states of some system X, and
probabilities mean nothing from inside the system in
question. They are about X as seen from some other place or
time. To say this is equivalent to saying that all
probabilities are conditional. They depend on some kind of
prior assumption, and taking a viewpoint is one such
assumption.

        Consider this example, to which I will return later: Joe

half expects that Bob will want a meeting, but does not know
when or where that is to be if it happens at all. What is
Joe’s uncertainty about the meeting?

        Joe thinks it 50-50 that Bob will want a meeting. p(meeting)

= 0.5

        If there is to be a meeting, Joe thinks it might be at Joe's

office or at Bob’s office, again with a 50-50 chance.

        If there is to be a meeting, Joe expects it to be scheduled

for some exact hour, 10, 11, 2, or 3.

        For Joe, these considerations lead to several mutually

exclusive possibilities:

        No meeting, p=0.5

        Bob's office, p = p(meeting)*p(Bob's office | meeting) =

0.5*0.5 = 0.25

        Joe's office p = p(meeting)*p(Joe's office | meeting) =

0.5*0.5 = 0.25

        Joe has 1 bit of uncertainty about whether there is to be a

meeting, and with probability 0.5, another bit about where
it is to be. Overall, his uncertainty about whether and
where the meeting is to be is

        U(if, where) = -0.5*log(0.5) - 2*(0.25*log(0.25)) = -0.5*(-1) -

0.5*(-2) = 1.5 bits

(the first term is the "no meeting" possibility, the second the
two "meeting at either office" possibilities)

        Another way to get the same result is to use a pattern that

will turn up quite often

        U(if, where) = U(if) + U(where|if) = U(if) + p(no

meeting)*U(where|no meeting) + p(meeting)*U(where|meeting) =
1 + 0.5*0 + 0.5*1 = 1.5 bits

        Joe is also uncertain about the time of the meeting, if

there is one. It could be at 10, 11, 2, 3 regardless of
where the meeting is to be. That represents 2 bits of
uncertainty, but those two bits apply only if there is to be
a meeting.

        U(time|meeting) = 2 bits

        U(time) = U(time|meeting)*p(meeting) + U(time|no meeting)*p(no meeting)
                = 0.5*2 + 0.5*0 = 1 bit.

        U(if, where, time) = 1.5 + 1 = 2.5 bits
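
        To make the bookkeeping explicit, here is a small sketch (my own)
that computes both numbers directly from the lists of mutually
exclusive possibilities:

    import math

    def U(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # whether and where: no meeting, Bob's office, Joe's office
    print(U([0.5, 0.25, 0.25]))                # 1.5 bits

    # whether, where and time: "no meeting" plus 2 offices x 4 times,
    # each meeting possibility having probability 0.5 * 0.5 * 0.25
    print(U([0.5] + [0.5 * 0.5 * 0.25] * 8))   # 2.5 bits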
    
        Compare this with Joe's uncertainty about the meeting if he

knew there would be a meeting, but did not know whether it
included Alice. Then we would have three mutually
independent things Joe would be uncertain about:

        U(Alice) = 1 bit

        U(office) = 1 bit

        U(time) = 2 bits

        Joe's uncertainty about the meeting would be 4 bits.

        Notice what happened here. We added a dimension of possible

variation independent of the dimensions we had already, but
eliminated the dimension of whether there would be a
meeting. That dimension (“if”) interacted with the other
two, because if there was to be no meeting, there could be
no uncertainty about its time and place. Adding an
independent dimension of variation adds to the uncertainty;
it does not multiply the uncertainty.

        U(meeting_configuration) = U(Alice) + U(office) + U(time)

        ------

        2. "Information" and related concepts

        "Information" always means change in uncertainty. It is a

differential quantity, whereas “uncertainty” is an absolute
quantity. I(X) = delta(U(X)).

        A change in the uncertainty about X may be because of the

introduction of some fact such as the value of a related
variable, or because of some event that changes the
probability distribution over the states of X. If it is
sunny today, it is more likely to be sunny tomorrow than if
today had happened to be rainy.

        To continue the "meeting" example, imagine that Joe received

a call from Bob saying that the meeting was on and in his
office, but not specifying the time. Joe’s uncertainty would
be 2 bits after the phone call, but was 2.5 bits before the
call. He would have gained 0.5 bit of information about the
meeting. If the call had said the meeting was off, Joe would
have no further uncertainty about it, and would have gained
2.5 bits. His expected gain from a call about whether the
meeting was off or on, and its location if it was on, would
be 0.5*0.5 + 0.5*2.5 = 1.5 bits, which is the uncertainty we
computed by a different route above for U(if, where).

        Often the "event" is an observation. If you observe a

variable quantity whose value you did not know exactly, your
uncertainty about it is likely (but not guaranteed) to be
lower after the observation than before. Again, thinking of
Joe’s meeting, if the call had simply told Joe the meeting
was on, without specifying the location, Joe’s uncertainty
about the meeting would have increased from 2.5 bits to 3
bits (1 bit for where the meeting is to be held and 2 bits
for when). In that case his information gain would have been
negative. However, averaged over all the possible messages
and taking account of their probabilities, the information
gain from a message or observation is always zero or
positive, even though there may be individual occasions when
it is negative.

        The information gained about X from an observation is:

        I_observation(X) = U_before-observation(X) -

U_after-observation(X)

        ---------

        Mutual Information

        The Mutual information between X and Y is the reduction in

the joint uncertainty of two systems X and Y due to the fact
that X and Y are related, which means that knowledge of the
state of one reduces uncertainty about the state of the
other. If the value of Y is observed, even imprecisely, more
is known about X than before the observation of Y.

        When two variables are correlated, fixing the value of one

may reduce the variance of the other. Likewise, we define
“mutual information” as the amount by which uncertainty of
one is reduced by knowledge of the state of the other. If
U(X|Y) means the uncertainty of X when you know the value of
Y, averaged over all the possible values of Y according to
their probability distribution, then the mutual information
between X and Y is

        M(X:Y) = M(Y:X) = U(X) - U(X|Y) = U(Y) - U(Y|X)

        Here is another generic equation for mutual information,

using the reduction in the joint uncertainty of the two
systems as compared to the sum of their uncertainties:

        M(X:Y) = U(X) + U(Y) - U(X,Y)

        from which it is obvious that M(X:Y) = M(Y:X) and that

M(X:Y) cannot exceed the lesser of the two uncertainties
U(X) and U(Y). If X is a single-valued function of Y
(meaning that if you know the value of Y, you can thereby
know the value of X) then M(X:Y) = U(X).

        M(X:Y) is never negative, though there may be specific

values of one of the variables for which the uncertainty of
the other is increased over its average value, as happened
when Joe was told the meeting was on without being told
where, when, or who was attending.

        Going back again to Joe's office meeting, suppose Joe knows

there is to be a meeting, but does not know where or when.
It might be at Bob’s office or at Bob’s house. If it is at
the office, it might be at 10 or at 2. If at the house, it
might be at 6 (with a dinner invitation) or at 9 (after
dinner).

        U(where) = 1 bit

        U(when) = 2 bits

        U(when) + U(where) = 3 bits

        But U(when, where) is not 3 bits, because there are actually

only four equiprobable combinations of place and time, 10
and 2 at the office, 6 and 9 at the house, so U(where, when)
= 2 bits. Therefore M(where:when) = 1 bit.
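
        The same numbers fall out of the generic formula; a short sketch
(mine) of the calculation:

    import math

    def U(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    U_where = U([0.5, 0.5])      # office or house
    U_when = U([0.25] * 4)       # 10, 2, 6 or 9
    U_joint = U([0.25] * 4)      # only four equiprobable (where, when) combinations

    print(U_where + U_when - U_joint)   # M(where:when) = 1 + 2 - 2 = 1 bit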

        -----time and rates----

        Since "information" is a difference quantity, the concept of

“information rate” becomes useful. It is the rate at which
uncertainty changes as a result of ongoing conditions or
events.

        Consider I(X|observation), the information gained about X as

the result of an observation. Imagine that a series of
observations are made at regular intervals. if X does not
change state over time, then each successive observation is
likely to further reduce U(X). Because of the definition of
“information” as “change in uncertainty”, the observer gains
information at a rate that is the average reduction in
uncertainty per observation divided by the time between
observations.

        If X is a discrete system, there is a limit to how much

information can be gained about X by observing it. If the
global uncertainty of X (what you believe about X before
observing it) is U(X), no amount of observation can reduce
that uncertainty below zero – the value of U(X) when you
know the state of X exactly. Accordingly, the information
gained by continued observation of a static quantity
plateaus at the global uncertainty of that variable – the
amount that the observer did not know about it before
starting the series of observations.

        (Figure 1)

        This is one place where the difference between continuous

and discrete variables matters. If the system X has a finite
number, N, of distinguishable states, its uncertainty cannot
be greater than log(N), so log(N) is the maximum amount of
information that could be obtained about it by continued or
repeated observation. But if the states of X form a
continuum, it has an infinite number of states, meaning that
in theory there is no limit to the amount of information
about it that could be gained by continued observation. In
practice, of course, the resolution of the observing system
imposes a limit. Precision is never infinite, even if the
limit is imposed by quantum uncertainty.

        What about the situation in which X is or may be changing?

If a changing X is observed once, the uncertainty about X
becomes a function of time since the observation. X may
change state continuously or abruptly, slowly or fast.
However it may change, the “information rate” of X is
defined as dU(X)/dt provided that the observation timing is
independent of anything related to a particular state or
state change of X.

        -----------

        [Aside: Here we have another example of the kind of

complementarity between two measures that is at the heart of
the Heisenberg Uncertainty Principle of quantum mechanics.
If you know exactly where something is, you know nothing
about its velocity, and vice versa. The rate at which you
get information from observing something can tell you about
both its value and the rate of change of value, but the more
closely you define one of these, the less you can know about the
other. The information rate determines the size of the
“precision*velocity” cell.]

        -----------

        If an observation is made at time t0 and never thereafter,

immediately after the observation at time t0+e (“e”
represents “epsilon”, a vanishingly small quantity) the
uncertainty of X is U_(t0+e)(X) . From that moment, U_t(X)
increases over time. The rate at which U_t(X) increases
depends, in the casual tautological sense, on how fast the
value of X changes. If the variable is a waveform of
bandwidth W, this can be made a bit more precise. The actual
rate depends on the spectrum of the waveform, but for a
white noise (the worst case) the rate is dU/dt =
W*log(2*pi*e*N), where N is the noise power (which again
depends on the choice of units).

        However, just as continued observation cannot yield more

information than the initial uncertainty of X, so continued
external influences or other cause of variation cannot
increase U_t(X) beyond its global value (its value
determined from the long-term statistics of X).

        (Figure 2)

        ----------

        3. Information and history

        In this section, we are interested in the mutual information

between the preceding history and the following observation,
which is symbolized M(X:Xhistory). In everyday language, we
want to know how much we can and cannot know about what we
will observe from what we have observed.

        Mutual information need not be just between the states of

system X and the states of a different system Y. It can be
between the states of system X at two different times, which
we can call t0 and t1. If t0 and t1 are well separated, the
state of X at time t0 tells you nothing about its state at
time t1. This would be the case at any time after Tc in
Figure 2. But if t1 follows t0 very closely (i.e. near the
left axis in Figure 2), it is very probable that the state
of X has not changed very much, meaning that the Mutual
Information between the two states is large. They are highly
redundant, which is another way of saying that the mutual
information between two things is close to its maximum
possible value.

        Suppose X is the weather. If it is sunny at 10 o'clock, it

is likely to be sunny at one second past ten, or one minute
past ten. The probability of it remaining sunny is still
pretty high at 11, and still moderately high at ten
tomorrow. But by this time next week, the observation that
it is sunny now will not tell you very much more about
whether it will be sunny then than will a book about the
local climate at this time of year.

        As the example suggests, M(X_t:X_(t-tau)) -- the complement

of U_(t-tau)(X|observation at t0) shown in Figure 2-- is a
function of tau that usually decreases with increasing tau.
I say “usually”, because it is not true of periodic or
nearly periodic systems. The temperature today is likely to
be nearer the temperature this time last year than to the
temperature at some randomly chosen date through the year.
For now we will ignore such cyclic variation, and treat
M(X_t:X_(t-tau)) as if it declined with increasing tau,
until, at some value of tau, it reaches zero.

        ------footnote-------

        For a signal of bandwidth W, M(X_t:X_(t-tau)) reaches zero

at tau = 1/2W. No physical signal is absolutely restricted
to a precisely defined bandwidth, but the approximation can
be pretty close in some cases. Good audio sampling is much
faster than one sample every 1/2W seconds, because the physical
signals of musical instruments have appreciable power at
frequencies well above the listener’s hearing bandwidth.

        -----end footnote-----

        Suppose a signal is sampled at regular intervals separated

by tau seconds, with tau small enough that knowledge of the
signal value at one sample reduces the uncertainty of the
state of the next and previous samples. The probability
distribution of the next sample is constrained by the values
of the previous samples. One might think of the value of the
sample as being that of a perception, but that is not
necessary.

        (Figure 2)

        Figure 2 suggests how knowledge of the past values may

constrain the probability distribution of the next sample
value when the inter-sample time, tau, is short compared to
the rate of change of the variable.

        The first panel suggests the uncertainty of the next sample

if one knows nothing of the history of the variable. If the
variable is from the set of possible states of X, its
uncertainty is U(X). The second and third panels show how
the uncertainty is reduced if the values of the previous one
or two samples are known. In the middle panel, one possible
value of the previous sample is shown, and in the third
panel the values of the two previous samples are taken into
account.

        The figure is drawn to show the changes in a slowly changing

waveform, but that is just one possibility. It could equally
easily represent the probabilities in a sequence of letters
in a text. For example, the left panel might represent the
possibility of observing any letter of the alphabet, the
middle the distribution of the next character after
observing a “t”, and the right panel the distribution of the
following character after observing “th”. Since we assume
that the statistics of X are stationary (i.e. do not depend
on when you sample them, provided that choice of the
sampling moment is independent of anything happening in X),
M(X0:X1) = M(X1:X2) = M(X2:X3)…

        We want to know the uncertainty of X at a given moment t0

given its entire previous history. If X is a slowly varying
waveform, its uncertainty knowing the previous history right
up to t0 is zero, which is uninteresting. In that case we
have to ask about its history only up to some time t0-tau.
We will actually consider the history to consist of
successive samples separated by tau seconds. In the case of
a discrete system such as the letters of a text, no such
restriction is required, but by making the restriction on
the sampling of the continuous waveform we can treat both
kinds of system similarly.

        Formally, one can write

        U(X0|History) = U(X0) - M(X0:History)

        but that is not much use unless one can calculate the mutual

information between the next sample value x0 and the history
of system X. It is possible again to write a general
formula, but its actual calculation depends on the
sequential statistics of X.

        The History of X can be written as a sequence of samples

counting back from the present: X1, X2, X3, … and
M(X0:History) can be written M(X0:X1, X2, X3,…)

        M(X0:History) = M(X0:X1) + M(X0:X2|X1) + M(X0:X3|X2,X1) +

M(X0:X4|X1,X2,X3) + …

        In words, the mutual information between the next sample and

the history can be partitioned into the sum of an infinite
series of mutual information elements, each one being the
amount one sample contributes to the total mutual
information, given the specific values of all the samples
between it and the most recent sample.

        To make this concrete, let's consider a few specific

examples

        --------

        Example 1: X is a sequence of symbols "a" and "b", that

alternate “…ababab…”

        If one looks at an arbitrary moment, one has a 0.5 chance

of the next sample being a or b: U(X0) = 1 bit. If one has
seen one sample, which might have been either a or b, there
is no uncertainty about the next sample: U(X0|X1) = 0, which
gives M(X0:X1) = U(X0) = 1 bit.

        What then of M(X0:X2|X1)? Remember that M(A:B) cannot exceed

the lesser of U(A) and U(B). In this case, U(X2|X1) = 0
because if X1 was a X2 must have been b, and vice-versa. So
M(X0:X2|X1) = 0 and all subsequent members of the series are
also zero.

        One can extend this example to any series or waveform that

is known to be strictly periodic with a known period.

        -------

        Example 2: X is a sequence of symbols a, b, and c with the

property that a is followed equally probably by b or c, b by
c or a, and c by a or b. No symbol is ever repeated
immediately, but apart from that, each other symbol is
equally likely at any sample moment.

        U(X0) = log(3) = 1.58 bits

        U(X0|X1) = 1 bit, which gives

        M(X0:X1) = 0.58 bits

        It is irrelevant how X1 took on its value from X2. If x1 =

a, x2 could have equally well been b or c. Therefore

        M(X0:X2|X1) = 0

        --------

        Example 3. X is a sequence of five different symbols

labelled 1, 2, 3, 4, 5. Considering 5 and 1 to be
neighbours, each symbol can be followed by itself with
probability 0.5 or one of its neighbours with probability
0.25. Over the long term, each symbol appears with equal
probability.

        U(X0) = log(5) = 2.32 bits

        U(X0|X1) = -0.5*log(0.5) - 2*(0.25*log(0.25)) (because the

probability is 0.5 that its predecessor was itself and 0.25 that
it was either of its two neighbours)

                       = 0.5 + 1 = 1.5 bits

        M(X0:X1) = 2.32 - 1.5 = 0.82 bits
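
        As a check on those numbers, a small sketch (mine) using the
defining formula and the stated transition probabilities:

    import math

    def U(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    U_X0 = U([0.2] * 5)                   # log2(5) = 2.32 bits (uniform long-term distribution)
    U_X0_given_X1 = U([0.5, 0.25, 0.25])  # 1.5 bits, whatever the previous symbol was
    print(U_X0, U_X0_given_X1, U_X0 - U_X0_given_X1)   # 2.32  1.5  0.82 bits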

        M(X0:X2|X1) depends on the relationship between X2 and X0,

unlike the situation in example 2. There are five
possibilities for x0 (the value of X0). The probability of
these five depends on the relationship between x0 and x2. We
must consider each of these five, and partition M(X0:X1)
according to the probabilities for each value of X2. In a
formula:

        M(X0:X2|X1) = Sum_k(p(X2=xk)*M(X0:X2|X1=xk))

        Since the probability p(X2=xk) is the same for each value of

xk (all symbols are equally likely over the long term), this
sum comes down to

        M(X0:X2|X1) = (1/N)*Sum_k(M(X0:X2|X1=xk))

        We consider all these possibilities individually. Since all

the relationships are symmetric (meaning that it doesn’t
matter which value we select as the possible next sample,
x0), we need deal only with the relationships themselves.

        The sequence could be {x0, x0, x0}, {x0, neighbour, x0},

{x0, neighbour, neighbour} or {x0, neighbour,
non-neighbour}. Examples of each might be {3,3,3}, {3,4,3},
{3,4,4} and {3,4,5}, which would have probabilities 0.5*0.5
= 0.25, 2*0.25*0.25 = 0.125, 2*0.25*0.5 = 0.25, and
2*0.25*0.25 = 0.125 respectively.

        There is no need to do the actual calculations here. They

are easy but time-consuming. The point is that we can
calculate M(X0:X2), M(X0:X2|X1) and the rest directly from
the conditional probabilities.

        ---------

        Example 4: X is built by summing a succession of samples of

another system Y, such that x_j = x_(j-1) + y_j (x is the running
sum of the Y samples, starting from x0 = y0), where yj is a sequence of
positive and negative integer values, samples from a system
Y, that average zero. In this example, the numbering scheme
goes in the direction of time, later samples being labelled
with larger numbers. At the k’th sample the expression
M(X:Xhistory) becomes

        M(Xk:Yk|X_(k-1), X_(k-2), ... X(0))

        Each sample of Y contributes equally to the value of X, so

there is no need to consider the sequential order of the y
values. Assuming that Y is a series of numerical values and
the samples of Y are independent, the variance of X
increases linearly with the number of samples, so its
standard deviation increases proportionally to the square
root of that number.
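
        A tiny simulation sketch (mine, with Y taken as +1 or -1 equally
often) of that square-root growth in the spread of the sum:

    import random
    import statistics

    random.seed(1)

    def spread_after(k, trials=20000):
        # standard deviation of the sum of k independent samples of Y
        sums = [sum(random.choice((-1, 1)) for _ in range(k)) for _ in range(trials)]
        return statistics.pstdev(sums)

    for k in (1, 4, 16, 64):
        print(k, round(spread_after(k), 2))   # roughly 1, 2, 4, 8: proportional to sqrt(k)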

        Let us trace the contribution of the first few samples of Y

to the uncertainty of X.

        After one sample, y0, of Y, x0 will equal y0, which means

U(X0) = U(Y0).

        M(X0:Y0) = U(X) + U(Y) - U(X,Y) = U(X) = U(Y)

        After the next sample, U(X1) still is given by the formula

Sum(p log p), but the probabilities are no longer the same
as they were before the first sample. There is a wider range
of possibilities, from twice the minimum value of Y to twice
the maximum value of Y, and the probabilities of those
possibilities are not equal. Because the probability
distribution over the possible values of X is not uniform,
U(X) < 2*U(Y).

        As more and more samples are included in the Sum, the range

of X increases, but values near zero come to have higher and
higher relative probabilities. The distribution of
probabilities over the possible values of X approaches a
Gaussian with variance that increases linearly with k, so
U(X) approaches U(Y)*sqrt(k). The ratio U(X)/(k*U(Y))
decreases with increasing k, approaching 1/sqrt(k) as k
becomes large.

        We want to consider the contribution of the k'th sample of

Y. How much does the kth sample of Y reduce the uncertainty
of the k+1th value of the Sum, as compared to knowing only
the History up to the previous sample? That is the mutual
information between the kth sample of Y and the k+1th value
of X.

        For large k, (U(X_k) - U(X_(k-1)))/U(Y) approaches

sqrt(k)-sqrt(k-1), an increasingly small number. Yet it
remains true that U(X_k|History) - U(X_k|History, Yk) =
U(Y). The uncertainty of the value of X after the next
sample of Y is always the uncertainty of Y, because if you
know the current value of X, the next value is determined by
the value of the Y sample. How can these two statements be
reconciled? The key is the conditional of knowing the
History. Without the prior History (in this case the Sum),
the contribution of each successive Y sample to the
uncertainty of X becomes less and less, but if you do have
the prior History, the value of the next X sample will have
an uncertainty that is the uncertainty of the Y sample.

        The situation is the same if the system X is normalized,

meaning that its variance is independent of the number of
prior samples after a sufficient number of Y values have
occurred. The magnitude of the contribution of the k’th Y
sample is reduced proportionally to sqrt(k), but its
contribution to the various uncertainties is the same,
provided that the different possible values of X are
resolved.

        --------

        Example 5: The same as Example 4 except that the samples of

Y are not independent. If Y is a continuous waveform, the
samples are taken more closely than the Nyquist limit
(samples separated by 1/2W, where W is the bandwidth of the
waveform). If Y is a discrete set of possibilities,
successive samples are related by some kind of grammar. The
two kinds of possibility converge when the discrete set of
possibilities is linearly arranged and the grammar merely
makes transitions among nearby values more likely than
across distant values.

        The new issue here is that if the history of Y is known,

U(Yk) < U(Y). Therefore M(Xk:History(X), yk) < U(Y).
Each successive sample of Y reduces the uncertainty of X
less than would be the case if the samples of Y were
independent of each other. The difference is subsumed by the
fact that M(Xk:History(X)) is greater than it is in the case
when the Y samples are independent, by the same amount.

        ---------

        Example 6: System X is a function of the current and past

samples of system Y. We want to know the uncertainty of
sample xk when the values of successive Y samples are known
only up to sample y_(k-h). For example, in the grammar of
example 3, we may have observed the sequence 1,2,1,5,
(samples y1, y2, y3, y4) and want to know the uncertainty of
sample x6, which is a function of y1, y2, y3, y4, y5, and
y6. We do not yet know the values of y5 or y6. As a
real-world example, a trader in the days of sailing ships
might have known the market conditions in China and England
several months before his ship captain actually bargains for
tea in Canton, but the earlier conditions are the conditions
on which he must base his instructions to the captain.

        ==========End=========

[From Matti Kolu (2014.01.29.2220 CET)]

Martin Taylor 2014.01.25.12.21--

In a 1971 paper, J.G.Taylor discussed "axioms" of psychology, truths that
are taken for granted, and how they need to be changed in light of the
importance of feedback systems.

Martin,

what's the title of that paper? Do you know if it is included in his
book from the same year?

Matti

[Martin Taylor 2014.01.29.17.12]

J.G.Taylor's paper on axioms is "The role of axioms in psychological theory", Bull British Psychol Soc, 21, 1968, 221-227. I don't know why I said 1971. The usual explanation is "brain fart", but that's not very PC(T), is it.

The only book of his that I know about is "The Behavioral Basis of Perception", Yale U.P., 1962, and it doesn't consist of paper reprints. If you want to read the axioms paper and can't get it online, I could scan it.

While we are on the topic of JGT, you may be amused by the abstract of his "Experimental Design: A cloak for Intellectual Sterility", Brit. J. Psychol., 1958, 49, 106-116. That may well be available on line.

···

-------------
Lewin's (1931) warning against applying in psychology the Aristotelian notion that frequency of occurrence is equivalent to lawfulness fell on deaf ears. Today Aristotelian modes of thought are as prevalent as ever, and are backed up by an elaborate structure of statistical methods and rules for experimental design. By reason of their logical tidiness these methods have the unfortunate effect of reducing (to a level below the threshold for effective action) the discontent that should be engendered by the rotten state of current psychological theory. What psychology needs is not statistical laws but scientific laws, that is, statements, expressed in general terms, that can generate specific predictions when specific values are assigned to general terms. Such laws are unlikely to be found by statistical investigations, since they are concerned not with populations of systems but with single systems. Thus the law of gravitation is not a catalogue of the properties of solar systems, but it explains how any given system hangs together. ...

===== what follows refers to his behavioural basis of perception theory====

Sounds like Bill Powers, does it not?

Martin


[From Matti Kolu (2014.01.31.2005 CET)]

Martin Taylor 2014.01.29.17.12--

J.G.Taylor's paper on axioms is "The role of axioms in psychological
theory", Bull British Psychol Soc, 21, 1968, 221-227. I don't know why I
said 1971. The usual explanation is "brain fart", but that's not very PC(T),
is it.

The only book of his that I know about is "The Behavioral Basis of
Perception", Yale U.P., 1962, and it doesn't consist of paper reprints. If
you want to read the axioms paper and can't get it online, I could scan it.

The year led me astray. :-) I was looking at the 1971 "The Shape of
Minds to Come", written by the British physicist John G. Taylor.
Amazon has the book listed under "J.G Taylor".

I'd love to read the paper. If you've got time to scan it, I'd appreciate it.

Matti

Me too!

···



Dr Warren Mansell
Reader in Psychology
Cognitive Behavioural Therapist & Chartered Clinical Psychologist
School of Psychological Sciences
Coupland I
University of Manchester

Oxford Road
Manchester M13 9PL
Email: warren.mansell@manchester.ac.uk

Tel: +44 (0) 161 275 8589

Website: http://www.psych-sci.manchester.ac.uk/staff/131406

See teamstrial.net for further information on our trial of CBT for Bipolar Disorders in NW England

The highly acclaimed therapy manual on A Transdiagnostic Approach to CBT using Method of Levels is available now.

Check www.pctweb.org for further information on Perceptual Control Theory