Pitch control; explicit vs. implicit perceptions

[From Bill Powers (960510.2010)]
RE: pitch control

First off, let me apologize to the experts who are getting pretty ticked
off with my ignorance. I think that a large part of the mutual ticked-
off-ness can be traced to this simple misunderstanding:

     It's incredible to me that you don't listen at all to the purpose
     of the procedure that I was asked for and provided. Rick asked me
     for a controller for pitch, not for a neural account, and I gave
     him the simplest (digital) computational strategy I know of for
     computing this from a digitized signal. You're way, way off base if
     you think I am (or would ever, EVER) propose this as the way that
     the brain does it. Frankly, I would have expected a bit more
     reflection on your part.

You see, I thought you _were_ trying to propose a plausible neural
account of pitch perception and control. The entire picture changes if
it is understood that you are simply trying to _design_ a pitch-
perception system based on the principles you describe. If you're
designing a system, you are no longer constrained to operations that
would be plausible in a nervous system, but can make use of all known
mathematical operations and any circuit or program designs you can think
up. I don't put you down for doing this; the first thing we have to do
in building any model is to start with SOME way of accomplishing the
result, even though we can be pretty sure it won't be the actual way
used in the real system.

     It is YOU who are oversimplifying the problem of the central
     computation of pitch. Pitch is a "complex percept" -- the models I
     am talking about deal with the pitches produced by complex tones
     with many harmonics, and they can deal with pure tone pitches below
     a few kHz (although they do not account for subtle (1-5%),
     secondary and tertiary effects of pure tone pitches like binaural
     diplacusis and level-dependent pitch shifts).

I plead guilty to oversimplification. I haven't been thinking of complex
tones where the perceived pitch might be different from the fundamental
of the physical tone. I've run into only one example, which I described:
the problem a few subjects have in controlling a tone created by a
square-wave generator driving a PC loudspeaker. And I haven't
experimented with controlling tones that are very loud or very soft, so
I haven't seen the effects you talk about.

But back to our basic misunderstanding. You said to Rick,

     I'm not sure what more you'd want. I've given two ways of computing
     a signed, "continuous" error signal for adjusting the fundamental
     of a harmonic complex tone (that a human being would hear as its
     pitch) to match that of another (or of a pure tone, for that
     matter). If you invert this error signal and feed it to a voltage-
     controlled motor that slowly turns a knob that changes the
     fundamental frequency of the test signal, reference and test
     fundamentals (F0's) will converge.

You can see the problem we would have if we interpreted this as a design
for a neural controller. It's easy to say "invert this error signal,"
but the actual computations required to do this in real time, or at all,
would be quite complex; it would be hard to see how they would be
accomplished in neurons. But if you're just talking about a computer
program, there's no problem. I don't think that either Rick or I would
have doubted that you could _design_ a pitch control system in a
computer even if you restrict yourself to methods that employ interval
computations. The same thing could be done with a system that uses rate-
place information or other kinds, if you allow an equal degree of
complexity in the model.

At one point you say

     Clue: we're not at all talking about low-level neural processes
     here.

This depends on what you mean by "low level." Pitch perception has to be
pretty low in the hierarchy, because it is only one element of much more
complex perceptions, ranging from perceptions of words (particularly in
a tonal language like Chinese!) to perception of propositions to
perception of principles to perception of concepts like Self or
Mathematics.

     You should really look at what kinds of inputs and precise
     connections are necessary to do this in your rate-place model, and
     whether they can do this under different conditions, like at high
     levels.

Well, I doubt that I will be undertaking any research projects like
this, especially any on a par with yours, but I can still be permitted to
look for alternate explanations. In a rate-place model, there is really
no difficulty in dealing with some of the phenomena of misperception of
physical fundamental frequencies (which is the basic data you're talking
about). If, for example, the perceived pitch varies with loudness of a
standard-pitch input, all that's needed is the right nonlinear amplitude
function. The effect of harmonics on the perceived pitch might be
slightly more difficult, but I have no doubt that by postulating a
particular frequency-detector we could eventually reproduce the
phenomenon. This is just a matter of circuit design, and any competent
designer could reproduce just about any such distortion phenomenon. Of
course coming up with such a model would still only prove that a
workable design is possible; there's no proof that the nervous system
does it that way, either. My point is that given any kit of fundamental
operations that you deem permissible, you can probably construct a model
that will do the right things.

-------------------------------
     Nowhere in the auditory system has anyone seen any kind of code
     where firing rate is a monotonic function of stimulus frequency.

Now that's very interesting; I didn't know that. That, of course, rules
out any simple rate-place model for pitch perception. But it doesn't
automatically rule in an interval interpretation, either. For example,
it would be possible that pitches are detected in a way similar to the
detection of line orientations. In the visual detection of line
orientations, we find populations of neurons which respond maximally (in
rate of firing) for specific orientations, one neuron giving maximal
response in impulses per second only for one orientation, so you end up
with a vector map in a volume of the brain (Georgopoulos's stuff).
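
To make the idea concrete, here is a minimal population-vector sketch
(in modern numerical-Python notation; the cosine tuning curves, the
twelve preferred angles, and the peak rate are all arbitrary choices of
mine, loosely in the spirit of Georgopoulos's scheme). The same
rate-tuned-population trick could, in principle, serve for frequency
instead of orientation:

```python
import numpy as np

# Hypothetical population of direction-tuned units: each fires maximally
# at its preferred angle, with a broad half-rectified cosine tuning
# curve (an assumption, not measured data).
preferred = np.deg2rad(np.arange(0, 360, 30))      # 12 preferred angles

def rates(stimulus_deg, peak_rate=100.0):
    """Firing rates of the whole population for one stimulus angle."""
    s = np.deg2rad(stimulus_deg)
    return peak_rate * np.clip(np.cos(s - preferred), 0.0, None)

def population_vector(r):
    """Decode the angle as the rate-weighted sum of preferred vectors."""
    x = np.sum(r * np.cos(preferred))
    y = np.sum(r * np.sin(preferred))
    return np.rad2deg(np.arctan2(y, x)) % 360

print(population_vector(rates(75.0)))   # ~75: the 'vector map' readout
```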

Concerning your statement, I'm a little confused, however, because I
recall an article from rather a long time ago in which an electrode
placed somewhere in the auditory nerve produced a signal that could be
amplified and heard on a loudspeaker as the sounds entering the ear. Is
this a popular-science myth?

A few years ago there was an article in _Science_ about an experiment
involving a 64x64 array of electrodes placed in a visual nucleus. When a
point of light was moved across the animal's retina, a distribution of
firing rates shaped (roughly) like a narrow gaussian curve was seen to
move across the electrode array. If I recall correctly, this same
experiment required the animal to pick a target position, and the target
position showed up as another (stationary) peak at a specific location
in the array. But I no longer recall the article clearly enough to be
confident of the details and I've lost the reference. At any rate, the
implications were that position information is carried positionally in
the internal map. This, of course, makes it difficult to see how a
reference signal could be compared against a position, to yield the kind
of error signal needed for position control.

I mention these observations only to bring up the point that the basic
PCT model as we run it in a computer, which predicts very well, does not
necessarily resemble the way in which the various elements of a control
system in the brain -- perception, comparison, and action -- actually
work. All the model actually says is that there are neural
representations of a perceptual signal, a reference signal, and an error
signal, and that these signals can be represented as numbers that vary
in certain relations to each other and to the external world. What
aspect of the actual signals is represented by the numbers is not
specified. The method of coding does not have to be specified.
---------------------------
     Nobody has proposed a physiologically-plausible way in which the
     pitches of complex tones might be computed by the auditory system
     using rate-place patterns in auditory maps. The spectral pattern
     pitch models assume that one can 1) resolve all the peaks of the
     component frequencies 2) estimate the absolute frequencies involved
     with high accuracies and 3) do a harmonic analysis to infer the
     greatest common denominator of the frequency peaks, and do this
     over huge ranges in level, in noise, for auditory objects situated
     in various locations in auditory space, etc, etc etc. Even if one
     uses spectral "templates", these are complex, you'd need a hell-of-
     alot of them for all the relative level, location, and s/n
     conditions that are encountered.

All I can say is that I wouldn't propose that kind of spectral pattern
pitch model. A much simpler model that I've used just looks at zero
crossings; another one I've actually tried is a time-domain model in
which you get an oscilloscope-like picture of the signal envelope, which
can then be analyzed as a quasi-stationary pattern (in a perceptron-like
way). And an interesting-looking one starts with a phase-locked loop
with a phase-reference waveform not at the fundamental frequency but at
the highest possible frequency (like 20 kHz), with subharmonics being
generated by frequency division to produce the fundamental phase-
reference signal. This has the nice property of being able to generate
"DC" signals for formants using multiple synchronous detectors, which
look only for harmonics that are locked to the fundamental. It can also
track the formants over at least a 2:1 range of fundamental frequencies,
so the formant relationships become nearly independent of pitch. I
mention these ideas only to suggest that if one begins with other trains
of thought, other possible models suggest themselves. And of course the
mathematical treatment would be quite different for each such model.
Autocorrelations and cross-correlations are not the only possibilities,
although I'm sure that there would be mathematical similarities among
many of the possible models.
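
For the zero-crossing idea, a toy version (my own sketch, not the
actual program I used; the sample rate and test tone are arbitrary)
shows how little machinery it takes -- and also why it oversimplifies,
since strong harmonics that add extra crossings will fool it:

```python
import numpy as np

def zero_crossing_pitch(signal, fs):
    """Estimate pitch as the reciprocal of the mean spacing between
    positive-going zero crossings (toy model; easily fooled by complex
    tones whose fine structure crosses zero more than once per cycle)."""
    s = np.sign(signal)
    ups = np.where((s[:-1] <= 0) & (s[1:] > 0))[0]   # upward crossings
    if len(ups) < 2:
        return None
    return fs / np.mean(np.diff(ups))

fs = 16000
t = np.arange(0, 0.1, 1 / fs)
print(zero_crossing_pitch(np.sin(2 * np.pi * 220 * t), fs))   # ~220.0
```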

     The interval-based models have harmonic structure embedded within
     them, so that all of the complicated, precise estimation of
     component frequencies and inference of the fundamental is neurally
     taken care of in a very simple and elegant way.

Well, simplicity and elegance are what I admire, so perhaps I will have
to accept your model in the end.

One thing you would have to take into account is that in a model like
the PCT model, there are no "complex" signals. If a signal has a complex
component, then after passing through a perceptual function, only one
dimension of the signal would be represented as a higher-level
perception. Pitch would be extracted as one perceptual signal; timbre as
another (or several), and so forth, each being extracted by a different
input function. So instead of a single signal with a complex pattern, we
would end up with many different signals, each one being simple and
representing just one dimension in which the input pattern can change.
This is somewhat akin to your proposals for complex operations that
could, for example, isolate the fundamental of pitch. The output of such
an operation could then be a simple scalar signal which varies in
frequency as the perceived fundamental varies. More on this in my reply
to Martin Taylor, below.

Where you and Martin Taylor seem to have a basic difference with me is
over the necessity of specifically representing these aspects of the
patterns as simple signals. I'm sure you don't disagree that doing so
would be possible. But I claim it is necessary in order to account for
the qualities of subjective experience.
-----------------------------------------------------------------------
Martin Taylor 960510 13:30 --

     Why need there ever be an explicit perceiver of the magnitude of
     the colour vector, or of its phase, for colour to be a useful
     perception?

The explicit perceiver is a higher-level perceptual function, for which
color is one input. If a person has to point to "green" on a spectrum,
control is easiest if there is a single signal for green that can be
specified. It is even easier if there is a single signal the frequency
of which represents a position on a color continuum.

     Scalar control variables are easy to verbalize, and to discuss, but
     I see no need at all for any perception discussed by the analyst to
     exist as a separate scalar variable inside the control hierarchy
     being analyzed.

It's not the analyst I'm concerned with, but the higher systems that
will have to use the perception as a component of a higher-level
perception.

     If the _user_ of a neural network wants to perceive the network as
     having "seen" a pattern, then that _user_ will probably prefer the
     network to produce a single output of a high value for that
     pattern.

Yes. The "user" of a perceptual neural network in the brain is a
perceptual system of higher order in the same brain. As I point out
below, another user could be considered to be the comparator.

Even in doing electronic designs, I had never said to myself that any
function of multiple signals had to be physically implemented, with an
explicit physical result, before it had any real existence. I mean, what
electronics engineer says something like that? It goes without saying.

     Again, "real existence" to whom? I think to the designer, not to
     the circuit. It's so much easier to design when you can think about
     one signal in one place.

No, "real" to the circuit. Suppose you have an R.F. signal in which
there are implicitly a carrier frequency and two sideband envelopes. In
the sideband envelopes (once they are isolated by tuned circuits and
heterodyning) there are, implicitly, two audio signals. But before those
implicit audio signals can have the appropriate physical effects on the
rest of the circuits, one of them must be diode-detected and smoothed
and represented explicitly as a voltage varying at the audio frequency.
In that varying audio voltage, some sort of sounds are implicit,
but before those sounds can be produced, the voltages must be converted
into physical forces on a speaker cone. Implicit in those sounds might
be music or news, but before either music or news can become an explicit
perception, the sounds must be sensed and converted into neural signals,
and the neural signals must give rise to further neural signals until
the necessary level of perception is reached. It would not be sufficient
to present the R.F. signal to the final perceptual apparatus, even
though all the relevant information is implicit in that signal.
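
The point can be put in program form with a toy AM receiver (a sketch
under my own assumptions: the carrier frequency, modulation depth, and
the crude moving-average stand-in for an RC smoothing network are all
arbitrary). The audio is implicit in the R.F. waveform throughout, but
only after rectification and smoothing does it exist as an explicit
signal that anything downstream could use:

```python
import numpy as np

fs = 200_000                          # sample rate, Hz (assumed)
t = np.arange(0, 0.02, 1 / fs)
audio = np.sin(2 * np.pi * 440 * t)   # the 'implicit' audio signal
rf = (1 + 0.5 * audio) * np.cos(2 * np.pi * 20_000 * t)   # AM waveform

# Diode detection: half-wave rectify, then smooth with a moving-average
# low-pass filter (~1 ms window passes 440 Hz, nulls the 20 kHz carrier).
rectified = np.clip(rf, 0.0, None)
recovered = np.convolve(rectified, np.ones(200) / 200, mode="same")
recovered -= recovered.mean()         # drop the DC term; audio remains

# Only now is the audio an explicit, usable signal.
print(np.corrcoef(recovered[500:-500], audio[500:-500])[0, 1])  # ~1.0
```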
----------------------------------
I think that in the past few decades there has arisen a sort of
philosophical bias against the idea that there must be a perceiver of
perceptions. "Down with Dualism!" says the bumper sticker. The goal
seems to be to model signals in which certain kinds of information can
be shown to be present, even though that information is not converted
into any explicit form. Thus, Peter says that the harmonic structure of
a pitch is conveniently contained in the distribution and
autocorrelation of impulse intervals -- even though there is nothing
that is explicitly performing the autocorrelation and extracting and
representing the various dimensions of this distribution. The idea seems
to be that simply because the structure _could_ be extracted, that is
sufficient to account for perception of the various aspects of the
structure.

This is entirely consistent with the idea that perceptions do not have
to be made explicit, as signals. The mere presence of a pattern of
firings is enough for the pattern to become a perception. Only in this
way, where in effect the pattern is also the perception of the pattern,
can the Observer be done away with.

Of course the objective is to do away with the metaphysical Observer.
However, in a hierarchical model, we have a different non-metaphysical
observer, small-o, which is the set of all higher perceptual functions.
Now the idea of implicit patterns leads to a problem, because unless the
various attributes of the pattern are represented explicitly as signals,
there is no way to pass information about individual attributes to the
higher systems. All that could be passed upward would be the pattern
itself, unanalyzed.

But if you review Peter's recent post, you will see that he speaks of
doing explicit computations to extract from the pattern certain specific
attributes of it. The raw signal is passed through a tapped delay line,
each delayed signal is multiplied by the undelayed signal, and the
products are summed into an array of "values." Then, to pick out the
attribute called the "fundamental frequency," the highest peak in the
autocorrelation function (actually carried out by computing machinery)
is located, and is summed together with all its harmonics into -- what?
A final, explicit number that represents the fundamental frequency, or
the longest interval. A scalar value that stands for just one dimension
of the temporal pattern, the apparent pitch.
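
A minimal sketch of that computation as I read it (the sampling rate
and the lag search range are my assumptions, and I have left out the
final summing of the peak with its harmonics -- here the highest peak
alone is taken as the period):

```python
import numpy as np

def autocorr_pitch(signal, fs, fmin=50.0, fmax=500.0):
    """Tapped-delay-line autocorrelation: multiply the signal by delayed
    copies of itself, sum the products at each delay, and report the
    delay of the highest peak as the period -- one explicit scalar."""
    min_lag, max_lag = int(fs / fmax), int(fs / fmin)
    ac = np.array([np.sum(signal[lag:] * signal[:-lag])
                   for lag in range(1, max_lag + 1)])
    return fs / (min_lag + np.argmax(ac[min_lag - 1:]))

fs = 16000
t = np.arange(0, 0.05, 1 / fs)
# Complex tone: harmonics 2-4 of 200 Hz with the fundamental absent;
# the autocorrelation peak still falls at the 5 ms period.
tone = sum(np.sin(2 * np.pi * 200 * k * t) for k in (2, 3, 4))
print(autocorr_pitch(tone, fs))   # ~200.0
```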

[To extract a different attribute of the pattern, such as a measure of
timbre, a different computation would have to be applied, resulting in a
different number. There's no reason why several computations can't be
performed in parallel with the same pattern as input to each].

Now it becomes very simple to remember the pitch, to use a remembered
value as a reference signal, to compare a present pitch with the
reference pitch, and to generate an appropriate error signal. Only
scalar values have to be handled, once the attribute has been
computationally extracted. The reference signal no longer has to be
another temporal pattern. The comparison is ultra-simple, because it
consists of subtracting one scalar value from another. And the error
signal, being also a scalar value, can be converted simply into a set of
outputs that, via the world outside the control system, reduce the
error. All this can be done with simple rate-coded signals.
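
In program form the whole loop shrinks to a few lines once pitch is a
scalar (a sketch: the gain and the toy environment -- output turns a
knob, knob plus a disturbance sets the heard pitch -- are placeholders
of my own):

```python
reference = 440.0      # remembered pitch, replayed as a reference signal
knob = 0.0             # system output: position of a frequency knob
gain, dt = 5.0, 0.01

for step in range(500):
    disturbance = 30.0                        # someone detunes the tone
    perception = 200.0 + knob + disturbance   # scalar pitch percept, Hz
    error = reference - perception            # ultra-simple comparator
    knob += gain * error * dt                 # integrating output

print(round(perception, 1))   # 440.0: error is driven to zero
```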

And since pitch is now represented by a single scalar number, a simple
signal can carry a copy of the pitch signal to higher systems, separated
from timbre and other attributes of the pattern, where it joins with
other inputs to higher perceptual functions. Only the relevant feature
of the temporal pattern, not the whole pattern, is sent to the higher
systems.

Even if you don't believe in an Observer, or see awareness as any
particular problem, I think that in a hierarchical model you have to
admit that we have to consider small-o observers. And even at the level
of the original perceptions, we can see that recording and replay of
perceptual signals as reference signals, and the process of comparison,
are far simpler if there is an explicit computation of each aspect of
the input pattern that is to be separately controlled, so that after the
initial operations, only scalar values have to be handled.

In fact, we can now see that the interval-based computations that Peter
proposes can be treated as internal details of a perceptual function, or
actually a number of perceptual functions each extracting a measure of
some feature of the input pattern and producing a single scalar signal
representing the state of one attribute. There is no reason why these
output signals could not be rate-encoded, since each one represents the
final outcome of the kinds of computations that Peter proposes. And then
we would have what I think is required: one scalar signal per
perception, with the measure of that perception being represented by
firing frequency.

Tell me what's wrong with that.
-----------------------------------------------------------------------
Ed Ford May 10, 1996 --

Perry Good called me to read me that letter from Bill Glasser. Why am I
reminded of L. Ron Hubbard?

Tom Bourbon is reporting very favorably on what he has seen of your
school programs in action so far. Keep up the good work. See you at the
meeting in July.
-----------------------------------------------------------------------
Best to all,

Bill P.

[From Peter Cariani (960513.1100)]

[From Bill Powers (960510.2010)]
First off, let me apologize to the experts who are getting pretty ticked
off with my ignorance. You see, I thought you _were_ trying
to propose a plausible neural account of pitch perception and control.
The entire picture changes if it is understood that you are simply
trying to _design_ a pitch-perception system based on the principles you describe.

I accept your apology. It wasn't "experts and ignorance" that got me
ticked off, though -- I'd rather not comport myself as an "expert"
(although too often that's the only way I can get anyone to listen),
and I try to be very patient with what I perceive as "ignorance".
Just, please, try to step back a minute when you read something I've
said that doesn't seem to make immediate sense to you, and think about how I might
have meant it, and how what I've said could make sense. All I want
is a little "slack" (as the SubGenii are wont to say). On many levels,
I have scientific and intellectual values that are pretty close to
yours: I want to see explicit models proposed with all
definitions and procedures fully operationalized, I want to see
the physiological basis pinned down, and I reject "realist"
metaphysics in favor of an observer-based pragmatism.

On "low-level" neural processes and pitch:

This depends on what you mean by "low level." Pitch perception has to be
pretty low in the hierarchy, because it is only one element of much more
complex perceptions, ranging from perceptions of words (particularly in
a tonal language like Chinese!) to perception of propositions to
perception of principles to perception of concepts like Self or
Mathematics.

I know I said "low level", and I meant it in contrast to an element that
boils down the whole percept into one level (e.g. pitch), but the concepts
of low- vs. high-level perceptions bear some discussion.
This gets into a whole nest of issues about perception. The way most
machine perception people and neuroscientists think about it is that
there are various "feature detectors" in the ascending pathway, so that
the signals get simpler, but what they 'represent' gets more complex.
Although this idea of the "pontifical neuron" has been successfully
satirized by Jerry Lettvin's example of the "grandmother cell",
the notion of a feedforward hierarchy of processes
is still tacitly held by almost everybody. There are other general
classes of possibilities, though. It is possible that every station
has some access to an "analog" representation of the entire signal
(e.g. in the form of temporal spike patterns present in the single
neuron discharges of a population of cells), and that particular
high-order invariances are extracted when needed by particular
stations (like pitch or phonetic identity, or timbre, etc.) or when
new aspects of the signal become relevant (when "perceptual learning"
is needed). This has a Gibsonian ring to it (I don't have a sense of how you all
think of Gibson and the "direct perceptionists"), and also dovetails
in many ways with some of the Gestaltist notions (albeit no longer
conceived in terms of electrical fields and/or current flow, but in
terms of propagation of spikes through a structured set of connections).
If this is the case, then position in a processing hierarchy is not
the right concept -- which 'invariant' is being analyzed is the important
thing to watch.......

     You should really look at what kinds of inputs and precise
     connections are necessary to do this in your rate-place model, and
     whether they can do this under different conditions, like at high
     levels.

Well, I doubt that I will be undertaking any research projects like
this, especially any on a par with yours, but I can still be permitted to
look for alternate explanations. In a rate-place model, there is really
no difficulty in dealing with some of the phenomena of misperception of
physical fundamental frequencies (which is the basic data you're talking
about). If, for example, the perceived pitch varies with loudness of a
standard-pitch input, all that's needed is the right nonlinear amplitude
function. The effect of harmonics on the perceived pitch might be
slightly more difficult, but I have no doubt that by postulating a
particular frequency-detector we could eventually reproduce the
phenomenon. This is just a matter of circuit design, and any competent
designer could reproduce just about any such distortion phenomenon. Of
course coming up with such a model would still only prove that a
workable design is possible; there's no proof that the nervous system
does it that way, either. My point is that given any kit of fundamental
operations that you deem permissible, you can probably construct a model
that will do the right things.

Yes, I generally agree with this: engineers can often build effective
devices, although cultural and historical blinders often prevent
effective devices from being designed.
For example, I do think that a (maybe <the>) major impediment to
current automatic speech recognition performance is that front-ends based on
"spectrographic" representations don't do well in noise and/or when there
are competing sounds. I may be one of the few people on this earth who
think this, but I do believe that the auditory system uses a different coding
strategy and that there are large technical gains to be made from considering
how the auditory system might do it. Oded Ghitza of Bell Labs (or whatever
its name is nowadays) used a simulated auditory nerve front-end that
computed intervals between threshold-crossings, and with his simple
front-end, he immediately outperformed spectrographic front-ends in
the presence of noise. He's working on other "higher-level" problems now,
but I think that these kinds of front-ends still have a good deal of
room for improvement (more work should be done on them than is being
done -- the speech recognition community is much more conservative than
they should be).
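
A toy illustration of the flavor of such a front-end (this is NOT
Ghitza's actual model, which used a simulated auditory-nerve filter
bank; the single channel, the threshold levels, and the sub-millisecond
"chatter" cutoff here are my own simplifications):

```python
import numpy as np

def crossing_intervals(x, fs, levels=(0.0, 0.25, 0.5)):
    """Collect intervals between successive upward crossings of several
    threshold levels. A histogram of such intervals degrades far more
    gracefully in noise than a short-term magnitude spectrum."""
    intervals = []
    for level in levels:
        above = x > level
        ups = np.where(~above[:-1] & above[1:])[0] / fs
        intervals.extend(np.diff(ups))
    iv = np.array(intervals)
    return iv[iv > 0.001]        # drop sub-ms chatter caused by noise

rng = np.random.default_rng(0)
fs = 16000
t = np.arange(0, 0.1, 1 / fs)
noisy = np.sin(2 * np.pi * 150 * t) + 0.3 * rng.standard_normal(t.size)
print(1.0 / np.median(crossing_intervals(noisy, fs)))   # ~150 Hz
```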

-------------------------------
     Nowhere in the auditory system has anyone seen any kind of code
     where firing rate is a monotonic function of stimulus frequency.

Now that's very interesting; I didn't know that. That, of course, rules
out any simple rate-place model for pitch perception. But it doesn't
automatically rule in an interval interpretation, either. For example,
it would be possible that pitches are detected in a way similar to the
detection of line orientations. In the visual detection of line
orientations, we find populations of neurons which respond maximally (in
rate of firing) for specific orientations, one neuron giving maximal
response in impulses per second only for one orientation, so you end up
with a vector map in a volume of the brain (Georgopoulos's stuff).

We probably need to do some calibration of what the term "rate-place" code
means. You were discussing the output of a pitch detector in which the
rate of firing was monotonically related to pitch frequency (say, 50
spikes for 100 Hz, 100 spikes for 1000 Hz, up to 200 spikes for
10 kHz). I don't know of any units like that anywhere. There are, of course,
many units in all auditory stations that are "frequency-tuned" in the
sense that, for a given sound pressure level, they fire maximally at
a particular frequency (so this is a non-monotonic rate-frequency
curve). A "rate-place" code assumes that some central processor analyzes
the rates in each "best frequency" channel (like in a spectrum analyzer
that one has with some stereos or equalizers), and deduces properties
of the stimulus on this basis. This is the kind of representation that
most auditory and speech people assume, but when you really get into
the physiology with a halfway critical eye (i.e. way beyond the textbooks),
there are some big, big problems. "Best-frequency" changes with level,
and worst of all, most discharge rates saturate at moderate-to-high levels,
so that "spectral contrast" in such representations would be very likely
to be degraded considerably (in contrast our perceptions stay rock-steady
and just as good at higher levels as at lower ones). The upshot of this
is that we really have no viable theory of how the auditory cortex
"works" in a signal-processing sense. [I was also recently told by someone
in motor neurocomputation that the Georgopoulos story may not be as simple
as it's made out to be -- there are competing hypotheses that involve
different coordinate systems.....again, just be very, very wary of the
textbook examples.......]
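
Here is the saturation problem in cartoon form (every parameter --
tuning width, saturation constant, channel spacing -- is invented,
purely to show the shape of the problem, not to model real fibers):

```python
import numpy as np

# Cartoon rate-place channels: Gaussian tuning on a log-frequency axis
# feeding a saturating rate-level function (rates cap near 200/s).
best_freqs = np.geomspace(100, 8000, 30)

def channel_rates(stim_freq, level_db):
    octaves = np.log2(best_freqs / stim_freq)
    drive = level_db * np.exp(-0.5 * (octaves / 0.3) ** 2)
    return 200.0 * drive / (drive + 30.0)

for level in (20, 50, 80):
    profile = channel_rates(1000.0, level)
    print(level, "dB:", np.round(profile / profile.max(), 2)[12:18])
# The normalized rate profile around the 1 kHz channel flattens as the
# level rises and channels saturate -- yet heard pitch stays put.
```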

Concerning your statement, I'm a little confused, however, because I
recall an article from rather a long time ago in which an electrode
placed somewhere in the auditory nerve produced a signal that could be
amplified and heard on a loudspeaker as the sounds entering the ear. Is
this a popular-science myth?

No myth. If you put an electrode near the round window of the cochlea,
you can hear a "cochlear microphonic", which is the summed electrical
activity of cochlear hair cells. The hair cells follow the stimulus
(after it has been band-pass filtered by the mechanical properties
of the cochlea), and (up to 4-5 kHz) "phase-lock" to the filtered
stimulus waveform. The result is that there is a great deal of
time structure in the responses, even on a population level, and it
is this time structure that allows one to recognize the form of the
stimulus, played back through the cochlear hair cells and the electrode.

A few years ago there was an article in _Science_ about an experiment
involving a 64x64 array of electrodes placed in a visual nucleus. When a
point of light was moved across the animal's retina, a distribution of
firing rates shaped (roughly) like a narrow gaussian curve was seen to
move across the electrode array. If I recall correctly, this same
experiment required the animal to pick a target position, and the target
position showed up as another (stationary) peak at a specific location
in the array. But I no longer recall the article clearly enough to be
confident of the details and I've lost the reference. At any rate, the
implications were that position information is carried positionally in
the internal map. This, of course, makes it difficult to see how a
reference signal could be compared against a position, to yield the kind
of error signal needed for position control.

I'm sure that the visual system is organized retinotopically; what gives me
pause is the assumption that it is firing rates in each of the retinotopic
channels (e.g. over 100 msec) that constitute the visual signal. When you
look at what happens when an edge crosses the spatial receptive field of
many visual neurons, there is an elevated discharge rate, yes, but the
spikes also occur during the transient event, such that they "phase-lock"
to the visual stimulus as it drifts across the retina (and our eyes are
always in motion, even when we fixate). This timing of spikes (1 msec jitter)
is much more precise than information from firing rates and it is the basis of
Reichardt's work on motion detection in the fly. The point here is that
there is all this high-quality spatio-temporal correlation structure that
the vision people don't generally think about, and it's possible that
this is the basis of form vision. In my spare time (such as it is) I've
been looking for evidence to falsify this notion, but I have yet to come
up with any that rules it out of hand. On the other hand, there are many
visual phenomena that are entirely compatible with the idea (like the
Pulfrich effect -- an interocular time delay creates a predictable
depth illusion). I wish I had more confidence in our current explanations
for these things, but I fear we are as in the dark about a physiological
theory of visual form perception as we are about auditory form perception.
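
The delay-and-multiply scheme is easy to state in program form (a
bare-bones caricature of a Reichardt correlator; the delay, stimulus,
and sensor spacing are arbitrary, and real fly circuitry is far more
elaborate):

```python
import numpy as np

def reichardt(left, right, delay):
    """Delay each input, multiply by the opposite undelayed input, and
    subtract: positive output signals motion from 'left' to 'right',
    negative output signals the reverse."""
    return np.mean(np.roll(left, delay) * right
                   - np.roll(right, delay) * left)

fs = 1000
t = np.arange(0, 1, 1 / fs)
stim = np.sin(2 * np.pi * 5 * t)
delayed = np.roll(stim, 20)            # second sensor sees it 20 ms later
print(reichardt(stim, delayed, 20))    # > 0: rightward motion
print(reichardt(delayed, stim, 20))    # < 0: leftward motion
```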

I mention these observations only to bring up the point that the basic
PCT model as we run it in a computer, which predicts very well, does not
necessarily resemble the way in which the various elements of a control
system in the brain -- perception, comparison, and action -- actually
work. All the model actually says is that there are neural
representations of a perceptual signal, a reference signal, and an error
signal, and that these signals can be represented as numbers that vary
in certain relations to each other and to the external world. What
aspect of the actual signals is represented by the numbers is not
specified. The method of coding does not have to be specified.

Yes, I agree, although if it turns out that many percepts are temporally
coded, I think this could possibly give us insights into new ways that
control systems could be implemented.

     Nobody has proposed a physiologically-plausible way in which the
     pitches of complex tones might be computed by the auditory system
     using rate-place patterns in auditory maps. The spectral pattern
     pitch models assume that one can 1) resolve all the peaks of the
     component frequencies 2) estimate the absolute frequencies involved
     with high accuracies and 3) do a harmonic analysis to infer the
     greatest common denominator of the frequency peaks, and do this
     over huge ranges in level, in noise, for auditory objects situated
     in various locations in auditory space, etc, etc etc. Even if one
     uses spectral "templates", these are complex, you'd need a hell-of-
     alot of them for all the relative level, location, and s/n
     conditions that are encountered.

All I can say is that I wouldn't propose that kind of spectral pattern
pitch model. A much simpler model that I've used just looks at zero
crossings; another one I've actually tried is a time-domain model in
which you get an oscilloscope-like picture of the signal envelope, which
can then be analyzed as a quasi-stationary pattern (in a perceptron-like
way).

The envelope approach doesn't explain human pitch perception (Schouten and
de Boer in the 40's, 50's and 60's effectively falsified these mechanisms
using inharmonic AM tones), but if it works for what you want it for, so
much the better.
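
The textbook numbers, for what it's worth (the fc/n rule below is only
de Boer's first-order approximation to the observed shifts):

```python
# Shift the carrier of an AM tone (modulated at 200 Hz) from 2000 Hz up
# to 2040 Hz: the components move to 1840, 2040, 2240 Hz. The envelope
# still beats at exactly 200 Hz, so any envelope-rate model predicts no
# pitch change. Listeners instead hear roughly the pseudo-period of the
# fine structure: fc divided by the nearest whole number of carrier
# cycles per modulation period (de Boer's first-order rule).
fm, fc = 200.0, 2040.0
n = round(fc / fm)                  # 10 carrier cycles per envelope beat
print(fm, "Hz envelope vs", fc / n, "Hz heard")   # 200.0 vs 204.0
```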

And an interesting-looking one starts with a phase-locked loop
with a phase-reference waveform not at the fundamental frequency but at
the highest possible frequency (like 20 kHz), with subharmonics being
generated by frequency division to produce the fundamental phase-
reference signal. This has the nice property of being able to generate
"DC" signals for formants using multiple synchronous detectors, which
look only for harmonics that are locked to the fundamental. It can also
track the formants over at least a 2:1 range of fundamental frequencies,
so the formant relationships become nearly independent of pitch. I
mention these ideas only to suggest that if one begins with other trains
of thought, other possible models suggest themselves. And of course the
mathematical treatment would be quite different for each such model.
Autocorrelations and cross-correlations are not the only possibilities,
although I'm sure that there would be mathematical similarities among
many of the possible models.

Yes, all models we can imagine should be on the table for consideration.
The more the merrier. Out of many, something that works.

Powers, responding to Martin Taylor:

I think that in the past few decades there has arisen a sort of
philosophical bias against the idea that there must be a perceiver of
perceptions. "Down with Dualism!" says the bumper sticker. The goal
seems to be to model signals in which certain kinds of information can
be shown to be present, even though that information is not converted
into any explicit form.

I'm down on Dualism because of the Cartesian notion of separate "substances".
Here, also, the population-interval distributions could be the explicit
signal. Relatively many 10 msec intervals in a particular population
might well be the "representation for" a particular pitch. I agree,
however, that not all sensory signals make it into our conscious
awareness.
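
For instance, a toy version of a population-interval readout (the
firing probabilities, jitter, fiber count, and bin width are all made
up; a real auditory-nerve simulation is far richer):

```python
import numpy as np

rng = np.random.default_rng(0)
period = 0.010                  # a 100 Hz stimulus: 10 ms cycle
intervals = []
for fiber in range(50):         # small population of model fibers
    # each fiber fires on a random ~40% of cycles, phase-locked to the
    # stimulus with ~0.2 ms jitter (invented numbers)
    fired = np.where(rng.random(100) < 0.4)[0]
    times = fired * period + rng.normal(0.0, 0.0002, fired.size)
    intervals.extend(np.diff(times))

# Pool first-order intervals across the population; the modal bin sits
# at the stimulus period -- 'relatively many 10 ms intervals'.
hist, edges = np.histogram(intervals, bins=np.arange(0.0005, 0.05, 0.001))
print(1.0 / (edges[np.argmax(hist)] + 0.0005))    # ~100 Hz
```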

Thus, Peter says that the harmonic structure of
a pitch is conveniently contained in the distribution and
autocorrelation of impulse intervals -- even though there is nothing
that is explicitly performing the autocorrelation and extracting and
representing the various dimensions of this distribution. The idea seems
to be that simply because the structure _could_ be extracted, that is
sufficient to account for perception of the various aspects of the
structure.

Actually, I think that we perceive pitch because of the nature of the
neural representations and architectures involved, not because pitch
per se (and octave recognition, and chords and all of the other
complex things we can do) has a selective value. The auditory system,
I think, like vision, has evolved as a general-purpose system, but the
basic information-processing strategies that it uses confer upon
it particular properties. Problems of consciousness aside (which are
really metaphysical problems), when one finds very strong correspondences
between patterns of neural activity and patterns of perceptual judgments
(observed from the outside, by psychophysicists), and these cannot be
explained in any other way, one is led to believe that these patterns
of activity have something to do with the perception (and may even
constitute the "neural code" underlying the perception).

This is entirely consistent with the idea that perceptions do not have
to be made explicit, as signals. The mere presence of a pattern of
firings is enough for the pattern to become a perception. Only in this
way, where in effect the pattern is also the perception of the pattern,
can the Observer be done away with.
Of course the objective is to do away with the metaphysical Observer.

I'm not a death-to-the-individual, death-to-the-Observer post-modernist,
neo-Hegelian kind-of-guy. Far from it. Normally (like you)
I would say that a receiver is needed to "interpret" a neural code, and
this is fine for describing the workings of an information-processing
system. I'm comfortable in thinking about neural assemblies as
"observers" (or monads, if you like), but, as I said before,
I don't necessarily think that being an "observer" means being "consciously
aware" -- maybe my ideas will change, I don't know. I'm also not doing
away with the "metaphysical observer" if I postulate that the existence of
such an observer or of "conscious awareness" entails a particular
kind of organization. I'm just proposing accounts of observers and
conscious awarenesses that are different from the "point-source receiver" model.

However, in a hierarchical model, we have a different non-metaphysical
observer, small-o, which is the set of all higher perceptual functions.
Now the idea of implicit patterns leads to a problem, because unless the
various attributes of the pattern are represented explicitly as signals,
there is no way to pass information about individual attributes to the
higher systems. All that could be passed upward would be the pattern
itself, unanalyzed.

There are alternatives to the strict hierarchical model where the higher
centers only get output signals from the lower ones. Imagine a human hierarchy
where "lower-level" functionaries get the primary signals, compile a
report and send the report AND the primary signals upstairs for further
analysis and decision-making. If the middle-level executives have no other
concerns and the report looks consistent with what they expect, they go
with the report, but if there is competing input or confusion, they may
also do more data analysis or send out the primary signal + all of the
accumulated reports to other consultants for evaluation. They might just
sit tight and wait for more reports to come in.......... There are ways
to think this through -- I'm trying, but there are precious few ideas
about these alternatives that have been articulated in the literature.

But if you review Peter's recent post, you will see that he speaks of
doing explicit computations to extract from the pattern certain specific
attributes of it. The raw signal is passed through a tapped delay line,
each delayed signal is multiplied by the undelayed signal, and the
products are summed into an array of "values." Then, to pick out the
attribute called the "fundamental frequency," the highest peak in the
autocorrelation function (actually carried out by computing machinery)
is located, and is summed together with all its harmonics into -- what?
A final, explicit number that represents the fundamental frequency, or
the longest interval. A scalar value that stands for just one dimension
of the temporal pattern, the apparent pitch.

[To extract a different attribute of the pattern, such as a measure of
timbre, a different computation would have to be applied, resulting in a
different number. There's no reason why several computations can't be
performed in parallel with the same pattern as input to each].

Now it becomes very simple to remember the pitch, to use a remembered
value as a reference signal, to compare a present pitch with the
reference pitch, and to generate an appropriate error signal. Only
scalar values have to be handled, once the attribute has been
computationally extracted. The reference signal no longer has to be
another temporal pattern. The comparison is ultra-simple, because it
consists of subtracting one scalar value from another. And the error
signal, being also a scalar value, can be converted simply into a set of
outputs that, via the world outside the control system, reduce the
error. All this can be done with simple rate-coded signals.

And since pitch is now represented by a single scalar number, a simple
signal can carry a copy of the pitch signal to higher systems, separated
from timbre and other attributes of the pattern, where it joins with
other inputs to higher perceptual functions. Only the relevant feature
of the temporal pattern, not the whole pattern, is sent to the higher
systems.

Yes, I made it simple that way because I was trying to come up with
an example that dovetailed in the most direct way with the PCT structures
that we all know and love. We need to keep the other ideas in mind,
though, when we are trying to understand how the brain might work
and/or we are trying to invent new technologies.

Even if you don't believe in an Observer, or see awareness as any
particular problem, I think that in a hierarchical model you have to
admit that we have to consider small-o observers. And even at the level
of the original perceptions, we can see that recording and replay of
perceptual signals as reference signals, and the process of comparison,
are far simpler if there is an explicit computation of each aspect of
the input pattern that is to be separately controlled, so that after the
initial operations, only scalar values have to be handled.

Simplicity is in the mind of the beholder. Also, what is simple for one percept
may not be so simple when one has 10 or 100 simultaneous ones. One of the
problems I have with scalar codes is that they result in "combinatorial
explosion" when one tries to deal with all of the combinations of
explicit scalar percepts that are possible.
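
In numbers (k and n made up, purely to show the scaling):

```python
# n explicit scalar percepts with k distinguishable values each yield
# k**n joint patterns. The scalar channels grow linearly with n, but if
# every *combination* also needed its own explicit detector, the count
# explodes.
k = 10
for n in (1, 10, 100):
    print(n, "percepts ->", n, "channels, but", k ** n, "combinations")
```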

In fact, we can now see that the interval-based computations that Peter
proposes can be treated as internal details of a perceptual function, or
actually a number of perceptual functions each extracting a measure of
some feature of the input pattern and producing a single scalar signal
representing the state of one attribute. There is no reason why these
output signals could not be rate-encoded, since each one represents the
final outcome of the kinds of computations that Peter proposes. And then
we would have what I think is required: one scalar signal per
perception, with the measure of that perception being represented by
firing frequency.
Tell me what's wrong with that.

Yes, one can do it, certainly, but is this the way it's done in the brain?
We don't know, one way or the other, but at this point we need as many
alternative hypotheses as we can get.

Peter Cariani