[From Bill Powers (960510.2010)]
RE: pitch control
First off, let me apologize to the experts who are getting pretty ticked
off with my ignorance. I think that a large part of the mutual ticked-
off-ness can be traced to this simple misunderstanding:
It's incredible to me that you don't listen at all to the purpose
of the procedure that I was asked for and provided. Rick asked me
for a controller for pitch, not for a neural account, and I gave
him the simplest (digital) computational strategy I know of for
computing this from digitized signal. You're way, way off base if
you think I am (or would ever, EVER) propose this as the way that
the brain does it. Frankly, I would have expected a bit more
reflection on your part.
You see, I thought you _were_ trying to propose a plausible neural
account of pitch perception and control. The entire picture changes if
it is understood that you are simply trying to _design_ a pitch-
perception system based on the principles you describe. If you're
designing a system, you are no longer constrained to operations that
would be plausible in a nervous system, but can make use of all known
mathematical operations and any circuit or program designs you can think
up. I don't put you down for doing this; the first thing we have to do
in building any model is to start with SOME way of accomplishing the
result, even though we can be pretty sure it won't be the actual way
used in the real system.
It is YOU who are oversimplifying the problem of the central
computation of pitch. Pitch is a "complex percept" -- the models I
am talking about deal with the pitches produced by complex tones
with many harmonics, and they can deal with pure tone pitches below
a few kHz (although they do not account for subtle (1-5%),
secondary and tertiary effects of pure tone pitches like binaural
diplacusis and level-dependent pitch shifts).
I plead guilty to oversimplification. I haven't been thinking of complex
tones where the perceived pitch might be different from the fundamental
of the physical tone. I've run into only one example, which I described;
the problem a few subjects have in controlling a tone created by a
square-wave generator driving a PC loudspeaker. And I haven't
experimented with controlling tones that are very loud or very soft, so
I haven't seen the effects you talk about.
But back to our basic misunderstanding. You said to Rick,
I'm not sure what more you'd want. I've given two ways of computing
a signed, "continuous" error signal for adjusting the fundamental
of a harmonic complex tone (that a human being would hear as its
pitch) to match that of another (or of a pure tone, for that
matter). If you invert this error signal and feed it to a voltage-
controlled motor that slowly turns a knob that changes the
fundamental frequency of the test signal, reference and test
fundamentals (F0's) will converge.
You can see the problem we would have if we interpreted this as a design
for a neural controller. It's easy to say "invert this error signal,"
but the actual computations required to do this in real time, or at all,
would be quite complex; it would be hard to see how they would be
accomplished in neurons. But if you're just talking about a computer
program, there's no problem. I don't think that either Rick or I would
have doubted that you could _design_ a pitch control system in a
computer even if you restrict yourself to methods that employ interval
computations. The same thing could be done with a system that uses rate-
place information or other kinds, if you allow an equal degree of
complexity in the model.
At one point you say
Clue: we're not at all talking about low-level neural processes
here.
This depends on what you mean by "low level." Pitch perception has to be
pretty low in the hierarchy, because it is only one element of much more
complex perceptions, ranging from perceptions of words (particularly in
a tonal language like Chinese!) to perception of propositions to
perception of principles to perception of concepts like Self or
Mathematics.
You should really look at what kinds of inputs and precise
connections are necessary to do this in your rate-place model, and
whether they can do this under different conditions, like at high
levels.
Well, I doubt that I will be undertaking any research projects like
this, especially on a par with yours, but I can still be permitted to
look for alternate explanations. In a rate-place model, there is really
no difficulty in dealing with some of the phenomena of misperception of
physical fundamental frequencies (which is the basic data you're talking
about). If, for example, the perceived pitch varies with loudness of a
standard-pitch input, all that's needed is the right nonlinear amplitude
function. The effect of harmonics on the perceived pitch might be
slightly more difficult, but I have no doubt that by postulating a
particular frequency-detector we could eventually reproduce the
phenomenon. This is just a matter of circuit design, and any competent
designer could reproduce just about any such distortion phenomenon. Of
course coming up with such a model would still only prove that a
workable design is possible; there's no proof that the nervous system
does it that way, either. My point is that given any kit of fundamental
operations that you deem permissible, you can probably construct a model
that will do the right things.
-------------------------------
Nowhere in the auditory system has anyone seen any kind of code
where firing rate is a monotonic function of stimulus frequency.
Now that's very interesting; I didn't know that. That, of course, rules
out any simple rate-place model for pitch perception. But it doesn't
automatically rule in an interval interpretation, either. For example,
it would be possible that pitches are detected in a way similar to the
detection of line orientations. In the visual detection of line
orientations, we find populations of neurons which respond maximally (in
rate of firing) for specific orientations, one neuron giving maximal
response in impulses per second only for one orientation, so you end up
with a vector map in a volume of the brain (Georgopoulos's stuff).
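The population-coding alternative sketched here can be made concrete. A minimal sketch (mine, in Python; the cosine tuning, unit count, and rates are illustrative assumptions in the style of Georgopoulos's population vector, not a claim about audition):

```python
import math

# Sketch of a population-vector readout: each unit fires maximally at its
# preferred direction, and the represented value is decoded from the
# rate-weighted vector sum across the population.

def decode_population(stimulus_deg, n_units=16, peak_rate=50.0):
    """Decode a direction from cosine-tuned, rectified firing rates."""
    x = y = 0.0
    for i in range(n_units):
        pref = 360.0 * i / n_units                   # preferred direction
        diff = math.radians(stimulus_deg - pref)
        rate = max(0.0, peak_rate * math.cos(diff))  # cosine tuning, rectified
        x += rate * math.cos(math.radians(pref))     # weight each unit's
        y += rate * math.sin(math.radians(pref))     # preferred vector by rate
    return math.degrees(math.atan2(y, x)) % 360.0

assert abs(decode_population(137.0) - 137.0) < 1.0
```

Note that the decoded value is a single scalar even though no single unit's rate is a monotonic function of the stimulus -- which is why such a scheme is not ruled out by the observation quoted above.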
Concerning your statement, I'm a little confused, however, because I
recall an article from rather a long time ago in which an electrode
placed somewhere in the auditory nerve produced a signal that could be
amplified and heard on a loudspeaker as the sounds entering the ear. Is
this a popular-science myth?
A few years ago there was an article in _Science_ about an experiment
involving a 64x64 array of electrodes placed in a visual nucleus. When a
point of light was moved across the animal's retina, a distribution of
firing rates shaped (roughly) like a narrow gaussian curve was seen to
move across the electrode array. If I recall correctly, this same
experiment required the animal to pick a target position, and the target
position showed up as another (stationary) peak at a specific location
in the array. But I no longer recall the article clearly enough to be
confident of the details and I've lost the reference. At any rate, the
implications were that position information is carried positionally in
the internal map. This, of course, makes it difficult to see how a
reference signal could be compared against a position, to yield the kind
of error signal needed for position control.
I mention these observations only to bring up the point that the basic
PCT model as we run it in a computer, which predicts very well, does not
necessarily resemble the way in which the various elements of a control
system in the brain -- perception, comparison, and action -- actually
work. All the model actually says is that there are neural
representations of a perceptual signal, a reference signal, and an error
signal, and that these signals can be represented as numbers that vary
in certain relations to each other and to the external world. What
aspect of the actual signals is represented by the numbers is not
specified. The method of coding does not have to be specified.
---------------------------
Nobody has proposed a physiologically-plausible way in which the
pitches of complex tones might be computed by the auditory system
using rate-place patterns in auditory maps. The spectral pattern
pitch models assume that one can 1) resolve all the peaks of the
component frequencies 2) estimate the absolute frequencies involved
with high accuracies and 3) do a harmonic analysis to infer the
greatest common denominator of the frequency peaks, and do this
over huge ranges in level, in noise, for auditory objects situated
in various locations in auditory space, etc, etc etc. Even if one
uses spectral "templates", these are complex; you'd need a hell of
a lot of them for all the relative level, location, and s/n
conditions that are encountered.
All I can say is that I wouldn't propose that kind of spectral pattern
pitch model. A much simpler model that I've used just looks at zero
crossings; another one I've actually tried is a time-domain model in
which you get an oscilloscope-like picture of the signal envelope, which
can then be analyzed as a quasi-stationary pattern (in a perceptron-like
way). And an interesting-looking one starts with a phase-locked loop
with a phase-reference waveform not at the fundamental frequency but at
the highest possible frequency (like 20 kHz), with subharmonics being
generated by frequency division to produce the fundamental phase-
reference signal. This has the nice property of being able to generate
"DC" signals for formants using multiple synchronous detectors, which
look only for harmonics that are locked to the fundamental. It can also
track the formants over at least a 2:1 range of fundamental frequencies,
so the formant relationships become nearly independent of pitch. I
mention these ideas only to suggest that if one begins with other trains
of thought, other possible models suggest themselves. And of course the
mathematical treatment would be quite different for each such model.
Autocorrelations and cross-correlations are not the only possibilities,
although I'm sure that there would be mathematical similarities among
many of the possible models.
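The first of these alternatives -- the zero-crossing model -- is simple enough to sketch directly. A minimal sketch (mine, in Python; the sample rate and test tone are illustrative, and this only works for waveforms that cross zero once per period):

```python
import math

# Sketch of the zero-crossing pitch model: estimate the fundamental of a
# sampled waveform from the mean spacing of its positive-going zero
# crossings.

def zero_crossing_pitch(samples, sample_rate):
    """Estimate frequency from the mean interval between rising crossings."""
    crossings = [i for i in range(1, len(samples))
                 if samples[i - 1] < 0.0 <= samples[i]]
    if len(crossings) < 2:
        return 0.0
    mean_interval = (crossings[-1] - crossings[0]) / (len(crossings) - 1)
    return sample_rate / mean_interval

rate = 8000
tone = [math.sin(2 * math.pi * 220.0 * n / rate) for n in range(rate)]
assert abs(zero_crossing_pitch(tone, rate) - 220.0) < 1.0
```

Averaging over many crossings, rather than timing a single one, is what keeps the estimate accurate despite the coarse sampling grid.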
The interval-based models have harmonic structure embedded within
them, so that all of the complicated, precise estimation of
component frequencies and inference of the fundamental is neurally
taken care of in a very simple and elegant way.
Well, simplicity and elegance are what I admire, so perhaps I will have
to accept your model in the end.
One thing you would have to take into account is that in a model like
the PCT model, there are no "complex" signals. If a signal has a complex
component, then after passing through a perceptual function, only one
dimension of the signal would be represented as a higher-level
perception. Pitch would be extracted as one perceptual signal; timbre as
another (or several), and so forth, each being extracted by a different
input function. So instead of a single signal with a complex pattern, we
would end up with many different signals, each one being simple and
representing just one dimension in which the input pattern can change.
This is somewhat akin to your proposals for complex operations that
could, for example, isolate the fundamental of a complex tone. The output of such
an operation could then be a simple scalar signal which varies in
frequency as the perceived fundamental varies. More on this in my reply
to Martin Taylor, below.
Where you and Martin Taylor seem to have a basic difference with me is
over the necessity of specifically representing these aspects of the
patterns as simple signals. I'm sure you don't disagree that doing so
would be possible. But I claim it is necessary in order to account for
the qualities of subjective experience.
-----------------------------------------------------------------------
Martin Taylor 960510 13:30 --
Why need there ever be an explicit perceiver of the magnitude of
the colour vector, or of its phase, for colour to be a useful
perception?
The explicit perceiver is a higher-level perceptual function, for which
color is one input. If a person has to point to "green" on a spectrum,
control is easiest if there is a single signal for green that can be
specified. It is even easier if there is a single signal the frequency
of which represents a position on a color continuum.
Scalar control variables are easy to verbalize, and to discuss, but
I see no need at all for any perception discussed by the analyst to
exist as a separate scalar variable inside the control hierarchy
being analyzed.
It's not the analyst I'm concerned with, but the higher systems that
will have to use the perception as a component of a higher-level
perception.
If the _user_ of a neural network wants to perceive the network as
having "seen" a pattern, then that _user_ will probably prefer the
network to produce a single output of a high value for that
pattern.
Yes. The "user" of a perceptual neural network in the brain is a
perceptual system of higher order in the same brain. As I point out
below, another user could be considered to be the comparator.
Even in doing
electronic designs, I had never said to myself that any function of
multiple signals had to be physically implemented, with an explicit
physical result, before it had any real existence. I mean, what
electronics engineer says something like that? It goes without saying.
Again, "real existence" to whom? I think to the designer, not to
the circuit. It's so much easier to design when you can think about
one signal in one place.
No, "real" to the circuit. Suppose you have an R.F. signal in which
there are implicitly a carrier frequency and two sideband envelopes. In
the sideband envelopes (once they are isolated by tuned circuits and
heterodyning) there are, implicitly, two audio signals. But before those
implicit audio signals can have the appropriate physical effects on the
rest of the circuits, one of them must be diode-detected and smoothed
and represented explicitly as a voltage varying at the audio frequency.
In that varying audio voltage, there is implicit some sort of sound,
but before those sounds can be produced, the voltages must be converted
into physical forces on a speaker cone. Implicit in those sounds might
be music or news, but before either music or news can become an explicit
perception, the sounds must be sensed and converted into neural signals,
and the neural signals must give rise to further neural signals until
the necessary level of perception is reached. It would not be sufficient
to present the R.F. signal to the final perceptual apparatus, even
though all the relevant information is implicit in that signal.
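The diode-detection step in this analogy can itself be sketched as code. A minimal sketch (mine, in Python; the carrier, audio frequency, and one-pole smoothing constant are illustrative):

```python
import math

# Sketch of the detection step in the radio analogy: the audio content of
# an AM signal is only implicit until the signal is rectified (the diode)
# and smoothed (the capacitor) into an explicit baseband voltage.

def detect_envelope(am_signal, alpha=0.05):
    """Half-wave rectify, then smooth with a one-pole lowpass filter."""
    out, y = [], 0.0
    for s in am_signal:
        rectified = max(0.0, s)        # the diode
        y += alpha * (rectified - y)   # the smoothing capacitor
        out.append(y)
    return out

rate = 40000
carrier, audio = 5000.0, 100.0
am = [(1.0 + 0.5 * math.sin(2 * math.pi * audio * n / rate)) *
      math.sin(2 * math.pi * carrier * n / rate) for n in range(rate)]
env = detect_envelope(am)
# after settling, the output varies at the audio rate, not the carrier rate
assert max(env[rate // 2:]) > 2.0 * min(env[rate // 2:])
```

Before the detector, the audio exists only as sideband structure in the R.F. waveform; after it, there is an explicit voltage the rest of the circuit can act on -- which is the point being made.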
----------------------------------
I think that in the past few decades there has arisen a sort of
philosophical bias against the idea that there must be a perceiver of
perceptions. "Down with Dualism!" says the bumper sticker. The goal
seems to be to model signals in which certain kinds of information can
be shown to be present, even though that information is not converted
into any explicit form. Thus, Peter says that the harmonic structure of
a pitch is conveniently contained in the distribution and
autocorrelation of impulse intervals -- even though there is nothing
that is explicitly performing the autocorrelation and extracting and
representing the various dimensions of this distribution. The idea seems
to be that simply because the structure _could_ be extracted, that is
sufficient to account for perception of the various aspects of the
structure.
This is entirely consistent with the idea that perceptions do not have
to be made explicit, as signals. The mere presence of a pattern of
firings is enough for the pattern to become a perception. Only in this
way, where in effect the pattern is also the perception of the pattern,
can the Observer be done away with.
Of course the objective is to do away with the metaphysical Observer.
However, in a hierarchical model, we have a different non-metaphysical
observer, small-o, which is the set of all higher perceptual functions.
Now the idea of implicit patterns leads to a problem, because unless the
various attributes of the pattern are represented explicitly as signals,
there is no way to pass information about individual attributes to the
higher systems. All that could be passed upward would be the pattern
itself, unanalyzed.
But if you review Peter's recent post, you will see that he speaks of
doing explicit computations to extract from the pattern certain specific
attributes of it. The raw signal is passed through a tapped delay line,
each delayed signal is multiplied by the undelayed signal, and the
products are summed into an array of "values." Then, to pick out the
attribute called the "fundamental frequency," the highest peak in the
autocorrelation function (actually carried out by computing machinery)
is located, and is summed together with all its harmonics into -- what?
A final, explicit number that represents the fundamental frequency, or
the longest interval. A scalar value that stands for just one dimension
of the temporal pattern, the apparent pitch.
[To extract a different attribute of the pattern, such as a measure of
timbre, a different computation would have to be applied, resulting in a
different number. There's no reason why several computations can't be
performed in parallel with the same pattern as input to each].
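The computation just described -- tapped-delay-line products summed per lag, then the lag of the strongest peak taken as the period of the fundamental -- is short-lag autocorrelation with a peak pick. A minimal sketch (mine, in Python; the search range, sample rate, and test signal are illustrative, not Peter's actual parameters):

```python
import math

# Sketch of the computation described above: autocorrelate the signal
# (each "tap" is the product of the delayed and undelayed signal, summed),
# then report the fundamental implied by the strongest off-zero lag.

def autocorr_pitch(samples, sample_rate, fmin=80.0, fmax=500.0):
    """Return the fundamental implied by the strongest autocorrelation lag."""
    lag_min = int(sample_rate / fmax)
    lag_max = int(sample_rate / fmin)
    best_lag, best_val = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        val = sum(samples[i] * samples[i - lag]
                  for i in range(lag, len(samples)))
        if val > best_val:
            best_lag, best_val = lag, val
    return sample_rate / best_lag

rate = 8000
# harmonic complex: fundamental 200 Hz plus its 2nd and 3rd harmonics
tone = [sum(math.sin(2 * math.pi * 200.0 * h * n / rate) for h in (1, 2, 3))
        for n in range(2000)]
assert abs(autocorr_pitch(tone, rate) - 200.0) < 5.0
```

The output is exactly the kind of thing argued for here: a single scalar standing for one dimension of the temporal pattern, the apparent pitch, with the harmonic structure handled implicitly because every harmonic reinforces the lag of the common fundamental period.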
Now it becomes very simple to remember the pitch, to use a remembered
value as a reference signal, to compare a present pitch with the
reference pitch, and to generate an appropriate error signal. Only
scalar values have to be handled, once the attribute has been
computationally extracted. The reference signal no longer has to be
another temporal pattern. The comparison is ultra-simple, because it
consists of subtracting one scalar value from another. And the error
signal, being also a scalar value, can be converted simply into a set of
outputs that, via the world outside the control system, reduce the
error. All this can be done with simple rate-coded signals.
And since pitch is now represented by a single scalar number, a simple
signal can carry a copy of the pitch signal to higher systems, separated
from timbre and other attributes of the pattern, where it joins with
other inputs to higher perceptual functions. Only the relevant feature
of the temporal pattern, not the whole pattern, is sent to the higher
systems.
Even if you don't believe in an Observer, or see awareness as any
particular problem, I think that in a hierarchical model you have to
admit that we have to consider small-o observers. And even at the level
of the original perceptions, we can see that recording and replay of
perceptual signals as reference signals, and the process of comparison,
are far simpler if there is an explicit computation of each aspect of
the input pattern that is to be separately controlled, so that after the
initial operations, only scalar values have to be handled.
In fact, we can now see that the interval-based computations that Peter
proposes can be treated as internal details of a perceptual function, or
actually a number of perceptual functions each extracting a measure of
some feature of the input pattern and producing a single scalar signal
representing the state of one attribute. There is no reason why these
output signals could not be rate-encoded, since each one represents the
final outcome of the kinds of computations that Peter proposes. And then
we would have what I think is required: one scalar signal per
perception, with the measure of that perception being represented by
firing frequency.
Tell me what's wrong with that.
-----------------------------------------------------------------------
Ed Ford May 10, 1996 --
Perry Good called me to read me that letter from Bill Glasser. Why am I
reminded of L. Ron Hubbard?
Tom Bourbon is reporting very favorably on what he has seen of your
school programs in action so far. Keep up the good work. See you at the
meeting in July.
-----------------------------------------------------------------------
Best to all,
Bill P.