modeling audition

[From Bill Powers (960514.1430 MDT)]

Ellery Lanier [960514 11:00 AM mst] --

     The posts of Peter Cariani are fascinating but it would be real
     nice if he would include a brief description of the basics of what
     he is writing about for us non-initiates. CSGNET is an
     interdisciplinary group and I am sure that we all want to know what
     the other fields are doing.

That would help me, too -- you saw the kind of trouble I got myself into
through not knowing enough about the field.

     If the meeting at Flagstaff will be just a lot of technical jargon
     about pitch detection it will be a major disappointment.

Not very likely -- it's never been that way in the past 11 meetings.
This is mostly because all sessions are plenary, and those who want to
talk about technical matters are forcibly reminded that they have to
explain what they're saying to non-technical people. We don't have
paper-readings; if you want to give a paper, you bring 30 copies with
you and distribute them on the first night (or whenever you get there).
When you give a talk (if you want to), you summarize the paper in 10
minutes and the discussion starts. The techie types have to interact
with non-techies, and vice versa; I think it's good for everyone.

Here on the net, the threads are started spontaneously, and those who
are interested participate. No reason we can't have as many threads
going as the participants want -- they vote by commenting.

     I sent a post a few days ago which apparently was not published so
     I am posting it again below.

It got through. I was intending to get around at least to an
acknowledgement (nonlinear perceptual functions are an interesting
subject, if not on the front burner at the moment). But things leak out
of my brain at a rate proportional to the contents. Consider this an
ack, but no comment for now.


-----------------------------------------------------------------------
Peter Cariani (960514.1200 EDT) --

     These "pitch shift" experiments with inharmonic complex tones led
     to the realization that whatever the auditory system is doing, it
     must do a fine structure analysis of the waveform and/or the power
     spectrum rather than an analysis of waveform envelope and/or
     frequency spacings.

I don't see why. When you have a system designed to handle harmonically-
coherent signals and feed it a strange anharmonic signal, it's going to
behave in a somewhat different way. But the way it behaves can fall out
of the same design that works under normal circumstances. Since there is
usually more than one way to handle the normal case, the abnormal case
can help you choose among the models.

     The pitch shift simply "drops out" of an autocorrelation analysis
     (as de Boer showed in his 1956 thesis), but it can also be
     accounted for in spectral terms, the major models being Goldstein's
     model, which is based on best matches to "spectral templates" that
     are harmonic series, and Terhardt's model, which is based on a
     "subharmonic sieve".

Ideally, the variant case should drop out of the basic design, the way
odd-harmonic distortion drops out of (i.e., shows up all by itself in) a
push-pull amplifier design. I think that if you build a model around the
most usual, middle-of-the-range case, you can usually handle many
extreme cases simply by adding suitable trimmings to the model, like
amplitude limiting, large-signal nonlinearities, and so on. I'll admit
that this is hard to do in a purely mathematical treatment, but in
modeling by simulation you can do lots of things that are impossible
with the rigorous mathematics.

     As I understand your proposal, the organism or device would be
     adaptively generating harmonically-structured signals, in effect
     doing adaptive filtering, in order to find the best fit to a
     harmonic series.

That would be the effect. I'm not sure how much of my fancy harmonic-
series filtering would really be necessary, however. The basic model of
a neural integrator, in Pascal, is simply

output := output + gain*input*dt - leakage*output*dt;

The input and output can be represented by an impulse-frequency, for
maximum realism, if you want to take the trouble. If the input is a
series of impulses, there will already be plenty of harmonics in the
signal; adding them explicitly may be unnecessary.
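
In case a complete example helps, here is that integrator as a little
stand-alone program. The constants are made up (nothing is fitted to
physiology), and a constant input stands in for a steady impulse rate:

program LeakyIntegrator;
const
  dt      = 0.001;    { integration step, seconds (arbitrary) }
  gain    = 50.0;     { input weighting (arbitrary) }
  leakage = 10.0;     { decay rate, 1/seconds (arbitrary) }
var
  output, input: real;
  i: integer;
begin
  output := 0.0;
  input  := 1.0;      { constant input standing in for a steady rate }
  for i := 1 to 1000 do
    output := output + gain*input*dt - leakage*output*dt;
  { after enough steps output settles near gain*input/leakage = 5.0 }
  writeln('settled output = ', output:8:3);
end.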

A truly realistic neural model would convert the output to a frequency
by using a model of the neural impulse, where the time between impulses
depends on the recovery time after a discharge and the momentary setting
of the PSP. No frequency averaging at all is needed in such a model; the
impulses just occur when they occur. The only averaging is that which
takes place inside the neuron as the potential is pumped upward by each
incoming jolt and decays downward exponentially.
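
An untested sketch of what such a neuron model might look like (the
threshold, jolt size, decay rate, and recovery time are all invented
for illustration):

program ImpulseNeuron;
const
  dt        = 0.0001;  { 0.1 ms time step (arbitrary) }
  decay     = 100.0;   { PSP decay rate, 1/s (arbitrary) }
  jolt      = 0.2;     { PSP increment per incoming impulse (arbitrary) }
  threshold = 1.0;     { firing threshold (arbitrary) }
  recovery  = 0.002;   { 2 ms recovery time after a discharge (arbitrary) }
var
  psp, t, lastSpike: real;
  step, count: integer;
begin
  psp := 0.0;
  lastSpike := -recovery;
  count := 0;
  for step := 1 to 10000 do          { simulate one second }
  begin
    t := step*dt;
    if (step mod 10) = 0 then        { an input impulse every 1 ms }
      psp := psp + jolt;             { each jolt pumps the potential up }
    psp := psp - decay*psp*dt;       { exponential decay between jolts }
    if (psp >= threshold) and (t - lastSpike >= recovery) then
    begin
      count := count + 1;            { an output impulse, when it occurs }
      lastSpike := t;
      psp := 0.0;                    { discharge resets the potential }
    end;
  end;
  writeln('output impulses in one second: ', count);
end.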

I haven't tried using trains of impulses yet; the neural integrator
model is simplest to run using numbers for the variables instead of
literal pulse trains. You might be interested in the design of the self-
tuning filter:

y := y + f*x - y/Q + input_signal;                  { resonator state 1 }
x := x - f*y - x/Q;                                 { resonator state 2, in quadrature }
output := output + input_signal*y - output/leakage; { leaky integral of input times resonator response }
error := reference - output;
f := gain*error;                                    { the error signal retunes the filter }

   (Initialize x, y, and output to zero.)

Here f determines the tuning frequency, Q determines sharpness and
resonant gain, "leakage" determines the bandwidth of the control
circuit, and "gain" determines the tightness and speed of control. Only
integer arithmetic is necessary, so it's easy (using pointers to
records) to run a hundred or so of these systems at a reasonable speed.
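
Just to show the mechanics of running a bank of these through pointers
to records, here is an untested sketch. It uses reals rather than the
integer arithmetic I mentioned, and every numerical value in it is
arbitrary:

program FilterBank;
type
  PFilterUnit = ^TFilterUnit;
  TFilterUnit = record
    x, y, output, f, reference: real;
  end;
const
  Q       = 50.0;     { resonator sharpness (arbitrary) }
  leakage = 100.0;    { output smoothing term (arbitrary) }
  gain    = 0.0001;   { control gain (arbitrary) }
var
  bank: array[1..100] of PFilterUnit;
  input_signal, error: real;
  i, n: integer;
begin
  for i := 1 to 100 do
  begin
    New(bank[i]);
    with bank[i]^ do
    begin
      x := 0.0;  y := 0.0;  output := 0.0;
      f := 0.001*i;           { spread the initial tunings (arbitrary) }
      reference := 1.0;
    end;
  end;
  for n := 1 to 30 do           { a few samples, just to exercise it }
  begin
    input_signal := sin(0.3*n); { stand-in for an input waveform }
    for i := 1 to 100 do
      with bank[i]^ do
      begin
        y := y + f*x - y/Q + input_signal;
        x := x - f*y - x/Q;
        output := output + input_signal*y - output/leakage;
        error := reference - output;
        f := gain*error;
      end;
  end;
  writeln('unit 50 tuning parameter f = ', bank[50]^.f:10:6);
  for i := 1 to 100 do
    Dispose(bank[i]);
end.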

     It appears to me to be almost an "analysis-by-synthesis" strategy
     which is adaptively controlled.

Yes, although the "adaptation" is only a systematic method of making the
filter track a frequency. It would be interesting to add nonlinear
limits to the basic filter (the first two statements), and as I said to
make the signals x and y into pulse trains. The multiplication would be
slightly complicated because it is the PSPs that have to multiply, not
the impulses. In effect, one signal frequency has to modulate the other.
It might be best just to cheat on that one and say "voila, the product!"
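
One way to picture the multiplication, purely as a sketch with arbitrary
constants: smooth each impulse train into a PSP-like variable with its
own leaky decay, and let the smoothed variables multiply rather than the
impulses themselves.

program PspProduct;
const
  dt    = 0.0001;   { 0.1 ms time step (arbitrary) }
  decay = 200.0;    { PSP decay rate, 1/s (arbitrary) }
  jolt  = 1.0;      { PSP increment per impulse (arbitrary) }
var
  pspA, pspB, product: real;
  n: integer;
begin
  pspA := 0.0;
  pspB := 0.0;
  product := 0.0;
  for n := 1 to 10000 do
  begin
    { two impulse trains at different rates: every 1.0 ms and 1.3 ms }
    if (n mod 10) = 0 then pspA := pspA + jolt;
    if (n mod 13) = 0 then pspB := pspB + jolt;
    pspA := pspA - decay*pspA*dt;
    pspB := pspB - decay*pspB*dt;
    product := pspA*pspB;    { one smoothed signal modulating the other }
  end;
  writeln('final product = ', product:8:3);
end.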

     The major problem that one would encounter in a neural
     implementation would be realizing precise harmonic relations (if
     implemented using "place-pattern" generators) or generating
     precise time patterns (if operating using temporal pattern
     generators).

Well, as you can see this model doesn't do anything precisely. If we had
to use harmonically-related filters, I would start with the highest-
frequency filter, and then simply divide the input frequency down by
factors of 2, which is relatively easy, to feed the inputs of the other
filters. But that's only a guess as to how I would end up doing it if I
did it.
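
The dividing-down itself is trivial; a toy version (illustrative only,
nothing physiological) just passes every second impulse and chains such
stages for successive factors of 2:

program DivideByTwo;
var
  n, count1, count2: integer;
  toggle1, toggle2: boolean;
begin
  toggle1 := false;
  toggle2 := false;
  count1 := 0;
  count2 := 0;
  for n := 1 to 1000 do          { 1000 input impulses }
  begin
    toggle1 := not toggle1;
    if toggle1 then
    begin
      count1 := count1 + 1;      { half-rate train }
      toggle2 := not toggle2;
      if toggle2 then
        count2 := count2 + 1;    { quarter-rate train }
    end;
  end;
  writeln('input: 1000  divided by 2: ', count1, '  by 4: ', count2);
end.
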
------------------

Thus while the mean rate of firing can remain the same, there is a
modulation of the mean rate at the audio frequency, so a
representation in terms of impulses per second would look like a
mean value plus an A.C. component.

     This is mostly correct, although talking in terms of a "mean rate"
     in this context (500 usec windows for a 1 kHz tone) is a bit
     misleading. It's the AC component that carries the information ...
     The two components, AC and DC, largely covary in the auditory
     nerve, so it is difficult to separate out the representational
     consequences of the two kinds of information.

The AC component carries the AC information; the "DC" (slower-varying)
component carries the loudness information. By using a neural integrator
we can generate a loudness signal that varies as the DC component, with
too long a time constant to reflect the AC component. This slowly-
varying component can be used for loudness control at the same time that
the AC component is being used for pitch recognition and control.
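
A sketch of what I mean, with arbitrary constants: the same integrator
form, given a slow leak and fed an already-rectified drive standing in
for the firing-rate input, follows the slow level and ignores the
1-kHz cycles:

program LoudnessSignal;
const
  dt      = 0.0001;   { 0.1 ms time step (arbitrary) }
  gain    = 20.0;     { integrator input weight (arbitrary) }
  leakage = 20.0;     { slow decay: 50 ms time constant (arbitrary) }
var
  loudness, drive: real;
  n: integer;
begin
  loudness := 0.0;
  for n := 1 to 20000 do          { two seconds }
  begin
    { rectified 1-kHz test signal, standing in for the neural drive }
    drive := abs(sin(2.0*pi*1000.0*n*dt));
    loudness := loudness + gain*drive*dt - leakage*loudness*dt;
  end;
  { settles near (gain/leakage)*mean(drive) = 2/pi, about 0.64 }
  writeln('loudness signal = ', loudness:8:3);
end.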

     The problem here is that the efferent systems of the auditory
     system, in the middle ear and in the cochlea, are much slower (tens
     to hundreds of milliseconds and longer) than the intensity
     fluctuations you want to control.

Not at all. The intensity fluctuations I want to control represent
loudness, not the individual pressure cycles.

     It's also still a matter of debate how much compensatory gain these
     systems have, but they appear to be fairly weak. ... In an
     artificial system, these limitations need not apply. ...

They don't apply to turning a knob on a radio to control perceived
loudness, or to varying the effort in the diaphragm to control loudness
of speech, either.

     I've never liked the term "instantaneous rate" because it has
     always seemed to me to be oxymoronic: "rates" always need to be
     computed over some (contiguous) time interval.

Doesn't "instantaneous interval" have a similar problem? You need at
least two impulses to get 1 interval, so the interval can't be defined
as occurring at a specific time. If the impulse times are t1 and t2,
then a measure of frequency is just 1/(t2 - t1). So what if this
frequency changes with every new impulse? So does the interval t2 - t1.

In a neuron, the frequency of incoming impulses is directly related to
the mean post-synaptic potential, which is directly related to the
output frequency. If you start using 1/f to represent the processes, you
end up with inverted relationships and very inconvenient expressions for
the dependence of output interval on input interval. The physical
situation isn't changed, of course, but the mathematics gets ugly. Just
try writing the expression for my neural integrator using an interval
variable!
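
To show what I mean, rewrite the integrator above with each frequency
expressed as the reciprocal of an interval (t2 - t1 for the input,
T2 - T1 for the output):

1/(new output interval) := 1/(T2 - T1)
                           + gain*dt/(t2 - t1)
                           - leakage*dt/(T2 - T1)

and to get the new interval itself you have to invert the whole
right-hand side at every step.
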
--------------------
     There are "locker neurons" in auditory thalamus that follow
     stimulus periodicities up to 1 kHz, and (a very few) similar kinds
     of units have been seen at the level of the primary cortex in awake
     animals, but again this looks much more like a "periodicity code"
     than a "rate-frequency" code.

That's pretty encouraging -- that frequency range covers the lower
formants of speech at least. I'm not sure what you mean by a
"periodicity code." Do these thalamic signals vary in frequency as the
stimulus frequency varies, or not? The harmonics don't have to be
represented as well; they might simply be represented as low-frequency
signals standing for the relative strengths of the harmonics. There could
be octave-locking, with the basic pitch perception reduced to one standard
octave, and the particular octave represented by a separate signal.

     Those people who have looked for cortical "pitch detectors" would
     have seen "rate-frequency" units, since they generally are paying
     attention to neural discharge rates, although this negative finding
     needs to be strongly tempered by the very small number of people who
     have looked and the relatively few places that they have looked.

They may have been looking for signals that vary at the audio frequency.
Anyway, by the time the information gets to the cortex, I imagine that
the signals represent abstract qualities of the sound, and not pitches
any more.
-----------------------
     I think the point here is that we tend to conceive of the brain
     (which we don't understand) as working something like our most
     advanced technological artifacts (in this case, computers, neural
     nets, or control systems, which we think we understand), but there
     may be principles involved in its functioning that we haven't
     discovered yet.

Of course. But we try to use what we do understand to see how far we can
get with it. I think we have the spinal reflexes and the control of limb
position pretty well nailed; I doubt that any drastically new principles
will be found there. Obviously, we aren't that far along with sound
perception and control.
----------------------------------------------------------------------
Best to all,

Bill P.

[Martin Taylor 960514 22:35]

Bill Powers (960514.1430 MDT)

(To Peter Cariani)

They may have been looking for signals that vary at the audio frequency.
Anyway, by the time the information gets to the cortex, I imagine that
the signals represent abstract qualities of the sound, and not pitches
any more.

The Patterson "Stabilized Auditory Image" to which I referred a few
postings ago does not change so long as the perceived quality of the
sound does not change. But it incorporates both the inter-impulse
intervals and the place-frequency representation. If Patterson is
right that a person can hear anything that is visibly different in the
SAI and cannot hear anything that does not affect the SAI, it seems
reasonable to assume that the variables that influence the form of the
SAI have something to do with auditory perception. By the way, Patterson
was one of the main researchers in the pitch perception of the odd
shifted-harmonic tones before he developed the SAI.

By the way, responding to something in an earlier posting:

Has anyone tried to apply the perceptron approach to a set of frequency-
sorted signals?

Yes, it's been a standard approach to speech recognition for quite a long
time. But the sort of simple perceptrons I have been talking about on
CSGnet don't work very well. Some kind of recursion is usually part
of successful systems, where the output from some level of the perceptron
is incorporated along with new data as the input to the system. Also,
as you proposed a few messages ago, successful systems are usually
modular, with the different modules doing different jobs (e.g. (module 1)
"If this is a vowel, which vowel is it likely to be?", (module 2) "If this
is a stop consonant, which is it?", ... (module N+1) "Is it a vowel, or a
stop consonant, or a ..."). That's only one of several ways of modularizing
that have been tried.
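
As a bare-bones sketch of the recursion I mean -- random weights and
arbitrary sizes, so it shows only the data flow, not a trained
recognizer:

program RecurrentNetSketch;
const
  nBands   = 16;    { frequency-sorted inputs per frame (arbitrary) }
  nOutputs = 8;     { units whose outputs are fed back (arbitrary) }
var
  frame: array[1..nBands] of real;
  prevOut, newOut: array[1..nOutputs] of real;
  w: array[1..nOutputs, 1..nBands] of real;
  v: array[1..nOutputs, 1..nOutputs] of real;
  i, j, t: integer;
  sum: real;
begin
  Randomize;
  for j := 1 to nOutputs do
  begin
    prevOut[j] := 0.0;
    for i := 1 to nBands do w[j, i] := Random - 0.5;
    for i := 1 to nOutputs do v[j, i] := Random - 0.5;
  end;
  for t := 1 to 100 do                  { 100 successive spectral frames }
  begin
    for i := 1 to nBands do
      frame[i] := Random;               { stand-in for band energies }
    for j := 1 to nOutputs do
    begin
      sum := 0.0;
      for i := 1 to nBands do sum := sum + w[j, i]*frame[i];
      for i := 1 to nOutputs do sum := sum + v[j, i]*prevOut[i];
      if sum > 0.0 then newOut[j] := 1.0 else newOut[j] := 0.0;
    end;
    prevOut := newOut;        { outputs rejoin the next frame's input }
  end;
  writeln('final output of unit 1: ', prevOut[1]:4:1);
end.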

Martin