Sensorimotor adaptation in speech production

[From Bruce Nevin (980322.1342 EST)]

Bill Powers (980322.0121 MST)--

Thanks for Houde's email address; I have written to him. I can't get the
Science article online without being a member of AAAS, and the Science
site doesn't seem to say how much membership costs. In any case, that will
take a while. I have asked him for a reprint.

From a UNIX host, finger houde@phy.ucsf.edu returns this:

Project: Affiliation merzenich_lab office: 126MR-4 work: 6-2512 box: 0732
Plan:

···

-------------------------------------------------------------------
I am a postdoctoral researcher in Mike Merzenich's lab.

Office: MR4, room 126
        (415) 476-2512

Mailing Address: UCSF
                 513 Parnassus Ave., S-877
                 San Francisco, CA 94143-0732
                 houde@phy.ucsf.edu

Home: 1233 Arguello Blvd. #4
      San Francisco, CA 94122
      (415) 566-9280

Home in Rochester: 35 Rensselaer Dr.
                   Rochester, NY 14618
                   (716) 473-2126
                   (716) 239-6029
-------------------------------------------------------------------

I found a URL for Merzenich's lab at the UCSF site, but the server appears
to be down just now. Here's the URL:

http://www.keck.ucsf.edu/labinfo/merzenic.htm

This is probably in physiology, as there is no physics department listed
for UCSF, and the link to the physiology department also goes to a host
named keck (same error).

> I wouldn't call this "learning," because the adaptation is always in the
> right direction. I'd rather see it just as a slow control process.

Fine. That is how I see it.

The article actually does support your position, in that the new motor
patterns persist when auditory feedback is blocked by noise (the
"adaptation" part of the experiment). I didn't see any explicit data on
how quickly the "compensation" takes place, but they did apply the
disturbance slowly over a period of 17 minutes, and subjects said they
didn't notice the disturbance or their compensating change of articulation.
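The persistence result falls naturally out of a two-level loop. Here is a toy simulation (my sketch, not Houde's model; the ramp, gains, and units are all invented) in which an auditory loop slowly re-sets an articulatory reference, the articulation is assumed to track that reference immediately, and masking the auditory feedback simply freezes the outer loop:

```python
# Toy two-level control loop (illustrative only, not Houde's model).
# Outer (auditory) loop: slowly adjusts the articulatory reference.
# Inner (articulatory) loop: assumed fast enough to be instantaneous.

def run(steps=600, ramp_end=300, mask_at=400):
    aud_ref = 1.0      # intended vowel sound (arbitrary units)
    kin_ref = 1.0      # articulatory reference signal
    trace = []
    for t in range(steps):
        produced = kin_ref                       # inner loop tracks its reference
        if t < mask_at:
            shift = min(0.3, 0.3 * t / ramp_end)  # apparatus ramps the formant shift
            heard = produced + shift              # what reaches the ear
            kin_ref += 0.05 * (aud_ref - heard)   # outer loop re-sets kin_ref
        # after mask_at: auditory feedback swamped by noise; the outer loop
        # holds its output, so the adapted articulation simply persists
        trace.append((t, kin_ref, produced))
    return trace

trace = run()
```

At the end of the run the articulation is still displaced by about the full 0.3 even though no auditory feedback has been available for 200 steps, which is the "adaptation" result: the new motor pattern persists under masking.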

Non-acoustic perceptions for vowels are much harder to identify consciously
than those for consonants, because there is less physical contact involved.
When I try to perceive a configuration of my tongue as for a vowel, it is
inseparable from an imagined acoustic image of the vowel. I have asked
Houde if the "perturbations" affected the formant transitions for
consonants at the margins of syllables. I supposed that the converse,
perturbing consonants without affecting vowels, was probably not
technically feasible in real time, but asked whether it had been considered.

> I don't think you mean that tactile/kinesthetic reference levels are fixed.

You are right, of course. During the course of a normal interval of
speaking, they are fixed, and that was what I meant, relatively fixed in
that sense, possible to re-set over longer time spans.

> What is relatively fixed, I would guess, is the
> perceptual input function that detects the configuration of the mouth. As
> the organization of this input function slowly changes, the objective mouth
> configuration corresponding to a given articulation reference signal
> changes. This is the long-term adaptation. Short-term, however, one can
> still rapidly alter the reference signal for which articulation is to be
> produced, say along the scale from "eee" to "ooo", and the actual sensed
> articulation will immediately follow the reference signal. So the auditory
> feedback effect is very rapid, as rapid as one's ability to say consecutive
> phonemes. The articulatory control must operate as rapidly as speech can
> proceed.

Yes, the PIF gets re-set, and speech production continues as non-acoustic
control with the new reference levels.
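The way I picture that re-setting, it can be reduced to a toy loop in which the input function's gain drifts slowly while the reference flips rapidly (my sketch; the gain, drift rate, and the "ee"/"oo" coding are invented for illustration):

```python
# Toy loop: perception p = k * q, where q is the objective mouth
# configuration and k is the (slowly drifting) input-function gain.
# The reference r flips rapidly between "ee" (0.0) and "oo" (1.0).

def simulate(steps=200):
    k = 1.0            # perceptual input function gain
    q = 0.0            # objective mouth configuration
    history = []
    for t in range(steps):
        r = 0.0 if (t // 20) % 2 == 0 else 1.0   # fast-changing reference
        k += 0.001                               # slow drift of the input function
        p = k * q                                # perceived articulation
        q += 0.5 * (r - p)                       # fast control of articulation
        history.append((r, p, q))
    return history

history = simulate()
```

Perception keeps matching the reference throughout, but by the end the objective configuration that yields the same perceived "oo" has shifted noticeably: short-term control is fast, while the long-term adaptation rides on the drifting input function.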

Is this configuration control? The PIF combines pressure sensations from
the tongue, tension for the velum raised or lowered, perhaps configuration
for lip rounding. Voice likely does require acoustic control to maintain
the critical laryngeal aperture and tension required. The difference
between d, j, and g is the point of tongue contact. The difference between
d and n is in the lowering of the velum. The difference between d and t is
in the larynx. Is this configuration control? I think maybe I'm putting the
question wrongly. I have this category, "configuration", based on the
example of limb position. But if the PIF receives multiple inputs from
level 2, sensations of tongue contact and velum tension, then it is a level
3 input function. It could be that one of its inputs is also from level 3,
for lip rounding. Am I understanding correctly?

> The adaptation could actually be in the output function of the auditory
> level of control.

If so, they would hear themselves producing the desired sounds without
altering their actual pronunciation. If a person said "beg" and heard
"bag", with adaptation of the auditory output function they would hear
"beg" correctly with no alteration of their speech output. If the
adaptation is in the non-acoustic control loop, one would expect a shift
toward the articulation for "big" in order to get the sound of "beg." I
gather from your description that the latter is the case--their motor
outputs changed. Or am I missing your point here?

>> The basis for this is twofold. First, experimentally blocking auditory
>> feedback with no degradation of pronunciation;

> This certainly tells us something, but it's like Blom's "walking in the
> dark" example. People may use auditory feedback when it's available, and
> switch to kinesthetic control when it's not. Before I will accept "no
> degradation of pronunciation" I would want to see sonograms: human
> listeners can tolerate large changes in auditory inputs.

The point I made at the time is that you can't walk in the dark with the
same confidence as you can with the light on, not without stumbling, and
you can't do it for a protracted period through varying terrain. Over a
90-minute period of reading unfamiliar text you would expect some
stumbling. And you would immediately expect some of the subjective
experience of groping without the benefit of normal sensory input. I think
this is more telling than the fine details of a spectrogram, which anyway
show considerable variation in normal speech. But you say as much:

> This is not an easy problem. The biggest problem is why, when auditory
> feedback is swamped by noise, the outputs don't wildly exaggerate the
> reference signal changes for the articulatory systems.

I'll let you know as soon as I hear from Houde.

Now, back to the dissertation. (And we're doing some tiling this afternoon.)

  Bruce Nevin

[From Bill Powers (980322.1940 MST)]

Bruce Nevin (980322.1342 EST)--

>> I don't think you mean that tactile/kinesthetic reference levels are fixed.

> You are right, of course. During the course of a normal interval of
> speaking, they are fixed, and that was what I meant, relatively fixed in
> that sense, possible to re-set over longer time spans.

I think you have a picture of how the reference signals enter into speaking
that is very different from mine. In my view, the kinesthetic reference
signals specify such things as tongue retraction/extension, tongue pressure
upward/downward, and jaw, diaphragm, and vocal cord states. In order for
speech to happen, these reference signals must vary rapidly, as rapidly as
the mouth configuration is changing while we say "spot on, old chappie!":
ah aw oh aa ee, with consonants flapping away between the vowels. I
visualize the vowels as a single variable that varies smoothly from ee to
oo. There may be a second dimension as well. My point is that while we
speak, this variable is changing from one state to another rapidly and
continuously; a continuous record of this variable would look like this:

                           ie
                           ---
                    chap
                    ---
    spot
    ---
         on
         ---
              old
              ---
> Is this configuration control? The PIF combines pressure sensations from
> the tongue, tension for the velum raised or lowered, perhaps configuration
> for lip rounding. Voice likely does require acoustic control to maintain
> the critical laryngeal aperture and tension required. The difference
> between d, j, and g is the point of tongue contact. The difference between
> d and n is in the lowering of the velum. The difference between d and t is
> in the larynx. Is this configuration control? I think maybe I'm putting the
> question wrongly. I have this category, "configuration", based on the
> example of limb position. But if the PIF receives multiple inputs from
> level 2, sensations of tongue contact and velum tension, then it is a level
> 3 input function. It could be that one of its inputs is also from level 3,
> for lip rounding. Am I understanding correctly?

Reasonable guesses, but I'd like to see a stronger experimental base for
them (stronger than any I know of, at least).

>> The adaptation could actually be in the output function of the auditory
>> level of control.

> If so, they would hear themselves producing the desired sounds without
> altering their actual pronunciation.

No, the output adaptation would result in different kinesthetic reference
signals for the same acoustic error. We have to remember that with the
formant being altered, the person changes the articulation until the sound
entering the ear is actually the same as it was before. In order for this
to happen, the articulation must end up different from what it was before.

> If a person said "beg" and heard
> "bag", with adaptation of the auditory output function they would hear
> "beg" correctly with no alteration of their speech output.

Adapting the output function doesn't affect what they hear. It affects how
the kinesthetic reference signals are changed for a given auditory error.
If the disturbance is bumping the heard sound upward from ih toward ee, the
reference signals for articulation must relax the tongue to make the
actually produced sound sag downward from ih toward eh. The apparatus then
promotes the eh upward to ih, which is the intended sound.
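Numerically, putting eh, ih, and ee at invented points on a one-dimensional vowel scale, the bookkeeping goes like this:

```python
# Invented 1-D vowel scale, purely for arithmetic bookkeeping.
EH, IH, EE = 0.0, 1.0, 2.0

shift = EE - IH          # apparatus bumps the heard sound up one step
produced = IH - shift    # so articulation must sag down by the same amount
heard = produced + shift # what the apparatus delivers to the ear

assert heard == IH       # the intended sound reaches the ear...
assert produced == EH    # ...but the actual articulation is "eh"
```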

We must find out how rapidly the compensation occurs.

> If the
> adaptation is in the non-acoustic control loop, one would expect a shift
> toward the articulation for "big" in order to get the sound of "beg." I
> gather from your description that the latter is the case--their motor
> outputs changed. Or am I missing your point here?

Let's get the missing information first, then work on the details some more.

Best,

Bill P.