Language modeling

[From Bill Powers (931014.1500 MDT)]

Bruce Nevin (931013.1222) --

I like your idea of projecting what the "targets" must have
been from the enacted trajectory as successive phonemes foil
one another from actually attaining their "targets".

The idea wasn't so much that one phoneme prevents another from
attaining its target as the fact that phonemes follow in
sequence, and it takes time for the articulators to change from
the configuration for producing one phoneme to the different
configuration for producing the next one. If this process of
change is still under way when the reference-signal for the
phoneme changes to a different one, the articulators will
suddenly start changing in a new way as they are used to move the
phoneme from wherever it is toward a new target state. The
perceived phoneme will never actually arrive at the first target
state before that target is switched off and a new one is
switched on. Think of moving your finger toward a suddenly-
presented target, and having the target jump to a new position
when your hand is about half-way to the first target. Maybe this
is what you meant, but the way you stated it sounded like an
interaction among phonemes that exist at the same time. The
phonemes can't actually do anything to each other; it's the
control system producing them that has speed limitations. There's
only one phoneme being heard/produced at a time, as far as I
understand it.
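
To make that concrete, here is a small numerical sketch in Python
(the gains, time constants, and switch time are all invented for
illustration, not measurements of real articulators): a
one-dimensional control loop tracks a reference that jumps to a new
value before the perception has reached the first one, so the
trajectory bends toward the new target without ever arriving at the
old one.

    # Toy control loop: the reference switches before the perception
    # reaches the first target, so the first target is never attained.
    # All parameters are invented for illustration.
    dt = 0.001               # step size, seconds
    gain = 5.0               # error-to-output gain
    slowing = 0.5            # output slowing factor (limits speed of change)
    p = 0.0                  # perceived "articulator position"
    o = 0.0                  # output quantity
    for step in range(400):
        t = step * dt
        r = 1.0 if t < 0.1 else -0.5          # reference jumps at t = 0.1 s
        e = r - p                              # error signal
        o += (gain * e - o) * dt / slowing     # slowed integration of output
        p = o                                  # environment collapsed to identity
        if step % 40 == 0:
            print(f"t={t:.3f}  ref={r:+.2f}  perception={p:+.3f}")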

Your prediction about the terminal phoneme runs into the fact
that, at least in English, terminal syllables tend to be
unstressed. (There is a very strong tendency, across languages,
for word margins to be deemphasized, so it is not just
English.)

Is this true even for the case I described, the terminal phoneme
in an utterance, with silence following? I wasn't talking about
margins between words, but the final sound as at the end of a
sentence. I realize that words actually run together in a
continuous stream in the middle of a sentence. But don't you
think that the reference-"t" ending "don't" would be approached
less closely in "don't you" than in a terminal "don't?" "Don't do
it, don't do it, don't!"

In playing with sound spectrograms, I noticed that my mouth tends
to shape itself for the next vowel even before the intervening
consonant is spoken -- you can see the frequencies shift as the
tongue moves toward the new position, so the consonant excites
formants typical of the next vowel. When you say "dee" the pre-
release vocalization contains more high frequencies than when you
say "dough" because the tongue is already in position for the
following vowel and the mouth cavity is differently-shaped. And
as you say, the preceding vowel also has an effect -- if you're
saying "odo" there's no frequency-change during or just after the
consonant, but if you say "odee" there's an upward sweep and if you
say "eedo" a downward swoop, even during the consonant and before
the release. You can see that the middle and back of the tongue
are in motion from one configuration to the next even while the
tip is closing the cavity to make the "ud" sound. There seem to
be three or four tongue control systems acting at once,
independently: back of the tongue, middle, tip, and margins
(which contract and spread, as in saying "lee"). And they seem to
make easily-discernible differences in the perceived phoneme.

I'm bothered by how easily we perceive differences between sounds
that look identical in the spectrogram. I'm not talking about
sounds embedded in words, but just vowels uttered by themselves.
To me, the sound of "ee" is markedly different from the sound of
"ih", yet the difference is barely discernible on a spectrogram,
and is totally wiped out if I change the pitch of my voice. This
makes me more and more sure that the spectrogram is the wrong
model of the perceptual process. It's just not sensitive enough
to differences.

In the physics of sound production, you have a sound-generator
consisting of two membranes with their edges almost in contact,
under variable tension, with air being forced between them. Each
vibration launches a pressure-pulse up the throat and into the
mouth and (if open) nasal cavity. This pulse is partly reflected
at constrictions, and often partly escapes into the next cavity.
The reflected pulse must modify the pressure at the vocal cords,
changing their mode of vibration, so the underlying sound
generator is not just an independent vibrator, but part of the
whole system of cavities, resonating with them, with pressure-
waves travelling both ways at all times.
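
Here is a deliberately over-simplified source-filter sketch in Python
that leaves out exactly the back-coupling just described: the glottal
source is reduced to a bare impulse train and each cavity resonance to
an independent two-pole filter. All frequencies and bandwidths are
guesses chosen for illustration.

    import numpy as np

    fs = 16000                          # sample rate, Hz
    f0 = 120                            # pulse rate of the "vocal cords", Hz
    n = int(0.5 * fs)                   # half a second of signal

    # Glottal source as a bare impulse train: no reflected-wave coupling.
    source = np.zeros(n)
    source[::fs // f0] = 1.0

    def resonator(x, freq, bw):
        """One formant as an independent two-pole resonator (freq, bw in Hz)."""
        r = np.exp(-np.pi * bw / fs)
        a1, a2 = 2 * r * np.cos(2 * np.pi * freq / fs), -r * r
        y = np.zeros_like(x)
        for i in range(2, len(x)):
            y[i] = x[i] + a1 * y[i - 1] + a2 * y[i - 2]
        return y

    # Roughly "ee"-like formants (about 300 Hz and 2500 Hz) in cascade.
    out = resonator(resonator(source, 300.0, 60.0), 2500.0, 120.0)
    out /= np.abs(out).max()            # normalize; save as .wav to listen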

The pulses create many harmonics, but these harmonics must be
locked together in phase, or at least must be, for any phoneme,
in constant phase relationships that depend on the shapes of the
cavities. The sound spectrograph discards this phase information
because of the way the narrow-band filtered oscillations are
rectified and smoothed. Only amplitude information appears in the
output signal for any channel. This may be where the difference
between "ee" and "ih" is lost in the spectrogram.

What we desperately need are other ways of characterizing the
sounds of speech. It's easy to talk about what the "real" sound
is as opposed to the "perceived" sound, but what we call the
"real" sound is simply the output of an artificial perceptual
function. A sound spectrogram no more represents the physical
sound than the perceived phoneme does. We will not have the right
artificial perceptual function until it produces output signals
that can discriminate every sound that _we_ can discriminate, and
preferably more.

When you say that people perceive differences in sounds when
there are, in fact, no differences in the sound-waves, what do
you mean by "in fact?" You are really comparing human reports on
differences with differences detectable by an artificial input
function. If the artificial input function is incapable of
discriminating sounds with the same finesse as human perceptual
functions, then there could well be clear differences that simply
don't show up in the crude "objective" representation.

I think we need to do a lot more work on finding ways of
representing speech productions detected by a microphone, until
we can detect more differences than the human system can detect
rather than fewer. I mistrust stacking up hypotheses and
conclusions when the basic observations have such low resolution.

Even when we have an input function that can really discriminate
sounds well, we will have only a set of input signals for the
next level of perceptual functions. Here again we need to think
things through in more detail. A next-level input function can be
designed so as to treat a whole range of input signals as the
same -- produce the same output signal over this range. Yet when
the input signals change in a different pattern, each next-level
input function could respond with a change in its output signal.
This means that "contrasts" are determined by the nature of the
input function. When we understand the way contrasts behave in
different words (a matter of experiment, as you have outlined),
we can start looking for particular forms of input functions that
will yield the same contrasts and lacks of contrast, given the
set of lower-level input signals. If this works, and it might
well work, then we can forget all the more complex hypotheses and
rules that have been thought up to explain contrast effects.
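
For instance, a toy input function with an invented boundary (not a
claim about real phoneme perceivers): it maps a continuous
lower-level signal, say an F1-like value in Hz, to an output that is
essentially constant over a whole region and different outside it, so
the "contrast" is a property of the input function rather than of the
raw signals.

    import math

    def category_signal(f1_hz, boundary=350.0, steepness=0.1):
        """Toy next-level input function: nearly 0 below the boundary
        region, nearly 1 above it; within-region changes barely register."""
        return 1.0 / (1.0 + math.exp(-steepness * (f1_hz - boundary)))

    for f1 in (280, 300, 320, 380, 400, 420):
        print(f1, round(category_signal(f1), 3))
    # 280-320 all give outputs near 0, 380-420 all near 1: variation inside
    # a region is absorbed; variation across the boundary is not.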

I'm convinced that we have to explain as much as possible at each
level before introducing a new level of analysis. If we do that,
then the new level will have as little as possible left to
explain. If we can explain how most sounds are discriminated into
separate sensations and configurations and transitions, then all
that the event-level would have left to do would be to recognize
that a sound that is still ambiguous doesn't match one possible
perception of a sound-event but does match another. And words
that are intended differently but are actually the same at the
event level could be discriminated by their relations to other
words, and relations that are ambiguous can be discriminated at
the category level by subject-matter, and so on.

What I'm concerned with is that if we just dive into the sea of
facts known about speech, such as you and Martin keep bringing
up, we won't have any way to know which facts are important and
which are only side-effects of the way speech recognition and
production are organized. Maybe it's true that people demonstrate
subtle and systematic differences in producing phonemes that only
a trained linguist can detect, but it may also be true that these
differences have nothing to do with word-recognition and simply
reflect mechanical or organizational side-effects. While these
differences are interesting, they may not be important. Whether I
say p'in or p`in, you still know I am talking about a small sharp
object, or winning a wrestling match, or a decorative brooch
(depending on context, not pronunciation). We may be able to
build a perfectly adequate and even skillful basic model of
speech production and recognition without specifically trying to
produce these subtle differences -- and when the model runs, many
of these differences may simply show up, unbidden, as natural
concomitants of the way the model is organized. That's if we're
lucky.

My problem with all these phenomena that you and Martin throw out
in such profusion is that I can't tell what's important and needs
to be explained first and what's only a curiosity or an
embellishment, or a minor statistical trend somebody noticed. It
seems to me that we're still unable to explain how sounds are
reliably differentiated in the easiest possible circumstances,
and how even a hyperarticulated word is recognized. If we don't
have a good working model of such simple things, how on earth can
we expect to model something like a strategy for increasing the
contrast between the vowels in "donkey" and "flunky"? We don't
even know what contrast is, not to mention how we alter it.


------------------------------

I have proposed that we model pronunciation of a consonant
phoneme as control of an intended articulatory gesture
perception, which hearers reconstruct from the sounds
(behavioral outputs) because there is a well-known (publicly
known) finite set of these gestures, because they are maximally
different from one another in a thoroughly familiar environment
(one's own mouth, reliably assumed to be identical in all
relevant respects to the other person's mouth), and because the
hearer assumes that both she and the speaker control the same
number of them differentiated from each other in the same or
nearly the same ways.

All right, go ahead and model it that way. Your description above
tells us what the model has to accomplish, and why accomplishing
it might be desirable, and how it might be related to other
motivations, but it doesn't tell us what the model IS. First we
have to model an articulatory gesture, don't we? We have to
define exactly what constitutes a "gesture," how it is perceived,
and how the control systems are organized to produce perception
of a given gesture when a reference signal is set to a particular
value. I rather suspect that we will end up with what I call a
transition-level control system, but since I haven't yet got a
model of a configuration-control system for the articulators, I
can't guess what the transitions will prove to be, or how they
will be carried out.

I actually think that your proposed modeling project is out of
sequence, but if you think you can do it, go ahead. I'm still
trying to model vowel production.

Unless, of course, you mean "Let's pretend that we have actually
produced a model and that it will do the following things for the
following reasons..." In that case you're on shaky ground. How do
you know the proposed model will prove to do those things in the
way you describe? You may be quite right, but if you are it will
be a lucky guess, or an inspiration, and you still won't have an
actual model to demonstrate that you're right. Where would we go
from there?
-----------------------------------------

And when you hear win, ninny, minimum, thin, kin, gin, lint,
fin, sin, chin, din, tin, gimpy, and so on, as opposed to pin
(or bin). Then (assuming the point of view of one who doesn't
know English yet) you represent the contrast between pin and
bin by a difference in two sets of phonetic perceptions.

That is certainly a clear statement of your claim, so clear that
I feel no hesitation in doubting it. I think the same result can
be obtained without this awkward insistence on calculating the
pairwise differences among one word and all possible others. You
also have ninny, minny, manny, money, boney, bunny, burney,
burner, birder, boarder, border, badder, batter, battle, bottle,
bottler, butler, butter, mutter, matter ... it's simply endless.
A patient linguist might create a list of hundreds of thousands,
or millions, of experimentally established contrasts or lacks
thereof, but this is no indication that the brain distinguishes
words in that way. This is simply an unwieldy and unlikely model,
incredibly costly in terms of computational ability and unlikely
to be able to run in real time even given the brain's great
parallelism. And in no way does it explain why contrast exists in
the first place.

It should be possible to come up with a far simpler model that
would give the same effect -- I've mentioned a normalization
procedure as one possible method, akin to the Land model of color
reception which doesn't require any pairwise comparisons but
achieves the same result. I have no objection to your report on
the basic phenomenon. What I object to is the naive translation
of the linguist's way of investigating contrasts directly into a
proposed model of how the brain decides that THAT word is not the
same as THIS word. A judgement of contrast is the OUTCOME of a
process, not the process itself. You're begging the question.
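
One kind of normalization procedure, sketched with made-up numbers as
an illustration of the general idea rather than a worked-out model:
each incoming formant value is expressed relative to a running
per-talker average, so the same normalized values can come out of
very different absolute frequencies, and no word-by-word pairwise
comparison is involved.

    # Sketch of a normalization idea (made-up numbers): express formants
    # relative to a per-talker running average, so no pairwise comparison
    # among whole words is needed to get the same "vowel" out of different
    # absolute frequencies.
    def normalize(f1, f2, mean_f1, mean_f2):
        return f1 / mean_f1, f2 / mean_f2

    # Talker A and talker B differ in overall scale, but the normalized
    # values for "the same" vowel come out alike.
    print(normalize(300, 2500, 500, 1500))   # talker A -> (0.6, 1.666...)
    print(normalize(360, 3000, 600, 1800))   # talker B -> (0.6, 1.666...)
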
-----------------------------------------------------------
Best,

Bill P.

[From: Bruce Nevin (Fri 931015 14:15:19 EDT)]

Bill Powers (931014.1500 MDT) --

No, I'm not attributing agency to phonemes; yes, your non-shorthand way
of putting it is more explicit. Yes, your description of what you see on
your sound spectrograms is just what we are both talking about.

Terminal phoneme in an utterance with silence following: why ask me? Use
your senses. Augment your senses with the instrumentation that you have
developed. Observe these sounds in naturally occurring conversation;
performance notoriously differs when attention is drawn to pronunciation.
Is the final t of "don't" the same in "don't you", in "don't do it", and
in utterance final "don't!", and are these the same as in "top" and in
"butter"? Can you make them the same, and do you normally do so in
conversation? Don't ask me. Listen and feel and (in the case of your
instrumentation) see your own perceptions. Try sampling from tape
recorded conversation.

Your sound spectrographic setup may need some refinement. In the article
I cited in _Language_ (the only data I have handy), the "ee" vowel and
"ih" vowel are separated as much as any pair of vowels on a plot of F1
against F2. The values are roughly:

            F1        F2
    "ee"    300 Hz    2500 Hz
    "ih"    400 Hz    2200 Hz

Formant frequency does not vary with voice pitch; what varies with pitch
is the number of harmonics that make it through the filter presented by
the mouth at the two passed bands of frequencies that constitute the
formants. There are more harmonics per formant at lower fundamental
frequency (pitch) and fewer when the fundamental is higher.
(Nonetheless, as Lieberman and Blumstein describe, people appear to
perceive formants even when no harmonics pass to physically represent the
formant, apparently contextually if I remember right.) The plots of F1
against F2 that map well onto a plot of the physical location of closest
stricture made by the tongue in the mouth (the vowel triangle or vowel
trapezoid, a sagittal section of the oral cavity with the lips imagined
as being on the left side, velum and throat on the right side) are as
follows:

    F1 200

        300        i                           u

        400             I                    U

        500           eI                      oU

        600                E            ^

        700
                                           a
        800                    ae

        900
            2800   2400   2000   1600   1200   800
                                                    F2

These are of course very rough, given ascii display, and averaging
differences among the displays in the article. The horizontal spread of
the graph is too great, but I haven't the time or patience to tinker with
the tabs. The vowels are for a southern california dialect, as follows:

    i = ee "bead" u = "oo" "food"
    I = ih "bid" U = "u" "hood"
    eI = "a" "raid" oU = "o" "bode"
    e = "eh" "bed" ^ = "uh" "bud"
    ae = "a" "bad" a = "ah" (vowels of "pa" and "bod" same
for them)
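
For concreteness, a small sketch that treats the chart as a table of
rough (F1, F2) targets and picks the nearest one for a measured pair.
The F1 values follow the rows above; the F2 values are rough readings
of where the symbols sit on the axis, and the distance weighting is
arbitrary, so this only illustrates using the vowel space, it is not
the article's data.

    # Rough (F1, F2) targets in Hz; F1 from the rows above, F2 read very
    # approximately off the horizontal axis. Illustrative only.
    targets = {
        "i": (300, 2500), "I": (400, 2200), "eI": (500, 2300), "E": (600, 2000),
        "ae": (800, 1800), "a": (750, 1100), "^": (600, 1300), "oU": (500, 950),
        "U": (400, 1000), "u": (300, 870),
    }

    def nearest_vowel(f1, f2):
        """Pick the target with the smallest (arbitrarily weighted) distance."""
        return min(targets, key=lambda v: (targets[v][0] - f1) ** 2
                                          + ((targets[v][1] - f2) / 3.0) ** 2)

    print(nearest_vowel(320, 2450))   # lands on "i"
    print(nearest_vowel(420, 2150))   # lands on "I"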

>In the physics of sound production, you have a sound-generator
>consisting of two membranes with their edges almost in contact,
>under variable tension, with air being forced between them.
Electromyographic studies show muscle pulses contributing, so it's not
just equivalent to stretching the neck of an inflated balloon and letting
the emerging air stream set up a vibration. But yes, you have a source
of air pressure pulses in the larynx. I don't know anything about the effect
of reflected waves on the activity of the larynx; my understanding is
that there is very little reflection, mostly absorption. Perhaps Martin
or Osmo Eerola knows about work done on phase aspects of the resultant
auditory signal emerging from the mouth.

>When you say that people perceive differences in sounds when there
>are, in fact, no differences in the sound-waves, what do you mean by
>"in fact?"

I mean substituting a segment of recorded sound from one utterance in
place of a segment in a recording of another utterance. The first such
experiments were done in Germany in the 1930s using film sound track
technology. Then there were tape splicing experiments with magnetic
tape. Then it became possible to smooth out splicing artifacts (clicks
and the like), and now you don't need tape at all with digital sampling.
One could even in the early days splice the original sound back into the
frame, so that the two versions were on an equal footing, but I don't
know if anyone did. I also mean recognition experiments with synthesized
speech, where the auditory characteristics of the synthetic signal are
well understood and not dependent upon possible acoustic coupling effects
of the vocal tract such as you suggest. Unless you wish to claim that
recording technology does not capture physical features of speech that
the human ear does capture, I think these methods are not subject to the
objections you raise.
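
With digital sampling the substitution really is just array surgery.
Here is a bare sketch in Python; the sample indices in the usage line
are invented, and the short crossfade merely stands in for whatever
smoothing of splice artifacts one prefers.

    import numpy as np

    def splice(frame, donor, start, end, fade=64):
        """Replace frame[start:end] with donor (same length), crossfading a few
        samples at each cut so the splice itself does not click."""
        assert len(donor) == end - start
        out = frame.astype(float)
        seg = donor.astype(float)
        out[start:end] = seg
        ramp = np.linspace(0.0, 1.0, fade)
        out[start:start + fade] = (1 - ramp) * frame[start:start + fade] + ramp * seg[:fade]
        out[end - fade:end] = (1 - ramp) * seg[-fade:] + ramp * frame[end - fade:end]
        return out

    # e.g. put the flap from "throw" into the frame "bu_er" (indices made up):
    # butter_with_throw_r = splice(butter_samples, throw_flap_samples, 2400, 2720)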

If you substitute the r of throw, the t of butter, and the d of ladder
(in normal, unemphatic pronunciation in an american dialect)
interchangeably in the frames "th_ow", "bu_er", "la_er", "be_y" you get
"throw, butter, ladder or latter, betty". The segment taken from any one
sounds just like the segment taken from any of the others, that is, a
given frame with one segment substituted (that originally from "throw")
sounds identical to the same frame with a different segment substituted
(that originally from "butter").

When you substitute the r from "throw" into "be_y" you don't get "berry".
Yet that sound, which unmistakably is a t in "butter" and a d in "buddy",
is unmistakably an r in "throw". This is what the substitution
experiments show.

Such experiments qualify, I think, as PCT experiments. A potential
disturbance (interchanging bits of speech perceived hitherto as r vs. t
vs. d) turns out to be no disturbance. Likewise, work with synthesized
speech, where an acoustic variable is varied and hearers report no change
up to a point, and then report hearing an adjacent sound category.

In careful, emphatic speech you get the r of "berry" in "throw", the d of
"dog" in "ladder", and the t of "top" in the other words. These sounds
are not substitutable in all the same frames. You also get "the"
pronounced like "thee" in the same style of speaking, and "a" prounounced
like the vowel of "ace", and the final t of "don't" pronounced like the t
of "top".

You are absolutely right about sound spectrography: it is a tool for
representing visually some features of the acoustic signal, and it has
limitations. However, speech synthesis using only features captured by
sound spectrography produces a signal in which people recognize phonemic
distinctions, whether in words or nonsense syllables. Features left out
contribute to the naturalness and expressiveness of the voice, but appear
to have nothing to do with phonemic contrast and speech recognition.

There's lots to look at in the _Journal of the Acoustical Society of
America_ (JASA). A good bit is summarized in Lieberman & Blumstein's
book. And there is the sensory evidence. But you have to dig below the
category level for that. Otherwise, you don't know what you're talking about.

My emphasis on apparent paradoxes, like the p in spin being neither p nor
b, and the r in throw being interchangeable with the t in butter and the
d in budder ("that rose bush is a good budder, but most of the buds
fail") but being an r nonetheless, is not merely to plunge into the sea
of facts known about speech. My purpose is to bring to your attention
levels of perception of speech, your own and that which you hear from
others, below the level of phonemic categories. Until you attend to
those lower levels of perception, you do not know what perceptions may be
controlled, and your work on these problems flies blind. Worse than
blind: the in-flight movie is being projected on the inside of the
aircraft windshield.

So far as I know, no one has brought up a statistical trend in these
discussions.

I think the same result can
be obtained without this awkward insistence on calculating the
pairwise differences among one word and all possible others. You
also have ninny, minny, manny, money, boney, bunny, burney,
burner, birder, boarder, border, badder, batter, battle, bottle,
bottler, butler, butter, mutter, matter ... it's simply endless.
A patient linguist might create a list of hundreds of thousands,
or millions, of experimentally established contrasts or lacks
thereof, but this is no indication that the brain distinguishes
words in that way. This is simply an unwieldy and unlikely model,
incredibly costly in terms of computational ability and unlikely
to be able to run in real time even given the brain's great
parallelism.

This is just why the child learning language must develop speech
perception functions for interchangeable parts of all those myriad
utterances. The utterances are effectively infinite (not actually
infinite in the child's experience). The interchangeable parts are
manageable, and once perceptual functions are developed that control
phonemic perceptions, they provide perceptual means for controlling a
vastly greater vocabulary than is possible without them. Once again: the
analytical process is actually analytical for a linguist, appears as
though analytical in child language learning, but might well not be, and
is not at all involved in later control of language, after the
lowest-level interchangeable parts, the phonemic elements, are controlled
perceptions.

What I object to is the naive translation
of the linguist's way of investigating contrasts directly into a
proposed model of how the brain decides that that word is not the
same as this word.

You're objecting to this as a scheme for the adult's control of language;
I am not claiming that it is a scheme for the adult's control of
language.

A judgement of contrast is the outcome of a process, not the process
itself.

You're identifying "contrast" with "perception of difference". This
identification of phonemic contrast with phonetic difference does not
work. But to recognize that, you have to look at phonetic perceptions
below the level which you, as an adult speaker of english, control as
phonemes. When you're on the category level, everything looks like a
category to you. "Contrast" seems identical to "difference" because the
only differences you can perceive are the differences (on the category
level) between one phoneme and another.

Take the vowel space plotted above. Take the indicated vowel positions
on the f1/f2 coordinates to be the target positions. (The
"hyperarticulated" targets are actually a bit farther out, but no
matter.) Here's the vowel space again:

    F1 200

        300        i                           u
                        (i)
        400             I     (I)            U

        500           eI          @           oU

        600                E            ^

        700
                                           a
        800                    ae

        900
            2800   2400   2000   1600   1200   800
                                                    F2

Near the center, at about F1=500 and F2=1600, I have placed @ to
represent a schwa vowel. (I don't know if those are accurate for schwa,
but no matter.) In unstressed, deemphasized speech, all the vowels are
pronounced closer to schwa. I have put in parentheses a reduced
pronunciation of i and of I. The reduced i vowel is close to the target
for I, but the reduced I vowel is even farther in, even closer to @, so
that their relative distinctness is preserved, even though the space in
which they are distributed is reduced. (There's a third dimension
suggested by eI and oU, and that's diphthongization, especially important
when you get into dialect comparisons, but we'll ignore that.)

Maybe Land's scheme can be adapted to capture this in terms of relative
deviation from schwa. That's fine, I like that. The point is precisely
that the phonemes are controlled relative to one another, and not as
absolute values of lower-level phonetic perceptions. I don't care if you
call it something other than contrast, so long as the phonemic elements
are relative rather than absolute.
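
A toy version of "relative rather than absolute", with numbers loosely
taken from the chart and the reduced values simply constructed partway
toward schwa (so this illustrates the representation rather than
testing it): express each vowel as its deviation from schwa scaled by
the current spread of the space, and the full and reduced
pronunciations of i and I come out with the same relative positions.

    # Deviation-from-schwa representation. Numbers are loose readings of the
    # chart plus constructed "reduced" values partway toward schwa.
    schwa = (500.0, 1600.0)                       # (F1, F2) for @, as placed above

    def relative(vowel, spread):
        """Deviation from schwa, scaled by how spread-out the space currently is."""
        return ((vowel[0] - schwa[0]) / spread, (vowel[1] - schwa[1]) / spread)

    full_i, full_I = (300.0, 2500.0), (400.0, 2200.0)     # careful speech
    red_i, red_I = (400.0, 2050.0), (450.0, 1900.0)       # reduced toward schwa
    print(relative(full_i, spread=1000.0), relative(full_I, spread=1000.0))
    print(relative(red_i, spread=500.0), relative(red_I, spread=500.0))
    # Both lines show i farther from schwa than I, in the same direction:
    # the contrast lives in the relative positions, not the absolute Hz values.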

    Bruce
    bn@bbn.com