[From Bruce Nevin (2002.10.07 20:24 EDT)]
Bill Powers (2002.10.08.0848 MDT)–
Bruce Nevin (2002.10.07 23:17 EDT)–
What I gave was not rules, but descriptive facts.
Well, then some of the descriptive facts are not necessarily 100%
accurate, at least as far as speech recognition is concerned.
They are accurate descriptions of the articulation and acoustics
of careful speech (the samples on which the sound spectrograph
descriptions are based). In careful speech we attempt to control
articulatory and acoustic perceptions completely and accurately. We
control with high gain. In more ordinary speaking, we control those
perceptions with less gain. This seems to be a result of conflict between
two kinds of aims, commonly expressed as minimization of confusion for the
hearer and minimization of effort by the speaker. What we find in typical
speech is a compromise varying somewhere between ultra-precise diction
and slurred incomprehensibility, such as we would expect from a conflict
of this kind.
The speaker has the following conflicting aims:
(1) Minimize effort in articulation. Boersma says the speaker minimizes
the number and complexity of gestures and coordinations. I suspect this
is a side effect of one gesture interfering with another within the close
timing constraints of speech.
(2) Minimize perceptual confusion of utterances that have different
meanings; maximize their contrast.
The listener has the following conflicting aims:
(3) Minimize the effort needed to classify speech sounds into phonemic
categories; use as few perceptual categories as possible.
“In a world with large variations between and within speakers, the
disambiguation of an utterance is facilitated by having large perceptual
classes into which the acoustic input can be analyzed: it is easier to
divide a perceptual continuum into two categories than it is to divide it
into five.” (Boersma, Functional Phonology, p. 2).
(4) Minimize miscategorization; maximize use of acoustic differences.
(5) The speaker and listener both intend to maximize the information that
is successfully transmitted from the one to the other. (5) conflicts with
both (1) and (3): decreased gain in speaking, and decreased discrimination
of acoustic distinctions in listening, tend to reduce the amount of
information successfully transmitted from speaker to listener. But (5) is
not reducible to controlling with high gain; rather, controlling with
high gain (by the speaker, the listener, or both) is a consequence of
controlling (5).
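Boersma's observation that it is easier to divide a perceptual continuum into two categories than into five can be illustrated with a toy simulation. Everything below (the 0-1 continuum, the Gaussian noise level, the trial count) is an illustrative assumption, not measured speech data; the point is only that fewer, wider categories survive production variation with fewer miscategorizations.

```python
# A toy perceptual continuum divided into n equal-width categories.
# The continuum, noise level, and trial count are illustrative assumptions.
import random

random.seed(0)

def categorize(value, n_categories):
    """Assign a value on a 0-1 continuum to one of n equal-width categories."""
    return min(int(value * n_categories), n_categories - 1)

def error_rate(n_categories, noise=0.08, trials=10000):
    """Fraction of tokens miscategorized when production is perturbed by noise."""
    errors = 0
    for _ in range(trials):
        intended = random.random()                  # speaker's intended target
        heard = intended + random.gauss(0, noise)   # between/within-speaker variation
        heard = min(max(heard, 0.0), 1.0)
        if categorize(heard, n_categories) != categorize(intended, n_categories):
            errors += 1
    return errors / trials

# Two wide categories are miscategorized far less often than five narrow ones:
print(error_rate(2))
print(error_rate(5))
```

This is the (3)-versus-(4) trade-off in miniature: coarser categories make classification robust, at the cost of discarding acoustic differences that a finer division could exploit.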
(1) often conflicts with (2) for the speaker. If the listener is confused
about whether you mean “the ladder” or “the latter” you can
make the distinction, but ordinarily you do not.
The conflict of (3) with (4) for the listener is most evident when
hearing a dialect different from their own, or when learning a new
language. “Are you marred?” asked my co-worker in Florida.
“What?” “Are you marred?” One beat. Two beats.
“Oh! No, I’m not married.” (I wasn’t then.) My wife is from
near the “wendy” city, and for her the pin is mightier than the
sword. (Not making the i/e distinction, speakers in the Chicago area
produce an intermediate sound. Likewise for the distinction between
baud and bah!, and between pod and pa’d as in
“Pa’d go if he could”.) Japanese speakers famously have
difficulty learning to distinguish l and r in English. English speakers
typically merge all the nasalized vowels of French into one
… if the sound is produced by a particular
mechanical system, it will have characteristics reflecting the properties
of that system. But if it is created some other way, it may have some
different characteristics and still be recognizable as the same utterance
… Also, words are produced by different speakers who have very
different sizes and shapes of vocal tracts, yet the result is still
recognizable as “the same utterance.”
And these speakers do not all have the same reference values for
“the same” phonemes, or sometimes the same references for the
phonemes in “the same” words. Part of the complexity resisted
in (3) can be the complexity of maintaining someone else’s different
references for the same thing, in parallel to, and mapped to, one’s own.
One way of resisting that complexity is, over time, to come to speak in
that dialect which was once foreign to us. Another is to extend to this
wider diversity a skill that we all exercise in switching between
distinct sets of references. (Example: start interviewing someone
formally, and listen to how their talk changes when interrupted by a
phone call from their neighbor.) We ordinarily don’t notice such
switching, and indeed it is important for its social functioning that it
not be noticed.
If the presumption is that what is heard is an utterance in English, a
listener who knows English will try to force what is heard into the set
of possibilities that the language allows. Also, in the face of these
conflicts, and as a means of controlling (5), the listener brings to bear
other kinds of information beyond the acoustic signal, starting with
phonotactics (the possible syllables in the language) and lexicon (the
inventory of possible words). If you find a match to a possible syllable,
and a partial match to the phonemes of that syllable, you
“hear” the phonemes.
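That matching step might be sketched as follows. The five-word lexicon, the position-wise similarity score, and the 0.5 threshold are all illustrative assumptions, not a model of real lexical access; the sketch only shows how a partial match to a known word can be "heard" as that word.

```python
# A toy listener forcing noisy input into the language's possibilities.
# Lexicon, similarity measure, and threshold are illustrative assumptions.

LEXICON = ["heart", "hearth", "hart", "hat", "art"]

def similarity(heard, word):
    """Crude phoneme-by-phoneme score: position-wise segment agreement."""
    matches = sum(1 for a, b in zip(heard, word) if a == b)
    return matches / max(len(heard), len(word))

def perceive(heard, threshold=0.5):
    """Return the best lexical match; a good-enough partial match is
    'heard' as that word, anything worse is not resolved at all."""
    best = max(LEXICON, key=lambda w: similarity(heard, w))
    return best if similarity(heard, best) >= threshold else None

print(perceive("heart"))   # exact match
print(perceive("hearx"))   # partial match still resolves to "heart"
print(perceive("zzzzz"))   # no match: nothing is "heard"
```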
between repetition and imitation. Repetition is phonemic; imitation is
phonetic.
Not sure what that last distinction means.
Christine wanted me to imitate the ‘correct’ pronunciation of
Swedish as she perceives it. If she said an utterance meaning “my
shinbone hurts” in Swedish, Dag could repeat the same utterance
without imitating it at all, and indeed there would be differences in his
repetition. If I spoke Swedish, and she said an utterance meaning
“my shinbone hurts” in Swedish, I could repeat it rather than
imitate it. She might judge that I said it in a funny way or not quite
right, but she wouldn’t deny that I had repeated what she said. Because I
obviously and avowedly do not speak Swedish, and because of the
contextual emphasis on pronunciation, she was returning judgements on my
imitation of her pronunciation, not on my repetition of the
words. Imitation is phonetic; repetition is phonemic.
Suppose we take two utterances, preferably nearly the same (e.g.
heart vs. hart or heart vs. hearth), and have
a native speaker pronounce them randomly intermixed. We ask a second
speaker to guess which was said. Linguists doing this test have found
that for pairs like heart vs. hart the guesses are right
about 50% of the time, and for pairs like heart vs. hearth
they are correct about 100% of the time. Reproducing the difference
between the members of the second pair is simply a matter of repeating one
word or the other; trying to reproduce the differences between the members
of the first pair is in the realm of imitation.
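One way to see why the numbers come out at about 50% and about 100%: suppose the guessing listener registers only phonemic categories, so two words in the same category are indistinguishable and force a coin-flip. The category mapping below is an illustrative assumption, not phonetic fact.

```python
# A toy version of the pair test: the listener hears phonemic categories only.
# The PHONEMIC mapping is an illustrative assumption.
import random

# 'heart' and 'hart' collapse to one category; 'hearth' stays distinct.
PHONEMIC = {"heart": "hart", "hart": "hart", "hearth": "harth"}

def guess(spoken, pair):
    """Guess which member of the pair was said, from phonemic hearing alone."""
    heard = PHONEMIC[spoken]
    candidates = [w for w in pair if PHONEMIC[w] == heard]
    return random.choice(candidates)

def accuracy(pair, trials=10000):
    hits = 0
    for _ in range(trials):
        spoken = random.choice(pair)
        if guess(spoken, pair) == spoken:
            hits += 1
    return hits / trials

print(accuracy(("heart", "hart")))    # about 0.5: same phonemic category
print(accuracy(("heart", "hearth")))  # 1.0: phonemically distinct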
“But of course,” you say, “they’re different sounds”.
But remember, the r and l of Japanese are different sounds, but it
doesn’t matter whether you say arugato or alugato in Japanese. In many
languages the difference between t and th is not phonemic, so if they had
a word that sounded like “heart” or “hearth” those
would be two repetitions of the same word. Some children in a stage of
their learning English as their native language do not distinguish
l and y: less vs. yes. In due course, they
learn (but never yearn, unless maybe if they’re ridiculed).
How could these
children come to produce the adult pronunciations? Why doesn’t Katrina
still say nowsman for snowman? Optimality theory gives a
simpler explanation (concerning the relative rank of a constraint against
complex syllable onsets vs. faithfulness to what is heard from adults),
and this accounts not only for the minority-case metathesis but also for
the majority-case conformity to the norm.
Sounds like a much more complicated explanation to me, but never mind. I
don’t think I’ve communicated exactly what I have in mind here. It’s
really much simpler.
To clarify: assume each constraint is imposed by a control loop in the
speaker-listener. One loop is controlling the faithfulness of what is
pronounced to the memory of what is heard from adults. The remembered
acoustic image is “stuck”. The other constraint concerns what can be
in a syllable, or how complicated the articulations are permitted to be
at the beginning and end of a syllable (the syllable onset and coda);
syllable-initial st is too complicated, but syllable-final
ks is easy enough. (I could have tested this by having a
conversation with her about a stuck thing, two stuck
things, or the like, so that the s would close the preceding syllable
instead of being syllable-initial, but I didn’t think of it.) She was
capable of pronouncing st, just not at the beginning of a syllable.
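The constraint-ranking account can be sketched as a small Optimality Theory evaluator over the nowsman/snowman case. The candidate set and the crude violation counts are illustrative assumptions; LINEARITY, a standard OT faithfulness constraint against reordering, is added here so that deletion and metathesis come apart under the adult ranking.

```python
# A toy OT tableau: ranked constraints pick a winning candidate.
# Candidates and violation-counting schemes are illustrative assumptions.
from collections import Counter

VOWELS = set("aeiou")

def complex_onset(candidate, target=None):
    """*COMPLEX-ONSET: a violation if the word begins with a consonant cluster."""
    return int(len(candidate) > 1 and candidate[0] not in VOWELS
               and candidate[1] not in VOWELS)

def max_io(candidate, target):
    """MAX-IO: one violation per target segment deleted from the candidate."""
    missing = Counter(target) - Counter(candidate)
    return sum(missing.values())

def linearity(candidate, target):
    """LINEARITY (crudely): positional mismatches penalize reordering."""
    return (sum(a != b for a, b in zip(candidate, target))
            + abs(len(candidate) - len(target)))

def evaluate(target, candidates, ranking):
    """The winner has the best violation profile; earlier constraints in the
    ranking strictly outrank later ones (lexicographic comparison)."""
    return min(candidates, key=lambda c: tuple(con(c, target) for con in ranking))

target = "snowman"
candidates = ["snowman", "nowman", "nowsman"]

child = [complex_onset, max_io, linearity]   # *COMPLEX-ONSET outranks faithfulness
adult = [max_io, linearity, complex_onset]   # faithfulness outranks *COMPLEX-ONSET

print(evaluate(target, candidates, child))   # nowsman: the metathesized form
print(evaluate(target, candidates, adult))   # snowman
```

Reranking one constraint flips the winner, which is the OT story for how the child's pronunciation converges on the adult norm.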
Every language has stateable regularities as to what is a possible
syllable. Linguists call this the phonotactics of the language.
Utterances that violate these regularities are difficult for speakers of
that language to pronounce, and in fact are ‘corrected’ to canonical
syllables. Beginning students of German sometimes have difficulty
learning to pronounce an initial ts as in zu, die
Zimmer, and (worse yet) tsv as in zwei. English
syllables don’t begin with ts. A way to get them over the hump is
to demonstrate that an acceptable English expression, “Hats
off!”, has this ts in it. Then they can extrapolate to
“ha-tsoff”, hæ tsof. (Similarly, “pot o’
coffee” as a stepping stone to pronouncing the flapped Spanish r.)
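The phonotactic regularity itself is easy to state as data: a set of permitted syllable onsets. The inventories below are tiny illustrative subsets, not complete lists for either language.

```python
# Permitted syllable onsets as plain sets (illustrative subsets only).
ENGLISH_ONSETS = {"", "p", "t", "k", "s", "h", "st", "str", "pr", "tr"}
GERMAN_ONSETS = ENGLISH_ONSETS | {"ts", "tsv"}   # zu, Zimmer, zwei

def onset_ok(cluster, onsets):
    """A cluster is pronounceable syllable-initially iff it is in the inventory."""
    return cluster in onsets

print(onset_ok("ts", ENGLISH_ONSETS))  # False: English syllables don't begin with ts
print(onset_ok("ts", GERMAN_ONSETS))   # True: German ones do
print(onset_ok("st", ENGLISH_ONSETS))  # True: st is a fine English onset
```

The “Hats off!” trick amounts to noticing that the very same ts is legal when the s belongs to the coda of the preceding syllable rather than to the onset.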
Rick Marken (2002.10.08.1030) –
I believe that Warren did do some studies
where the components of the auditory sequences were speech sounds rather
than tones. I think the results were the same in terms of sequence
perception. But I think different sequences of auditory elements can be
heard as different events even if the sequence of the elements cannot be
identified.
From what Lieberman said (quoted earlier) one would expect that words
and nonsense syllables would be perceived at a faster (segment per
second) rate than sequences of speech sounds that violate the
phonotactics (syllable constraints) of the subject’s language, and
possibly that such speech sounds in turn would be perceived at a faster
rate than non-speech segments can be perceived.
Rick Marken (2002.10.09.1620)–
I think Bruce Nevin, who has the linguistic
and programming expertise, should take the lead on this one, as you
suggest.
Linguistics, yes; programming expertise, no.
Bill Powers (2002.10.09.1546 MDT)–
“Ladle Rat Rotten Hut” who
“lift on the ledge of a lodge, dock florist”
The full story can be found at
I’d love to see a
where you’ll find a link to the rest of Anguish Languish, the
1956 book by H.L. Chace. The introduction alone is worth the price of
admission. (Exploratorium says 1940. Possibly that was the date of the
LLWH story. There are 5 copies of the book at
and the four that mention a date say 1956.)
At 09:23 AM 10/8/2002 -0600, Bill Powers wrote: