phonetics

[From Bruce Nevin (2002.10.06 16:20 EDT)]
Bill,
Your description works fine for vowels and diphthongs, where you’ve got
auditory configurations of formants and transitions in them. (Formants
are frequencies at which harmonics of the fundamental pitch are not damped by the configuration of the jaw, tongue, lips, etc. The
fundamental frequency and its harmonics are generated by the vocal
folds.) However, there are much faster formant transitions that are not
perceived as vowels or diphthongs.
An oral stop consonant such as p, t, k, b, d, g is signaled by silence in
the speech signal, and by a very brief transition in the formants of
adjacent vowels before and/or after it. These transitions that
distinguish one consonant from another are about 50 ms in duration.
(There are other cases, such as the formants in an adjacent nasal stop m,
n, ng, etc., and there are other characteristics of different kinds of
consonants, but we’ll ignore them. What is said here about the simplest
case, oral stops, applies to all consonants.)
The curvature of the formants as seen on a sound spectrogram is
determined by the position of the tongue and lips for the stop and their
positions for the vowel. The configuration of the tongue and lips
determines which harmonics of the speech signal are damped and which are
allowed to pass. (Other articulators of the vocal tract, such as the
epiglottis, are used in some languages.) Before the vowel /i/ (as in
“beet”), a /d/ is indicated by an upward transition of the
second formant from perhaps 2.2 to 2.7 kHz; before the vowel /u/ of
boot a /d/ is indicated by a downward transition of the second
formant from about 1.2 to 0.6 kHz. In both cases, the first formant rises
from near 0 to about 0.25 kHz. (Formants at higher frequencies are of marginal
importance for speech recognition.) The diphthongs of ye and
you present similar but slower transitions from the [i] sound
associated with the initial /y/. The tongue position for /d/ is near the
tongue position for /y/, so the transitions to a following vowel are
similar.
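
To make the source-filter picture concrete, here is a minimal sketch that synthesizes a rough /d/-to-/i/ transition: a 100 Hz pulse train stands in for the glottal source, and two resonators stand in for the first and second formants, with a 50 ms transition taken loosely from the figures above. The pulse-train source, the bandwidths, and the exact frequencies are illustrative assumptions, not measurements.

```python
# Toy source-filter sketch of a /d/-to-/i/ formant transition
# (all numerical values are illustrative assumptions, not measurements).
import numpy as np
from scipy.io import wavfile

fs = 16000                      # sample rate (Hz)
dur = 0.30                      # total duration (s)
n = int(fs * dur)

# Source: a 100 Hz impulse train standing in for the glottal pulses.
f0 = 100.0
source = np.zeros(n)
source[(np.arange(0, dur, 1.0 / f0) * fs).astype(int)] = 1.0

def resonator(x, freqs, bw, fs):
    """Two-pole resonator with a time-varying center frequency (one value
    per sample). This mimics a single formant being 'not damped'."""
    y = np.zeros_like(x)
    r = np.exp(-np.pi * bw / fs)            # pole radius from bandwidth
    for i in range(2, len(x)):
        theta = 2 * np.pi * freqs[i] / fs
        a1, a2 = 2 * r * np.cos(theta), -r * r
        y[i] = x[i] + a1 * y[i - 1] + a2 * y[i - 2]
    return y

# Formant tracks: 50 ms transition out of the /d/ closure, then steady /i/.
trans = int(0.050 * fs)
f1 = np.concatenate([np.linspace(200, 270, trans), np.full(n - trans, 270.0)])
f2 = np.concatenate([np.linspace(2200, 2700, trans), np.full(n - trans, 2700.0)])

speech = resonator(source, f1, 60.0, fs) + 0.5 * resonator(source, f2, 100.0, fs)
speech /= np.abs(speech).max()
wavfile.write("di_sketch.wav", fs, (speech * 32767).astype(np.int16))
```

Playing di_sketch.wav will not sound like natural speech, but the rising second formant over the first 50 ms is the cue under discussion.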
If the speech signal were segmented as our alphabetic conventions
indicate, we would typically produce, and recognize, 20 to 30 segments
per second (Lieberman & Blumstein, Speech Physiology, Speech Perception, and Acoustic Phonetics, p. 145). The fastest rate at which sounds can be identified is about 7 to 9 segments per second, and at 20 to 30 segments per second sounds merge into an indifferentiable ‘tone’ (Miller, 1956, “The magical number seven, plus or minus two: Some limits on our capacity for processing information,” Psychological Review 63:81-97). So it seems that people are not directly perceiving
phonemic segments per se in the acoustic signal of speech, however
convenient a segmental representation of phonemes may be analytically.
People seem to perceive speech in syllables, and yet are able to repeat
back “the segments” of those syllables more slowly or even in
isolation. It seems that partly concurrent features of syllables
are perceived as sequential segments at a rate far faster than any
actual sequence of segments could be perceived.
The target positions that the tongue reaches in artificially slow and
careful speech are not actually reached in normal speech. The syllable
bat begins with the lips closed. As the lips open and spread,
causing a formant transition, and vibration of the vocal cords begins
(English /b/ is not voiced throughout in initial position), the tongue
moves toward the position for the vowel that you hear fully articulated
if you stretch out the emphasized syllable in “He’s
bad!” or “I’ve been had!” In ordinary
running speech - that is, if the syllable bat is not artificially
prolonged - before the lips fully spread and the tongue reaches the low
front target position for that vowel, the tongue starts that 50 ms.
transition moving up through the position for the vowel in bit in
order to make the closure for the final /t/ stop consonant.
The syllable bout begins with the tongue moving toward the target
/a/ of aha! but by the time it reaches the low-mid centralized
position of the [^] vowel of up it starts moving high and back to
the position (with lip rounding) for the /u/ of boot, which in
turn it never reaches before that final, fast transition to /t/ as
described above.
So we have transitions of two kinds going on here: a slow transition for
a diphthong (the closing diphthong of bout or bite or
barn if you pronounce postvocalic r, or the opening diphthong of
wan or win or run); and the 50 ms transitions that
differentiate one consonant from another.


At 08:13 AM 10/5/2002 -0600, Bill Powers wrote:

I have some speech signal software
installed, but haven’t learned to use
it. I just need to pry the time free from
other obligations.

Is it the sort of software I could acquire at reasonable
cost?

Praat is free from Paul Boersma at

http://www.fon.hum.uva.nl/praat/

    /Bruce Nevin

[From Bill Powers (2002.10.06.1808 MDT)]
Bruce Nevin (2002.10.06 16:20 EDT)--

Bruce, your discourse on phonetics is beautifully clear. I will have to
study it for a while to make sure I understand everything, at least as much
as I'm going to understand without being a linguist. I'm actually pretty
excited about maybe learning something real here.

I did download Praat and the .pdf manual, and had a very brief preliminary
look at the program. The power and richness of this program are almost
beyond belief. Just fooling around I was able to create a sound (from a
canned formula) and play it -- no problem with my sound card or anything
else. It just worked. I am vastly impressed and delighted. I highly
recommend that others on CSGnet obtain this program. My first impression is
that we ought to be able to do some real PCT research in linguistics using
it -- that is, if at least one person involved knows enough about
linguistics to guide the effort. Don't look around, Bruce, I mean you.

It's going to take a while to learn all the things that Praat will allow us
to do. Already I’m wondering if we can put formants under mouse control and
synthesize some speech-like noises. Now I have to sit on my enthusiasm so I
don't drop all the other things I'm trying to get done.

Best,

Bill P.

[From Bruce Nevin (2002.10.06 21:10 EDT)]
Bill Powers (2002.10.06.1808 MDT)–
Oh wonderful! I can’t say how delighted I am! You had me clapping my
hands like a child!
Boersma is very good. I came across him in the last stages of my
dissertation work, in connection with Optimality Theory. Optimality
Theory proposes a generative source of all possible candidates for
utterance, and a ranked set of constraints, such that one candidate is
‘better’ than any of the others with respect to the constraints. Most
constraints are thought to be universal (apply to all languages) though
some may be language-specific. Differences between languages are due to
differences in constraint ranking.
A simple stock example. One constraint, Onset, says that a syllable must
begin with a consonant. A family of constraints, called Faithfulness
constraints, say that an utterance as produced must match its underlying
form – a kind of idealized representation of what it ought to be
– faithfully. An underlying form lacks a consonant at the beginning of a
syllable. In a language that ranks Onset above Faithfulness, a consonant
is inserted at the beginning of the syllable. English is an example,
inserting a glottal stop in such cases (“apple” spoken in
isolation) or retaining the consonant of a preceding word which is
dropped elsewhere (“an apple” vs. “a book”) to
satisfy the Onset requirement. Because this is automatic and required in
English, we don’t notice it and most English speakers have difficulty
recognizing that a glottal stop (as in the negation “uh-uh”) is
a consonant. On the other hand, in a language that ranks Faithfulness
above Onset, no consonant is inserted. In Hawai’ian, the glottal stop /’/
is a distinct consonant and not just an automatic, unconsciously inserted
syllable-separator.
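
The ranking logic itself is mechanical enough to sketch in a few lines. In this toy version (the candidate forms, constraints, and violation counts are illustrative stand-ins, not Boersma’s formalism), a candidate wins if it has the best violation profile read in ranking order:

```python
# A minimal sketch of Optimality-Theory evaluation: candidates are scored
# against constraints in ranking order, and the one with the best violation
# profile (compared lexicographically, highest-ranked constraint first) wins.
# The constraints and candidates below are toy stand-ins for the
# Onset/Faithfulness example, not taken from Boersma.

def onset(candidate, underlying):
    """Violated if the form does not begin with a consonant."""
    return 0 if candidate[0] not in "aeiou" else 1

def faithfulness(candidate, underlying):
    """One violation per segment inserted or deleted (crude count)."""
    return abs(len(candidate) - len(underlying))

def evaluate(underlying, candidates, ranking):
    # Lexicographic comparison of violation tuples = constraint ranking.
    return min(candidates,
               key=lambda c: tuple(con(c, underlying) for con in ranking))

underlying = "apple"
candidates = ["apple", "?apple"]       # "?" stands for an inserted glottal stop

# English-like ranking: Onset outranks Faithfulness -> glottal stop inserted.
print(evaluate(underlying, candidates, [onset, faithfulness]))   # ?apple
# Ranking Faithfulness above Onset -> nothing inserted.
print(evaluate(underlying, candidates, [faithfulness, onset]))   # apple
```

Reversing the ranking reverses the winner, which is the whole content of the idea that differences between languages are differences in constraint ranking.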
It is tempting to think of the constraints as controlled variables.
Boersma’s book Functional Phonology: Formalizing the interactions between articulatory and perceptual drives is at
http://www.fon.hum.uva.nl/paul/papers/funphon.pdf
I haven’t finished working through it. I haven’t been up to tackling
the math. A terminological stumbling block is that he uses the term
‘perceptual’ narrowly, meaning acoustic perception, or sometimes
recognition as distinct from production of speech, so we have to make
PCT-benign translations. He has a lot of data about aerodynamics,
myoelasticity, etc. and attempts to construct a testable model.

Have to run pick up my daughter.

    /Bruce Nevin

[From Bill Powers (2002.10.07.1610 MDT)]
Bruce Nevin (2002.10.06 16:20 EDT) –

An oral stop consonant such as p, t, k, b, d, g is signaled by silence in the speech signal, and by a very brief transition in the formants of adjacent vowels before and/or after it. These transitions that distinguish one consonant from another are about 50 ms in duration.
I’ve been playing with Praat, which makes it easy to record vocalizations
and then isolate segments for playing back and examining both amplitude
and frequency (spectrum) data. When I wanted to see the intensity
envelope of the data, naturally Praat had an option for doing
that.
I’ve been pretty surprised by what I see in looking at consonants in
utterances like “ah da ba ta ka ga” (Mary came in to see what
was wrong with me). As you say, the onset transitions are rapid, but not
anywhere near as different (in intensity profile) as I thought they would
be. Has anyone plotted the rate of change of intensity against time? I
see that Praat can export .WAV files, and I found a Pascal program for
reading them, so maybe I’ll be able to look at such plots.
I think that to look at the consonants more closely we’ll need to plot
intensity for slices of the spectrum against time. Of course Praat
provides for doing that! I just haven’t figured out how to do it yet. Of
course what I will find is already well known, otherwise Praat wouldn’t
provide for it. But I don’t know it yet.
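
For what it’s worth, once a .WAV file has been exported from Praat, that “intensity for slices of the spectrum against time” plot can be roughed out with off-the-shelf tools. A minimal sketch; the filename and the band edges are placeholders:

```python
# Band-limited intensity envelopes over time from a WAV file exported from
# Praat. scipy's spectrogram gives energy per frequency bin per frame;
# summing bins within a band gives that band's intensity envelope.
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram
import matplotlib.pyplot as plt

fs, x = wavfile.read("ah_da_ba.wav")             # placeholder filename
x = x.astype(float)
if x.ndim > 1:                                   # mix down if stereo
    x = x.mean(axis=1)

f, t, Sxx = spectrogram(x, fs=fs, nperseg=256, noverlap=192)

bands = [(0, 500), (500, 1500), (1500, 3000)]    # Hz; arbitrary slices
for lo, hi in bands:
    sel = (f >= lo) & (f < hi)
    intensity = 10 * np.log10(Sxx[sel].sum(axis=0) + 1e-12)   # rough dB scale
    plt.plot(t, intensity, label=f"{lo}-{hi} Hz")
    # Rate of change of intensity against time, if wanted:
    # plt.plot(t[1:], np.diff(intensity) / np.diff(t))

plt.xlabel("time (s)")
plt.ylabel("band intensity (dB, arbitrary reference)")
plt.legend()
plt.show()
```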
I’m getting the impression that some of the rules you described are not
really as precise as one might think. That is, in some instances of
saying a word the rule might clearly describe what is observed, but in
other instances it could not be seen to apply, yet the word might seem as
recognizable as ever. I found I could make a pretty good “t”
in the word “top” just by playing a segment of the /ah/ vowel;
the onset of the vowel was then abrupt, and I could clearly hear the /t/.
Unfortunately, imagination can also supply the /t/, as I discovered in
trying other things. It’s not easy to get that imagination switch turned
totally off.
I think it’s important to separate rules that apply on the average from
rules that apply in every single instance. We can hope to model the
latter, but not the former.
…So it seems that people are not directly perceiving phonemic segments per se in the acoustic signal of speech, however convenient a segmental representation of phonemes may be analytically.

I think we need to think in levels here. Have you seen Rick’s experiment
with perceptions of different levels? He did them with visual patterns,
but there’s no reason we couldn’t do the same thing with audition, given
a collection of sound files that could be used in a “tracking”
experiment. In HPCT, there are perceptions going on at all levels at the
same time. The perceptions of lower order can change faster than those of
higher order. So it’s possible that auditory event recognition (words)
has a maximum speed of 7 - 9 per second, but perception of lower-order
aspects of sound could change much faster. One of the nice experiments
Rick did showed that a person could detect that different elements of a
pattern were present, but, above a certain speed of presentation, not the
order in which they occurred. It’s possible that people can distinguish
the presence of different phonemes at certain speeds of presentation but
still be unable to tell what words they make. Rick’s experimental design
might work for auditory tests. Would this be anything new in
linguistics?

Best,

Bill P.

[From Bruce Nevin (2002.10.07 23:17 EDT)]

Bill Powers (2002.10.07.1610 MDT)–

I’m getting the impression that some of the
rules you described are not really as precise as one might think.
That is, in some instances of saying a word the rule might clearly
describe what is observed, but in other instances it could not be seen to
apply, yet the word might seem as recognizable as ever. I found I could
make a pretty good “t” in the word “top” just by
playing a segment of the /ah/ vowel; the onset of the vowel was then
abrupt, and I could clearly hear the /t/. Unfortunately, imagination can
also supply the /t/, as I discovered in trying other things. It’s not
easy to get that imagination switch turned totally off.

What I gave was not rules, but descriptive facts. When an English
speaker actually pronounces top, their tongue actually makes a
closure against the alveolar ridge behind the teeth (closer to the teeth
for some dialects, e.g. of NYC). A sound spectrogram of their speech
consequently shows the formant transitions that I described –
“consequently”: as a matter of physiology, aerodynamics, and
acoustics. These acoustic facts do indeed obtain in every single instance
of a given configuration or sequence of configurations (articulations) of
the vocal tract.
That you hear a t when you play back an artificially truncated
sound segment is an entirely different matter. Yes, imagination plays a
great role in the perception of speech. Beginning students of phonetic
transcription, upon hearing words that end in a glottal stop in, say,
Navajo or Ute, typically transcribe them as ending with a t or other oral
stop. (In most American dialects, final stops are accompanied by glottal
closure.) In this is the difference between repetition and imitation.
Repetition is phonemic; imitation is phonetic.

…So it seems that people are not directly perceiving phonemic segments per se in the acoustic signal of speech, however convenient a segmental representation of phonemes may be analytically.

I think we need to think in levels here. Have you seen Rick’s experiment
with perceptions of different levels? He did them with visual patterns,
but there’s no reason we couldn’t do the same thing with audition, given
a collection of sound files that could be used in a “tracking”
experiment. In HPCT, there are perceptions going on at all levels at the
same time. The perceptions of lower order can change faster than those of
higher order. So it’s possible that auditory event recognition (words)
has a maximum speed of 7 - 9 per second, but perception of lower-order
aspects of sound could change much faster. One of the nice experiments
Rick did showed that a person could detect that different elements of a
pattern were present, but, above a certain speed of presentation, not the
order in which they occurred. It’s possible that people can distinguish
the presence of different phonemes at certain speeds of presentation but
still be unable to tell what words they make. Rick’s experimental design
might work for auditory tests. Would this be anything new in
linguistics?

I don’t know whether it is new or not. The literature is huge, and I am
far from it. (The literature. Or huge, for that matter.) At first blush,
it seems that this might be a contributor to transpositions (termed
metathesis in studies of language change and variation). Some children
transpose s when followed by a consonant to syllable final position:
stuck becomes ducks, etc. My youngest daughter did this.
(Not all children do this; not all rules are universal.) However, if this
were as fundamental and prevalent a factor as you suggest, how could
people ever get agreement on the ‘right’ sequence? How could these
children come to produce the adult pronunciations? Why doesn’t Katrina
still say nowsman for snowman? Optimality theory gives a
simpler explanation (concerning the relative rank of a constraint against
complex syllable onsets vs. faithfulness to what is heard from adults),
and this accounts not only for the minority-case metathesis but also for
the majority-case conformity to the norm.

It’s probably more like people catch enough to recognize the word shape,
without necessarily apprehending every feature, using imagination and
all sorts of contextual clues to pick which item out of a finite lexicon
was spoken. Note the difficulty with unfamiliar words, and the difficulty
learning a new language.

Sorry, got to go to bed. 6 to get said daughter up and breakfasted and
off to school comes altogether too early.

    /Bruce Nevin


[From Bill Powers (2002.10.08.0848 MDT)]
Bruce Nevin (2002.10.07 23:17 EDT)–
What I gave was not rules, but descriptive facts.
Well, then some of the descriptive facts are not necessarily 100%
accurate at least as far as speech recognition is concerned.

When an English speaker actually pronounces top, their tongue actually makes a closure against the alveolar ridge behind the teeth (closer to the teeth for some dialects, e.g. of NYC). A sound spectrogram of their speech consequently shows the formant transitions that I described – “consequently”: as a matter of physiology, aerodynamics, and acoustics. These acoustic facts do indeed obtain in every single instance of a given configuration or sequence of configurations (articulations) of the vocal tract.

Yes, if the sound is produced by a particular mechanical system, it will
have characteristics reflecting the properties of that system. But if it
is created some other way, it may have some different characteristics and
still be recognizable as the same utterance, just as printed letters can
be recognized in addition to handwritten ones (and, indeed, often more
easily). Also, words are produced by different speakers who have very
different sizes and shapes of vocal tracts, yet the result is still
recognizable as “the same utterance.”

That you hear a t when you
play back an artificially truncated sound segment is an entirely
different matter. Yes, imagination plays a great role in the perception
of speech. Beginning students of phonetic transcription, upon hearing
words that end in a glottal stop in, say, Navajo or Ute, typically
transcribe them as ending with a t or other oral stop. (In most American
dialects, final stops are accompanied by glottal closure.) In this is the
difference between repetition and imitation. Repetition is phonemic;
imitation is phonetic.

Not sure what that last distinction means.

Anyway, this is not entirely a matter of imagination; it’s also a matter
of how the word-recognizers are organized. What I’m suggesting is that we
could be so organized that either an abrupt truncation of a sound or a
burst of high-frequency sound will give rise to the same consonant
perception, the same signal out of the same input function. I don’t mean
to propose that as the actual case, only as an example to show what I
mean. I’m sure we also have to consider contrasts and other contextual
matters, including judgments of what word is most likely when several are
possible…

…So it seems that people are not directly perceiving phonemic segments per se in the acoustic signal of speech, however convenient a segmental representation of phonemes may be analytically.

I think we need to think in levels here. Have you seen Rick’s experiment
with perceptions of different levels?

I don’t know whether it is new or
not. The literature is huge, and I am far from it. (The literature. Or
huge, for that matter.) At first blush, it seems that this might be a
contributor [to] transpositions (termed metathesis in studies of language
change and variation). Some children transpose s when followed by a
consonant to syllable final position: stuck becomes ducks,
etc. My youngest daughter did this.

It might, but just as a test for levels of perception it could help us
understand word perception. I see something like this effect in listening
to short segments of my own speech. If I make the segment short enough I
cease to hear the typical vowel sound – there’s some indication that the
sound has to be present for some minimum time to be recognized.
Shortening the time certainly makes diphthongs harder to hear. Of course
that would put a limit on how fast different vowel sounds could occur and
still be distinguished.

The ordering of sounds might have a slower limit of detection. To
paraphrase Rick’s experiment, we could present edited utterances in which
the subject has to indicate which of two sequences was heard, or correct
a repeating sequence by pressing a button to restore the right sequence
after a switch to a different one. This would require writing a program
that could repeatedly play back one sound sequence or a different one and
switch back and forth as the experimental participant indicates – I
think I may be able to do that.
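
A bare-bones sketch of that program, assuming two pre-edited sequence files (placeholder names) and the sounddevice library for playback (any playback routine would do); for simplicity it waits for a keypress between repetitions rather than switching mid-playback:

```python
# Two short sound sequences loop continuously; a keypress flips which one is
# playing. Filenames and the use of sounddevice are assumptions.
import sounddevice as sd
from scipy.io import wavfile

fs_a, seq_a = wavfile.read("sequence_a.wav")   # placeholder files
fs_b, seq_b = wavfile.read("sequence_b.wav")
assert fs_a == fs_b
fs = fs_a

current = "a"
try:
    while True:
        data = seq_a if current == "a" else seq_b
        sd.play(data, fs)
        sd.wait()                               # finish one repetition
        answer = input("Enter = repeat, 's' = switch, 'q' = quit: ")
        if answer.strip().lower() == "s":
            current = "b" if current == "a" else "a"
        elif answer.strip().lower() == "q":
            break
except KeyboardInterrupt:
    pass
```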

(Not all children do this;
not all rules are universal.) However, if this were as fundamental and
prevalent a factor as you suggest, how could people ever get agreement on
the ‘right’ sequence?

Normally, we present sequences at a rate that is well within a listener’s
ability to distinguish. I’m talking about varying the rate to find where
the cutoff is, to see if that rate is different from what it is for other
elements of speech. This is a hint as to the relative level of
perception. Read pages 100 ff in More Mind Readings to see how Rick did
it with numbers (a pilot experiment that really ought to be done with
lots of subjects).

How could these children come to
produce the adult pronunciations? Why doesn’t Katrina still say
nowsman for snowman? Optimality theory gives a simpler
explanation (concerning the relative rank of a constraint against complex
syllable onsets vs. faithfulness to what is heard from adults), and this
accounts not only for the minority-case metathesis but also for the
majority-case conformity to the norm.

Sounds like a much more complicated explanation to me, but never mind. I
don’t think I’ve communicated exactly what I have in mind here. It’s
really much simpler.

Best,

Bill P.

[From Rick Marken (2002.10.08.1030)]

Bill Powers (2002.10.08.0848 MDT)

but just as a test for levels of perception it could help us understand word
perception. I see something like this effect in listening to short segments of
my own speech. If I make the segment short enough I cease to hear the typical
vowel sound -- there's some indication that the sound has to be present for some
minimum time to be recognized. Shortening the time certainly makes diphthongs
harder to hear. Of course that would put a limit on how fast different vowel
sounds could occur and still be distinguished.

One problem with studying auditory perception is that the duration of a waveform
affects its spectrum. So when you shorten a vowel you are also making it more
noise-like. So what you may be observing here is a spectral rather than a
perceptual effect. That's the problem with using the "presentation rate" method
in the auditory domain.

The ordering of sounds might have a slower limit of detection. To paraphrase
Rick's experiment, we could present edited utterances in which the subject has
to indicate which of two sequences was heard, or correct a repeating sequence by
pressing a button to restore the right sequence after a switch to a different
one. This would require writing a program that could repeatedly play back one
sound sequence or a different one and switch back and forth as the experimental
participant indicates -- I think I may be able to do that.

I agree that this would be a nifty experiment. But it would not be as easy as the
equivalent in the visual domain. You would have to compensate for the effects of
speed on the spectral composition of the speech components (which can be done but
takes some work).
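
One way to do that compensation is a phase-vocoder time stretch, which changes presentation rate while leaving the short-time spectrum roughly intact. A sketch, assuming the librosa and soundfile packages and a placeholder filename:

```python
# Change presentation rate without simply truncating the waveform (which, as
# noted above, changes its spectrum): a phase-vocoder time stretch keeps pitch
# and short-time spectral shape roughly constant while compressing duration.
import librosa
import soundfile as sf

y, sr = librosa.load("sequence_a.wav", sr=None)     # placeholder file
faster = librosa.effects.time_stretch(y, rate=2.0)  # 2x presentation rate
sf.write("sequence_a_2x.wav", faster, sr)
```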

Normally, we present sequences at a rate that is well within a listener's
ability to distinguish. I'm talking about varying the rate to find where the
cutoff is, to see if that rate is different from what it is for other elements
of speech. This is a hint as to the relative level of perception. Read pages 100
ff in _More Mind Readings_ to see how Rick did it with numbers (a pilot
experiment that really ought to be done with lots of subjects).

Actually, the numbers study was developed as an analogy to an _auditory_ version
that was done by Richard Warren (cited in the paper). What was interesting to me
was that the fastest sequence perception rate seemed to be the same for tones and
for visual patterns. But Warren was trying to get at the very problem you mention
here, though, of course, he does not look at it in terms of levels. I believe that
Warren did do some studies where the components of the auditory sequences were
speech sounds rather than tones. I think the results were the same in terms of
sequence perception. But I think different sequences of auditory elements can be
heard as different events even if the sequence of the elements cannot be
perceived. I imagine this would be true of visual sequences, too, if they were
presented in an analogous way.

I really should do that "levels of perception" experiment with several people.

Best regards

Rick


--
Richard S. Marken, Ph.D.
The RAND Corporation
PO Box 2138
1700 Main Street
Santa Monica, CA 90407-2138
Tel: 310-393-0411 x7971
Fax: 310-451-7018
E-mail: rmarken@rand.org

[From Peter Burke (2002.10.08 7:41 PDT)]

This discussion of language and meaning reminded me of the story of
"Ladle Rat Rotten Hut" who "lift on the ledge of a lodge, dock florist."
The full story can be found at
http://www.exploratorium.edu/exhibits/ladle/
Peter


Peter J. Burke
Professor
Department of Sociology
University of California
Riverside, CA 92521-0419
Phone: 909/787-3401
Fax: 909/787-0333
peter.burke@ucr.edu


[From Bill Powers (2002.10.09.1546 MDT)]

Peter Burke (2002.10.08 7:41 PDT)--

This discussion of language and meaning reminded me of the story of
"Ladle Rat Rotten Hut" who "lift on the ledge of a lodge, dock florist"
The full story can be found at
http://www.exploratorium.edu/exhibits/ladle/

Took me a while to realize it was audio only. I'd love to see a transcript.
Very funny, and also fascinating.

Best,

Bill P.

[From Bruce Nevin (2002.10.07 20:24 EDT)]

Bill Powers (2002.10.08.0848 MDT)–

Bruce Nevin (2002.10.07 23:17 EDT)–
What I gave was not rules, but descriptive facts.

Well, then some of the descriptive facts are not necessarily 100%
accurate at least as far as speech recognition is
concerned.

They are accurate descriptions of the articulation and acoustics
of careful speech (the samples on which the sound spectrograph
descriptions are based). In careful speech we attempt to control
articulatory and acoustic intentions completely and accurately. We
control with high gain. In more ordinary speaking, we control those
perceptions with less gain. This seems to be a result of conflict between
two kinds of aims, commonly expressed as minimization of confusion to the
hearer and minimization of effort by the speaker. What we find in typical
speech is a compromise varying somewhere between ultra-precise diction
and slurred incomprehensibility, such as we would expect from a
conflict.
The speaker has the following conflicting aims:

1. Minimize effort in articulation. Boersma says the speaker minimizes the number and complexity of gestures and coordinations. I suspect this is a side effect of one gesture interfering with another within the close timing constraints of speech.
2. Minimize perceptual confusion of utterances that have different meanings - maximize their contrast.

The listener has the following conflicting aims:

3. Minimize the effort needed to classify speech sounds into phonemic categories; use as few perceptual categories as possible. “In a world with large variations between and within speakers, the disambiguation of an utterance is facilitated by having large perceptual classes into which the acoustic input can be analyzed: it is easier to divide a perceptual continuum into two categories than it is to divide it into five.” (Boersma, Functional Phonology, p. 2)
4. Minimize miscategorization; maximize use of acoustic differences.

In addition:

5. The speaker and listener both intend to maximize the information that is successfully transmitted from the one to the other.

(5) conflicts with both (1) and (3). Decreased gain in speaking, and decreased discrimination of acoustic distinctions in listening, tend to reduce the amount of information successfully transmitted from speaker to listener. But (5) is not reducible to controlling with high gain; rather, controlling with high gain (by the speaker, the listener, or both) is a consequence of (5).

(1) often conflicts with (2) for the speaker. If the listener is confused whether you mean “the ladder” or “the latter” you can make the distinction, but ordinarily you do not.

The conflict of (3) with (4) for the listener is most evident when hearing a dialect different from one’s own, or when learning a new language. “Are you marred?” asked my co-worker in Florida. “What?” “Are you marred?” One beat. Two beats. “Oh! No, I’m not married.” (I wasn’t then.) My wife is from near the “wendy” city, and for her the pin is mightier than the sword. (Not making the i/e distinction, speakers in the Chicago area produce an intermediate sound. Likewise the distinction between baud and bah!, between pod and pa’d as in “Pa’d go if he could”.) Japanese speakers famously have difficulty learning to distinguish l and r in English. English speakers typically merge all the nasalized vowels of French into one indiscriminate honk.

… if the sound is produced by a particular
mechanical system, it will have characteristics reflecting the properties
of that system. But if it is created some other way, it may have some
different characteristics and still be recognizable as the same utterance
… Also, words are produced by different speakers who have very
different sizes and shapes of vocal tracts, yet the result is still
recognizable as “the same utterance.”

And these speakers do not all have the same reference values for
“the same” phonemes, or sometimes the same references for the
phonemes in “the same” words. Part of the complexity resisted
in (3) can be the complexity of maintaining someone else’s different
references for the same thing, in parallel to, and mapped to, one’s own.
One way of resisting that complexity is, over time, to come to speak in
that dialect which was once foreign to us. Another is to extend to this
wider diversity a skill that we all exercise in switching between
distinct sets of references. (Example: start interviewing someone
formally, and listen to how their talk changes when interrupted by a
phone call from their neighbor.) We ordinarily don’t notice such
switching, and indeed it is important for its social functioning that it
not be noticed.

If the presumption is that what is heard is an utterance in English, a
listener who knows English will try to force what is heard into the set
of possibilities that the language allows. Also, in the face of these
conflicts, and as means of controlling (5), the listener brings to bear
other kinds of information beyond the acoustic signal, starting with
phonotactics (the possible syllables in the language) and lexicon (the
inventory of possible words). If you find a match to a possible syllable,
and a partial match to the phonemes of that syllable, you
“hear” the phonemes.
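
As a toy illustration of that partial matching against a finite lexicon (the lexicon, the quasi-phonemic spellings, and the similarity measure below are stand-ins for illustration, not a model of what listeners actually do):

```python
# The heard string is compared with each lexical entry by a crude similarity
# measure, and the closest word wins. Degraded or truncated input still maps
# onto some item in the finite lexicon.
from difflib import SequenceMatcher

lexicon = ["top", "stop", "tap", "pot", "hot"]

def closeness(heard, word):
    return SequenceMatcher(None, heard, word).ratio()

def hear(heard):
    return max(lexicon, key=lambda w: closeness(heard, w))

print(hear("tohp"))    # degraded input -> "top"
print(hear("sdop"))    # -> "stop"
```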

…the difference
between repetition and imitation. Repetition is phonemic; imitation
is phonetic.

Not sure what that last distinction means.

Christine wanted me to imitate the ‘correct’ pronunciation of
Swedish as she perceives it. If she said an utterance meaning “my
shinbone hurts” in Swedish, Dag could repeat the same utterance
without imitating it at all, and indeed there would be differences in his
repetition. If I spoke Swedish, and she said an utterance meaning
“my shinbone hurts” in Swedish, I could repeat it rather than
imitate it. She might judge that I said it in a funny way or not quite
right, but she wouldn’t deny that I had repeated what she said. Because I
obviously and avowedly do not speak Swedish, and because of the
contextual emphasis on pronunciation, she was returning judgements on my
imitation of her pronunciation, not on my repetition of the
words. Imitation is phonetic; repetition is phonemic.
Suppose we take two utterances, preferably nearly the same (e.g.
heart vs. hart or heart vs. hearth), and have
a native speaker pronounce them randomly intermixed. We ask a second
speaker to guess which was said. Linguists doing this Test have found
that for pairs like heart vs. hart the guesses are right
about 50% of the time, and for pairs like heart vs. hearth
they are correct about 100% of the time. To reproduce the difference
between the second pair is simply a matter of repeating one or the other.
If one tries to reproduce differences between the first two, it’s in the
realm of imitation.
“But of course,” you say, “they’re different sounds”.
But remember, the r and l of Japanese are different sounds, but it
doesn’t matter whether you say arigato or aligato in Japanese. In many
languages the difference between t and th is not phonemic, so if they had
a word that sounded like “heart” or “hearth” those
would be two repetitions of the same word. Some children in a stage of
their learning English as their native language do not distinguish
l and y: less vs. yes. In due course, they
learn (but never yearn, unless maybe if they’re ridiculed).

How could these
children come to produce the adult pronunciations? Why doesn’t Katrina
still say nowsman for snowman? Optimality theory gives a
simpler explanation (concerning the relative rank of a constraint against
complex syllable onsets vs. faithfulness to what is heard from adults),
and this accounts not only for the minority-case metathesis but also for
the majority-case conformity to the norm.

Sounds like a much more complicated explanation to me, but never mind. I
don’t think I’ve communicated exactly what I have in mind here. It’s
really much simpler.

To clarify: assume each constraint is imposed by a control loop in the
speaker-listener. One loop is controlling the faithfulness of what is
pronounced to the memory of what is heard from adults. The remembered
acoustic image is “stuck”. The other constraint concerns what can be
in a syllable, or how complicated the articulations are permitted to be
at the beginning and end of a syllable (the syllable onset and coda);
syllable-initial st is too complicated, but syllable-final
ks is easy enough. (Could have tested this by having a
conversation with her about a stuck thing, two stuck
things, or the like, so that the s would close the preceding syllable
instead of being syllable-initial, but I didn’t think of it.) She was
capable of pronouncing st, just not at the beginning of a
syllable.
Every language has stateable regularities as to what is a possible
syllable. Linguists call this the phonotactics of the language.
Utterances that violate these regularities are difficult for speakers of
that language to pronounce, and in fact are ‘corrected’ to canonical
syllables. Beginning students of German sometimes have difficulty
learning to pronounce an initial ts as in zu, die
Zimmer
, and (worse yet) tsv as in zwei. English
syllables don’t begin with ts. A way to get them over the hump is
to demonstrate that an acceptable English expression, “Hats
off!”, has this ts in it. Then they can extrapolate to
“ha-tsoff”, hæ tsof. (Similarly, “pot o’
coffee” as a stepping stone to pronouncing the Spanish r in
Paraguay.)

Rick Marken (2002.10.08.1030) –

I believe that Warren did do some studies
where the components of the auditory sequences were speech sounds rather
than tones. I think the results were the same in terms of sequence
perception. But I think different sequences of auditory elements can be
heard as different events even if the sequence of the elements cannot be
perceived.

From what Lieberman said (quoted earlier) one would expect that words
and nonsense syllables would be perceived at a faster (segment per
second) rate than sequences of speech sounds that violate the
phonotactics (syllable constraints) of the subject’s language, and
possibly that such speech sounds in turn would be perceived at a faster
rate than non-speech segments can be perceived.

Rick Marken (2002.10.09.1620)–

I think Bruce Nevin, who has the linguistic
and programming expertise, should take the lead on this one, as you
suggest.

Linguistics, yes; programming expertise, no.

Bill Powers (2002.10.09.1546 MDT)–

“Ladle Rat Rotten Hut” who
“lift on the ledge of a lodge, dock florist”

The full story can be found at

http://www.exploratorium.edu/exhibits/ladle/

I’d love to see a
transcript.

Here:
http://www.exploratorium.edu/xref/exhibits/ladle_rat_rotten_hut.html
Also here:
http://drzeus.best.vwh.net/Ladle.html
where you’ll find a link to the rest of Anguish Languish, the
1956 book by H.L. Chace. The introduction alone is worth the price of
admission. (Exploratorium says 1940. Possibly that was the date of the
LLWH story. There are 5 copies of the book at
http://dogbert.abebooks.com/abep/BookSearch
and the four with date mentioned say 1956.)

    /Bruce Nevin


[From Bill Powers (2002.10.14.0843 MDT)]
Attachment: ah.jpg.

Bruce Nevin (2002.10.07 20:24 EDT)–

Well,
then some of the descriptive facts are not necessarily 100% accurate at
least as far as speech recognition is concerned.

They are accurate descriptions of the articulation and acoustics
of careful speech (the samples on which the sound spectrograph
descriptions are based). In careful speech we attempt to control
articulatory and acoustic intentions completely and accurately. We
control with high gain. In more ordinary speaking, we control those
perceptions with less gain. This seems to be a result of conflict between
two kinds of aims, commonly expressed as minimization of confusion to the
hearer and minimization of effort by the speaker.

“Conflict” is probably putting it too strongly. The two
requirements you mention can both be achieved with reasonable effort, and
at the same time, so there’s no conflict as I think of it. In my
definition, there is no conflict unless achieving one goal makes
achievement of the other literally impossible.

I’d like to say it this way: when noise levels are high, or in general
the listener has some difficulty in understanding, the sounds can be
brought closer to the center of the recognition range in the speaker’s
hearing, which presumably will also maximize the responses of the
listener’s input functions that are intended to respond, and minimize
those that are intended not to respond. This would be done, I suppose,
only if some error in communication occurred. Normally the recognition
machinery works quite well over a range of input waveforms.

I have resurrected my sound spectrograph algorithm from some years ago
and got it to work with Delphi. It produces much prettier spectra than
the Praat program does, but that could be because I don’t yet know how to
adjust Praat’s system. I believe that Praat uses a fast Fourier
transform, which is a neat way to do it. However, my way uses a set of
simulated tuned filters operating (effectively) in parallel, with the
advantage that the filters can have their bandwidth changed and also can
have their center frequencies independently adjusted. My interest for the
moment is in the perceptual input functions, and for that I have to know
more about the sound signal and the ear’s way of sorting it out in the
cochlea. Nothing new there, of course, but I’m still in the lab course of
Linguistics 101.

The attached JPEG file shows the spectrum of “Ah” with the
pitch of my voice jumping up and down by a fifth. The spectrogram actually
plots the output of each filter as a small sine-wave and also varies the
darkness according to the amplitude (white is zero).

The total range of frequencies from bottom to top is 2500 Hz, with tuned
filters every 10 Hz, for a total of 250 filters. The bandwidth of each
filter is relatively narrow, though I can’t say exactly what it is (yet).
The spectra of my voice show a fundamental frequency of about 100 Hz with
resonances every 100 Hz above that, and little energy at the frequencies
between the harmonics. When I vary the pitch of my voice all the
harmonics change frequency, too. As can be seen, the harmonics of the
sound are tied to the fundamental, with the mouth configuration
determining relative amplitudes at different frequencies. Saying
“AH”, there isn’t much intensity at the highest
frequencies.
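
For anyone who wants to try the same kind of thing without Delphi, here is one plausible way to set up such a parallel bank of tuned filters, using simple two-pole resonators at the spacing and range quoted above. This is not Bill’s code, just a sketch; the bandwidth figure, the smoothing constant, and the filename are assumptions:

```python
# A bank of narrow two-pole resonators spaced every 10 Hz up to 2.5 kHz, each
# producing a rectified, smoothed output envelope over time.
import numpy as np
from scipy.signal import lfilter
from scipy.io import wavfile

fs, x = wavfile.read("ah.wav")                  # placeholder filename
x = x.astype(float)
if x.ndim > 1:
    x = x.mean(axis=1)

def resonator_coeffs(fc, bw, fs):
    """Two-pole resonator: pole radius from bandwidth, angle from center freq."""
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * fc / fs
    b = [1.0 - r]                               # rough gain normalization
    a = [1.0, -2 * r * np.cos(theta), r * r]
    return b, a

centers = np.arange(10, 2510, 10)               # 250 filters, every 10 Hz
bandwidth = 10.0                                 # Hz, "relatively narrow"

# One-pole lowpass (~50 Hz cutoff) to smooth each filter's rectified output.
p = np.exp(-2 * np.pi * 50 / fs)
smooth_b, smooth_a = [1 - p], [1, -p]

# Each row of `spectro` is one filter's output envelope over time.
spectro = np.zeros((len(centers), len(x)))
for i, fc in enumerate(centers):
    b, a = resonator_coeffs(fc, bandwidth, fs)
    out = lfilter(b, a, x)
    spectro[i] = lfilter(smooth_b, smooth_a, np.abs(out))

# `spectro` can now be displayed with darkness proportional to amplitude.
```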

The idea I was pursuing seven or eight years ago was to try to remove the
pitch variations so we could get a clearer picture of the harmonic
relationships. The way I was going to do this was to find the fundamental
frequency and make the tuning of one filter track it, and then set up the
other filters at multiples of the (now-varying) frequency of the
fundamental. I have got this to work quite well, but am still debugging
it and am not ready to show the results.

When we can track the fundamental, the filters between harmonics are no
longer necessary, because there will never be any sound energy between
the harmonics (save for breathing, hisses, pops, clicks, and so forth).
All voiced sounds, therefore, can be represented by the outputs of 25 or
30 filters which are tracking the harmonics of the fundamental. This
greatly speeds up the program; in fact it can run much faster than real
time (on a 1.6 GHz computer). You see what I’m getting at – it might be
possible to play with real-time vowel recognition.
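
A sketch of the same end result by a different route: instead of a tracking filter, the fundamental is estimated frame by frame from an autocorrelation peak, and the spectrum of each frame is then sampled only at multiples of that estimate. Frame length, the F0 search range, the number of harmonics (25), and the filename are assumed values:

```python
# Pitch-normalized harmonic amplitudes: estimate F0 per frame by
# autocorrelation, then read the frame spectrum at multiples of F0.
import numpy as np
from scipy.io import wavfile

fs, x = wavfile.read("ah.wav")                   # placeholder filename
x = x.astype(float)
if x.ndim > 1:
    x = x.mean(axis=1)

frame_len = int(0.040 * fs)                      # 40 ms frames
n_harmonics = 25
f0_min, f0_max = 70, 300                         # plausible F0 search range (Hz)

def frame_f0(frame):
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / f0_max), int(fs / f0_min)
    lag = lo + np.argmax(ac[lo:hi])              # strongest periodicity
    return fs / lag

harmonic_tracks = []                             # one row of 25 amplitudes per frame
for start in range(0, len(x) - frame_len, frame_len // 2):
    frame = x[start:start + frame_len] * np.hanning(frame_len)
    f0 = frame_f0(frame)
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(frame_len, 1.0 / fs)
    amps = [spectrum[np.argmin(np.abs(freqs - k * f0))]
            for k in range(1, n_harmonics + 1)]
    harmonic_tracks.append(amps)

harmonic_tracks = np.array(harmonic_tracks)      # frames x harmonics, pitch-normalized
```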

What I’m hoping for is to find explanations of how we recognize sounds
that don’t involve “cognitive” explanations like the sort you
mentioned, in which it sounds as if the lower parts of the brain are
calculating strategies, suffering conflicts, and so forth. Higher
processes are no doubt involved in the overall picture of language
recognition, but I’d like to see how much can be done using only
lower-level processes at first (like the processes implied by Harris’
term “contrast”).

Best

Bill P.

(Attachment Ah.jpg is missing)