[From: Bruce Nevin (Tue 931026 09:35:58 EDT)]
( Rick Marken (931022.1530) ) --
Rick, I apologize. Being swamped with finishing a programmer's manual
(and having lost my glasses two weeks ago!), I actually had not read your
message. My mailbox is almost up to 400 messages, many of them unread.
I'm trying to go back and clean up as well as stay current, but I'm not
there yet.
The acoustic signal in which we recognize discrete words seems to be made
up of continuous sound-wave properties, continuously varying through the
course of an utterance, and variable across instances of what hearers
judge to be repetitions of the same words. All attempts to identify
reliable invariants signalling phonemes in the speech stream have
achieved only very limited success. Computational speech recognition
systems typically generate a set of word-strings as candidates to match a
given acoustic input (perhaps breaking the input into words at different
points in different candidates) by looking for matches in a word list.
They then apply other criteria to the context so as to assign relative
probabilities to these candidate word-strings. The words that they look
up are discrete by virtue of being spelled in alphabetic symbols
(letters).
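To make that concrete, here is a toy sketch in Python. The phone
labels, the word list, and the bigram scores are all invented for
illustration; no real recognizer works from tables this small.

# Toy sketch of dictionary-driven candidate generation: every way of
# breaking a phone string into known words is a candidate word-string,
# and a crude context score then ranks the candidates.
LEXICON = {
    "DH AH": "the",
    "M AE": "ma",
    "M AE N": "man",
    "N": "n",            # a fragment, included to force ambiguity
}

def candidates(phones):
    """Return every segmentation of `phones` into lexicon entries."""
    if not phones:
        return [[]]
    results = []
    for i in range(1, len(phones) + 1):
        prefix = " ".join(phones[:i])
        if prefix in LEXICON:
            for rest in candidates(phones[i:]):
                results.append([LEXICON[prefix]] + rest)
    return results

# Invented bigram "plausibility" scores standing in for the other
# contextual criteria mentioned above.
BIGRAM = {("the", "man"): 0.9, ("the", "ma"): 0.1}

def score(words):
    return sum(BIGRAM.get(pair, 0.01) for pair in zip(words, words[1:]))

phones = "DH AH M AE N".split()
for cand in sorted(candidates(phones), key=score, reverse=True):
    print(round(score(cand), 2), cand)

Note that the two candidates it prints break the same input into words
at different points, as described above.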
Perhaps something roughly analogous does indeed take place in a human
perceptual control hierarchy. People like to assume this. If so, the
words that are "looked up" for a match to perceived sound-features are
discrete by virtue of being represented by discrete elements, one for
each point of contrast in the word. These discrete representations of
contrasts are the phonemic elements or phonemes. Sound-features are
associated with the phonemic elements. Not all the sound-features
associated with a given phonemic element typically are present in the
acoustic signal in which a hearer perceives that phoneme. The hearer
appears to correct her perception of the acoustic signal, perceiving as
present sound features that actually are absent, and perceiving as absent
features that actually are present.
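A minimal sketch of that kind of matching, assuming a made-up feature
inventory and a three-element phonemic "alphabet". The point is only
that the winning phonemic element need not match the signal
feature-for-feature; missing and spurious sound-features are tolerated.

PHONEMES = {
    "n": {"nasal", "alveolar", "stop", "voiced"},
    "d": {"alveolar", "stop", "voiced"},
    "m": {"nasal", "labial", "stop", "voiced"},
}

def best_phoneme(observed):
    """Pick the phonemic element whose feature bundle best overlaps
    the observed sound-features, penalizing mismatches both ways."""
    def fit(expected):
        return len(expected & observed) - len(expected ^ observed)
    return max(PHONEMES, key=lambda p: fit(PHONEMES[p]))

# A nasal offset with no actual oral closure in the signal: "stop" is
# absent, yet "n" still wins -- the hearer perceives as present a
# sound-feature that actually is absent.
print(best_phoneme({"nasal", "alveolar", "voiced"}))   # -> n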
An example is nasalization of a vowel adjacent to a nasal stop (m, n, ng).
The velum does not close promptly when the oral closure is released
for the vowel, and it begins to open again before the oral opening for
the vowel is narrowed to a stop closure for the following consonant.
If someone says "Hey, ma~" (with nasalized vowel but no oral stop) a
hearer hears "Hey, man!", and would assert that that was what the speaker
actually pronounced, with the nasal stop n. This also happens with
nonsense syllables, so the normalization is not based on "looking up" the
canonical phonemic shape of the word in memory.
If the nasalized vowels in syllables ending in nasal consonants are
excised and played back in isolation, they are perceived as strongly
nasalized, like vowels in French. If they are played back in context,
they are perceived as not nasalized. The nasalization on the vowel is
perceived as a feature of the consonant.
The phonemic contrasts of English partition the articulatory/acoustic
space available to all humans in a way that is different for English than
it is for French or Lakota. In English, nasalization is contrastive only
for consonants. Utterances do not contrast with one another with respect
to nasality of vowels. That is a convention that people learning English
pick up. In French or Lakota, nasalization is contrastive for vowels as
well as for consonants. The nonsense syllables ma ma~ ma~n contrast with
one another. (I do not know of a language that contrasts ma ma~ man
ma~n.) In English, nasalization of a vowel is attributed to an adjacent
consonant. If all the vowels in an utterance are nasalized, then an
intonation of nasalization is perceived (which would be "talking funny").
But nasalization is not attributed to a phonemic contrast between one
word with nasalized vowel and another word with clear, non-nasalized
vowel.
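Here is a toy rendering of that convention; the feature tags, the
segment format, and the supply-the-stop rule are invented. The same
nasalized-vowel token is phonemicized one way under the English
partition and another way under the French one.

def phonemicize(segments, language):
    """Map (segment, features) pairs to phonemic elements under a
    language-specific partition of the nasality feature."""
    out = []
    for i, (seg, features) in enumerate(segments):
        if seg == "a" and "nasal" in features:
            if language == "french":
                out.append("a~")          # vowel nasality is contrastive
            else:                         # english
                out.append("a")           # nasality not contrastive here
                nxt = segments[i + 1][0] if i + 1 < len(segments) else None
                if nxt not in ("m", "n"):
                    out.append("n")       # attribute it to a nasal stop
        else:
            out.append(seg)
    return out

# "Hey, ma~": nasalized vowel, but no oral stop actually pronounced.
utterance = [("m", {"nasal"}), ("a", {"nasal"})]
print(phonemicize(utterance, "english"))   # ['m', 'a', 'n'] -- "man"
print(phonemicize(utterance, "french"))    # ['m', 'a~']     -- "ma~"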
The partitioning of the humanly available articulatory/acoustic space is
exhaustive. Cut the oe vowel from German pronunciation of Goethe and
splice it in place of the o vowel of English "road" and an English
hearer, say from the US midwest, perceives the English word "road". It's
a kind of funny pronunciation, but hey, there are all kinds of funny
accents out there, he said "road" I tell you. And his repetition of
the heard word would have his own English o vowel, not the
German oe vowel. Splice the o vowel from an upper-class English
pronunciation of "road" into the German name Goethe, and a German hearer
will hear a somewhat peculiar German pronunciation of Goethe (peculiar
because of the diphthongization, a "w" quality after the main vowel). Splice
the o from our midwesterner's pronunciation of "road" into Goethe and the
German hears a different word, I don't know what it would be (Gotte has a
lower vowel, like English "got"), but it's definitely in contrast with
Goethe. Splice the oe from Goethe into the midwesterner's "road" and the
midwesterner might very well hear "raid" with a funny kind of accent.
Put the w-glide after it, and we English speakers hear a British kind of
pronunciation of "road". We shove these sound features into one of the
phonemic elements into which we have learned to partition the
articulatory/acoustic space. The partitioning is conventional, different
for different languages. The space that is available for pronunciations
is universal to all humans (setting aside physical defects, and even they
are compensated in fascinating ways).
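One way to picture the "shoving" is nearest-category classification in
formant space. The F1/F2 values below are rough textbook ballparks,
not measurements, and real phonemic categories are learned regions
rather than points; but they are enough to show the German oe landing
nearer the vowel of "raid" than the vowel of "road" for an English
hearer.

import math

ENGLISH_VOWELS = {                 # (F1, F2) in Hz, very approximate
    "o as in 'road'":  (450, 1000),
    "ey as in 'raid'": (450, 2000),
    "a as in 'got'":   (700, 1100),
}

def nearest_category(f1, f2, inventory):
    """Assign an incoming vowel token to the closest native category."""
    return min(inventory, key=lambda v: math.dist((f1, f2), inventory[v]))

# German oe (Goethe): a front rounded vowel, roughly (400, 1600).
# Its high F2 pulls it into front-vowel territory for an English hearer.
print(nearest_category(400, 1600, ENGLISH_VOWELS))   # -> "ey as in 'raid'"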
>I presume you mean that I hear many different acoustical variants
>of the same word as being the same and each different word as being
>discretely different because of "the way phonemic contrasts
>exhaustively partition the available articulatory space". In other
>words, these phonemic contrasts are what break a continuum of acoustical
>signals into the discrete events that we call words. Is this right?
>
>So we have:
>
>  a1  a2  a3  a4  a5  a6  a7  a8  a9  ...  an
>  |   |   |   |   |   |   |   |   |        |
>  w1  w1  w2  w2  w2  w2  w3  w4  w4  ...  wn
>
>where the a.i are different acoustic signals, the w.i are the discrete
>word perceptions that correspond to each, and the "|" show the mapping
>of a.i values to w.i values. The a.i actually vary on a continuum
>-- so there is an acoustical event between a1 and a2, etc. The w.i, on
>the other hand, vary discretely -- there is no word perception
>"between" w1 and w2, etc. This figure represents what I think you
>mean by "the discreteness of words". Is this right?
Yes.
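To put your figure in runnable form (the boundary values are invented):
the a values vary continuously, but the mapping is a step function, so
there is nothing "between" w1 and w2 on the output side.

import bisect

BOUNDARIES = [2.5, 6.5, 7.5]        # arbitrary cut points on the continuum
WORDS = ["w1", "w2", "w3", "w4"]    # one discrete percept per region

def percept(a):
    """Quantize a continuous acoustic value into a discrete word percept."""
    return WORDS[bisect.bisect(BOUNDARIES, a)]

for a in [1.0, 2.0, 3.0, 4.5, 5.0, 6.0, 7.0, 8.0, 9.0]:
    print(f"a = {a:3.1f} -> {percept(a)}")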
Bruce
bn@bbn.com