spin, sbin, spectra, and levels

[From Bill Powers (931006.0745 MDT)]

Bruce Nevin (various posts) --

It seems to me that we're on the verge of understanding something
about spin, sbin, and other phenomena. What's occurred to me is
that while we CAN become aware of various auditory components of
speech items (I give up on using the correct terms in the correct
place), the words or morphemes we perceive are not made up of any
simple functions of these components. We need at least a two-
level perceptual process to model the perception of speech items:
it's not a choice of sounds _vs_ phonemes _vs_ morphemes, as
though we have to choose among them to find _the_ determinant of
perceived speech. It's a question of (1) what phonemes (or basic
sounds) we are capable of discriminating, and (2) how these basic
sound-perceptions are then combined at a higher level of
perception to create perceptions of larger units. There may be
two or three stages in this process.

When I say the [b] sound, what happens is that my mouth and nose
passages close so there is no escape of air, and then I begin
forcing air past my vocal cords to produce a sort of muffled "uh"
sound. The air going past the vocal cords enters the mouth
cavity, building up back-pressure all the time; it is not
possible to prolong the sound more than a fraction of a second
because the air has no place to go. This is also true in saying
[d], [g], and [j]. Then I release the lips (or tongue) and the
sound becomes open, the passage of air now being free; this
affects the vibrations of the vocal cords, too, because now there
is no significant back pressure and air flows freely through
the open passage.

How do I know all this? I can feel it and hear it. This means
that these processes are represented in my perceptions. There is,
therefore, a level of perception of speech that occurs at this
level of detail. If there were not, I would not be able to hear
and feel these things.

In my sound spectrograms of these sounds, many details are lost,
apparently: the muffled "uh" sound doesn't come through clearly
(I hear it a lot better from inside than the microphone does from
outside). But looking at the spectrograms I can see the
attenuated sound prior to the opening of the lips, and it's clear
that [b] and [t] and [p] look rather different to the eye.

Unfortunately, what I'm finding with the spectra is what you
warned me about: the spectrograms don't seem to show nearly as
many differences as I can hear. I can easily hear the difference
between ee and ih, but in the spectrum there is only a slight
shift of appearance that hardly looks like enough to allow such a
clear difference to be detected. When I vary my voice pitch while
saying these sounds, the differences become even less. Martin and
Tom have both assured me that my spectrograms look quite like the
ones from professional equipment; the representations we see
published have been greatly worked over and cleaned up, but
still, apparently, don't provide clear discriminations among
sounds that are clearly different to the human ear/brain.
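
Here is the sort of computation the spectrograms rest on,
sketched in Python; the signals and "formant" frequencies are
invented stand-ins for measured speech, just to show why two
clearly different vowels can yield only slightly different
spectra:

```python
import numpy as np

def spectrogram(x, fs, frame_len=512, hop=128):
    """Short-time spectrum: one FFT magnitude per overlapping frame."""
    win = np.hanning(frame_len)
    frames = [x[i:i + frame_len] * win
              for i in range(0, len(x) - frame_len, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

# Two synthetic "vowels": same 120 Hz pitch, different spectral
# peaks (rough stand-ins for an ee/ih formant difference).
fs = 8000
t = np.arange(fs) / fs

def vowel(f1, f2):
    source = np.sign(np.sin(2 * np.pi * 120 * t))  # buzzy periodic source
    return (source
            + 0.5 * np.sin(2 * np.pi * f1 * t)
            + 0.3 * np.sin(2 * np.pi * f2 * t))

S_ee = spectrogram(vowel(300, 2300), fs)  # invented formant values
S_ih = spectrogram(vowel(400, 2000), fs)
```

The two arrays differ only in where a little extra energy sits
along the frequency axis -- consistent with the "slight shift of
appearance" that hardly seems enough to carry the distinction.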

In the contrast studies you describe, the contrast effects may be
due to the operation of higher-level systems. There may be
different perceptual functions assigned to "pin" and "bin" and
"spin", but no separate function assigned (yet) to "sbin". As a
result, pin and bin result in separate perceptual signals, but
spin and sbin result in a signal from only one perceptual
function because there is only one that responds to either spin
or sbin (slightly less to sbin). The contrast effect and the
lack-of-contrast effect are due to differences in higher-level
perceptual functions and how they process perceptual signals
representing individual sounds. Lack of contrast is not due to a
lack of lower-level perceptual representation of the different
sounds.
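
A toy sketch of that two-level arrangement -- all names and
weights invented for illustration, not a fitted model -- might
look like this:

```python
SOUNDS = ['s', 'p', 'b', 'i', 'n']

def lower_level(heard):
    """Lower level: one perceptual signal per discriminable sound."""
    return {s: float(s in heard) for s in SOUNDS}

def higher_level(sig):
    """Higher level: each word gets its own input function, but there
    is no separate function for 'sbin' -- the 'spin' function simply
    responds to either stop (a bit less to [b])."""
    return {
        'pin':  (1 - sig['s']) * sig['p'] * sig['i'] * sig['n'],
        'bin':  (1 - sig['s']) * sig['b'] * sig['i'] * sig['n'],
        'spin': sig['s'] * max(sig['p'], 0.8 * sig['b'])
                         * sig['i'] * sig['n'],
    }

print(higher_level(lower_level('spin')))  # 'spin' responds at 1.0
print(higher_level(lower_level('sbin')))  # same function, 0.8
print(higher_level(lower_level('pin')))   # 'pin' and 'bin' contrast
```

The contrast between pin and bin, and the lack of it between
spin and sbin, both fall out of the higher level alone; the
lower-level signals distinguish [p] from [b] in every case.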

But in order to make such discriminations possible -- and they
are possible even if people don't normally make them -- there
must be a well-defined and separated set of lower-order sound
perceptions available to higher-order perceptual functions. When
I report on my experiences in making the [b] sound, I'm not
attending at the higher levels, but at a very low level, below
the level where recognizable morphemes are present. It seems to
me that before we can make a model that behaves correctly at the
higher level, we must find the functions that will allow for fine
discriminations of sounds at the lower level. Once we have well-
separated lower-level perceptions, we can start working on
finding the higher-level perceptual functions that will produce
the observed contrast effects and other effects.

While we're at it, we have to consider not just speech but all
sound perception. I can easily discriminate among an oboe, a
clarinet, and a saxophone. I doubt that a sound spectrogram would
be able to show clearly the difference between these instruments
playing a passage in which pitch, loudness, and timbre vary in a
complex way within the same frequency range. I can hear the
difference between a book falling on a rug and a shoe falling on
a rug, between a child's voice and a man's voice. Morphemes and
phonemes are irrelevant in such contexts: the real problem is in
discriminating between qualities of sound at the lowest levels of
perception. And we are VERY good at this (in comparison with the
capabilities of our perceptual-function models).

I have played with a time-domain representation of speech sounds
to use in place of the spectrograms. There are definitely some
clearer discriminations among vowel sounds using this method. By
synchronizing a display using zero-crossings of a sine wave that
tracks the lowest voice frequency, I can create a stationary
oscillograph-like display of the voice waveform. Changing the
vowel causes some radical shifts in the fine structure of the
waveform, radical enough to permit finer discriminations than I
can see in the spectrograms. Unfortunately, when I vary the voice
pitch while holding the vowel constant, there are also radical
shifts in the fine structure, so this is still not the right
approach. Now there is TOO MUCH sensitivity to differences,
particularly to differences that aren't supposed to matter.
Perhaps the frequency information contained in the
frequency-tracking system could be used to insert compensating
effects in the waveform-discriminating system. But that begins
to sound complicated.
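
The zero-crossing synchronization can be sketched as follows;
the test signals are invented, and a simple smoothing step
stands in for the sine wave that tracks the lowest voice
frequency:

```python
import numpy as np

def rising_zero_crossings(x):
    """Sample indices where the signal crosses zero going positive."""
    return np.where((x[:-1] < 0) & (x[1:] >= 0))[0]

def pitch_sync_cycles(x, fs, f0):
    """Cut the waveform into one-period slices, synchronized on
    rising zero crossings of a smoothed copy of the signal."""
    k = max(1, int(fs / f0 / 2))           # smooth over half a period
    smooth = np.convolve(x, np.ones(k) / k, mode='same')
    period = int(fs / f0)
    starts = rising_zero_crossings(smooth)
    return np.array([x[i:i + period]
                     for i in starts if i + period <= len(x)])

fs = 8000
t = np.arange(fs) / fs
# Same 100 Hz pitch, different harmonic mix = different fine structure.
v1 = np.sin(2*np.pi*100*t) + 0.6*np.sin(2*np.pi*300*t)
v2 = np.sin(2*np.pi*100*t) + 0.6*np.sin(2*np.pi*500*t)
c1 = pitch_sync_cycles(v1, fs, 100)
c2 = pitch_sync_cycles(v2, fs, 100)
```

Overlaying the rows of c1 gives the stationary
oscillograph-like display; the average cycle shape shifts when
the harmonic content shifts -- and, as noted above, it shifts
just as radically when the pitch changes.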

You mentioned a "correlogram" method that someone was presenting.
Could you get some details on how it works? Is it just a running
time-delayed cross-correlation, or some kind of fancy filter? The
time-domain approach looks, right now, more promising than the
spectrogram approach. Perhaps they could be combined; perhaps the
tuning characteristics of the cochlea and its receptors would
provide some pre-processing, which when followed by a time-domain
treatment would result in the right discriminations without so
much sensitivity to the wrong ones. Perhaps, too, we need
frequency-tracking to remove pitch effects.
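
Not knowing how that correlogram actually works, my guess at the
simplest version -- a running, time-delayed autocorrelation of
each frame -- would be something like this (frame sizes and
signal invented):

```python
import numpy as np

def running_autocorrelation(x, frame_len=400, hop=200, max_lag=120):
    """One normalized autocorrelation profile per frame: each frame
    is correlated with a time-delayed copy of itself, lag by lag."""
    rows = []
    for i in range(0, len(x) - frame_len - max_lag, hop):
        f = x[i:i + frame_len]
        g = x[i:i + frame_len + max_lag]
        r = np.array([np.dot(f, g[lag:lag + frame_len])
                      for lag in range(max_lag)])
        rows.append(r / r[0])              # r[0] is the frame energy
    return np.array(rows)

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2*np.pi*100*t) + 0.5*np.sin(2*np.pi*300*t)
C = running_autocorrelation(x)
peak_lag = 1 + np.argmax(C.mean(axis=0)[1:])  # skip trivial lag-0 peak
# peak_lag comes out at 80 samples: the 100 Hz period at fs = 8000
```

The profile returns to its full value at the pitch period no
matter where in its cycle the waveform starts, which is one
reason a lag-domain picture might serve the pitch-removal idea.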

I think we have to reproduce the human capacity to distinguish
among qualities of sounds before we can make sense of higher-
level phenomena like the contrast effects. The mere fact that
there is failure to discriminate at the higher levels says
nothing about lower-level discrimination. If the lower-level
discriminations are EVER made, then we have to assume that they
are ALWAYS made, and that any lack of one-to-one correspondence
with words is due to the way higher-level perceptual functions
are organized. Failure to distinguish between spin and sbin
simply says that the same perceptual function responds to either
one; the input function treats different sets of input signals as
equivalent. I think that by now a lot of participants in csg-l
will have no trouble making that distinction; they have
reorganized that particular input function into two input
functions just through repeatedly mumbling spin and sbin and
knowing that they are supposed to sound different.

Linguists obviously have learned to make some very fine
discriminations of sound qualities -- how else would they know
about all the ways of saying "p" and so on? Those are lower-level
discriminations, and suggest strongly that the lower-level
systems provide a rich source of well-separated sensation-level
perceptions. At the higher levels where we combine those lower-
level signals into morphemes, many subsets of actually-different
lower-level perceptions are treated as equivalent and lead to
perception of the same word or morpheme, albeit with some analog
differences that we call "accents" or "dialects" (consider all my
uses of technical terms to be followed with a (?)).

If linguists can make these distinctions at low levels of
perception, then the human brain can make these distinctions, and
probably does. There is no point in giving up on phonemes just
because of the contrast phenomenon. We're talking about different
levels of perception.



Bill P.