speech synthesis ref

bnhpct · January 25, 1995, 6:54pm

(Bruce Nevin Wed 930125 12:54:08 EST)

Bill Powers (950114.0750 MST)

I don't think that metaphors like "paths through familiar degrees of
freedom" are very helpful.

Thanks for calling me on that. It's easy to forget that metaphors tend
to take on a life of their own -- so to speak. ;->

-=+=-=+=-=+=-=+=-=+=-=+=-=+=-=+=-=+=-=+=-=+=-=+=-=+=-=+=-=+=-=+=-=+=-=+=-

It appears to me that some approaches to speech synthesis involve
engineering solutions that are difficult to adapt for modelling what
human beings do. When I say "engineering solutions," I mean methods
analogous to what I am seeing described here as the norm for much of
robotics. An exception appears to be the synthesizer described in:

Klatt, Dennis H. and Laura C. Klatt. 1990. Analysis, synthesis, and
       perception of voice quality variations among female and male
       talkers. _The Journal of the Acoustical Society of America_
       87:2.820-857.

(A note says that Dennis Klatt died at the end of December 1988, and that
Laura Klatt was a summer research assistant in 1987. I don't know the
status of this equipment.)

The emphasis of the research reported in this paper is on voice quality.
Perhaps it is for that reason that a "glottal" sound source and various
resonators and other modifiers of its output, with their controllable
parameters and constants, are arranged as an analog of the vocal tract.

This "KLSYN88 Cascade/Parallel formant synthesizer" controls separate
resonators for each formant above the first. It does so separately for
laryngeal sound sources, where F1 through F5 are in series (cascade), and
for frication noise (F2 through F6 in parallel). I do not understand yet
why there is also a parallel arrangement of formant resonators for F1
through F4 for laryngeal sound sources, noted as "normally not used," nor
do I know whether the separation is an artifact of the block diagram, and
in fact the same device is used for e.g. the F2 resonator in all cases.
When I read the article I may understand all this better.
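
To make the cascade/parallel distinction concrete, here is a minimal Python
sketch. It is not the KLSYN88 code. It assumes a two-pole digital resonator
of the kind commonly used in formant synthesizers,
y[n] = A x[n] + B y[n-1] + C y[n-2], and the function names, formant
frequencies, bandwidths, gains, and sample rate are all illustrative
assumptions, not values from the article.

```python
import math

# Illustrative (frequency Hz, bandwidth Hz) pairs; NOT taken from the article.
FORMANTS = [(500.0, 60.0), (1500.0, 90.0), (2500.0, 150.0),
            (3500.0, 200.0), (4500.0, 250.0)]
SAMPLE_RATE = 10000.0  # assumed sample rate in Hz


def resonator_coeffs(freq_hz, bw_hz, sample_rate):
    """Coefficients of a two-pole resonator with unity gain at 0 Hz."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c


def resonate(signal, freq_hz, bw_hz, sample_rate):
    """Run one resonator over a list of samples."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, sample_rate)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def cascade(signal, formants, sample_rate):
    """Series arrangement: each resonator filters the previous one's output."""
    for freq_hz, bw_hz in formants:
        signal = resonate(signal, freq_hz, bw_hz, sample_rate)
    return signal


def parallel(signal, formants, gains, sample_rate):
    """Parallel arrangement: every resonator sees the source; outputs are summed."""
    out = [0.0] * len(signal)
    for (freq_hz, bw_hz), gain in zip(formants, gains):
        branch = resonate(signal, freq_hz, bw_hz, sample_rate)
        out = [o + gain * s for o, s in zip(out, branch)]
    return out
```

In the cascade form the relative formant amplitudes follow from the
frequencies and bandwidths alone, while the parallel form needs an explicit
gain per formant. That difference is one plausible reason a design would keep
both arrangements, though the sketch says nothing about why KLSYN88 does.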

In most cases, as I understand it, the formants cannot be controlled
independently of one another. The coupling between them is a function of
the physical (acoustical) properties of the vocal tract--a part of the
environment, as I take it. It is for this reason that F0 (pitch) and the
first two formants, F1 and F2, suffice as cues for vowel perception,
though not for synthesis of natural-sounding speech. It would be nice to
have some sort of vocal-tract function that realistically provides this
aspect. Perhaps then only one input would need to be controlled to shift
the frequencies of all the formants at once.
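
The simplest textbook case of such a function is a uniform tube closed at
the glottis and open at the lips, whose resonances fall at odd multiples of
a quarter wavelength: F_n = (2n - 1) c / (4 L). The sketch below assumes
that idealization, a speed of sound of 35000 cm/s, and a made-up function
name. It only shows how a single input (tube length) moves every formant
together; a realistic vocal-tract function would need the whole area
function, not just the length.

```python
def tube_formants(length_cm, count=3, speed_cm_s=35000.0):
    """Resonances (Hz) of an idealized uniform tube, closed at one end.

    Assumption: quarter-wavelength model, F_n = (2n - 1) * c / (4 * L).
    """
    return [(2 * n - 1) * speed_cm_s / (4.0 * length_cm)
            for n in range(1, count + 1)]


# A 17.5 cm tube gives roughly 500, 1500, 2500 Hz.
long_tract = tube_formants(17.5)
# Shortening the tube to 14 cm raises every formant by the same ratio (1.25).
short_tract = tube_formants(14.0)
```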

Klatt and Klatt make the interesting observation that voice quality
apparently needs to vary over the course of speaking for the speech to
sound natural.

Got to go home and nurse this flu.

Bruce

_Martin_Taylor · January 26, 1995, 5:07pm

[Martin Taylor 950126 12:00]

Bruce Nevin Wed 930125 12:54:08

Klatt, Dennis H. and Laura C. Klatt. 1990. Analysis, synthesis, and
      perception of voice quality variations among female and male
      talkers. _The Journal of the Acoustical Society of America_
      87:2.820-857.

(A note says that Dennis Klatt died at the end of December 1988, and that
Laura Klatt was a summer research assistant in 1987. I don't know the
status of this equipment.)

The Klatt synthesizer is, as I understand it, available in pure software
form by ftp from a speech archive named in the FAQ for comp.speech. There's
a machine whose name I've forgotten that holds many of the different FAQs.
I'm not going to look it up right now, for reasons that may be apparent to
you, viz.:

Got to go home and nurse this flu.

Did that yesterday, with only partial success. Should do it again today.

This "KLSYN88 Cascade/Parallel formant synthesizer" controls separate
resonators for each formant above the first. It does so separately for
laryngeal sound sources, where F1 through F5 are in series (cascade), and
for frication noise (F2 through F6 in parallel). ...
When I read the article I may understand all this better.

Developers of formant synthesizers differ on what (if anything) should be
in series and what (if anything) should be in parallel. Don't put too much
effort into worrying about it. Proponents of each approach claim that their
way sounds best.

It would be nice to have some sort of vocal-tract function that
realistically provides this aspect.

One way of looking at LPC analysis is exactly as being this kind of vocal
tract area function. I have never liked LPC analysis, for all its popularity,
because (like Fourier transformation) it -- in most of its forms -- equalizes
the frequency-time tradeoff throughout the spectrum, and that's not what
the auditory system does.
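
The link between LPC and an area function runs through the reflection
coefficients produced by the Levinson-Durbin recursion: under a lossless
concatenated-tube model, each coefficient corresponds to the area ratio of
two adjacent tube sections. Here is a minimal Python sketch of the
autocorrelation method, with made-up function names. The sign and direction
convention in area_ratios is an assumption, since texts differ on it, and
the input is a toy signal rather than speech.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation of a list of samples for lags 0..max_lag."""
    return [sum(x[i] * x[i + lag] for i in range(len(x) - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations.

    Returns (a, ks, err): a holds the coefficients of the inverse filter
    A(z) = 1 + a[1] z^-1 + ..., ks the reflection coefficients, and err
    the final prediction-error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    ks = []
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        ks.append(k)
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, ks, err


def area_ratios(ks):
    """Adjacent-section area ratios; the convention here is an assumption."""
    return [(1.0 - k) / (1.0 + k) for k in ks]


# Toy input: a decaying exponential, i.e. a one-pole "tract" with pole 0.9.
signal = [0.9 ** n for n in range(400)]
a, ks, err = levinson_durbin(autocorrelation(signal, 2), 2)
# a[1] comes out close to -0.9 and a[2] close to 0 for this input.
```

Note that every lag of the autocorrelation is computed over the same window,
which is where the uniform time-frequency resolution comes from.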

Martin