13.2111, Disc: Accuracy in Speech Recognition: Priorities

Fri Aug 16 17:24:38 UTC 2002

LINGUIST List:  Vol-13-2111. Fri Aug 16 2002. ISSN: 1068-4875.

Subject: 13.2111, Disc: Accuracy in Speech Recognition: Priorities

Moderators: Anthony Aristar, Wayne State U. <aristar at linguistlist.org>
            Helen Dry, Eastern Michigan U. <hdry at linguistlist.org>

Reviews (reviews at linguistlist.org):
	Simin Karimi, U. of Arizona
	Terence Langendoen, U. of Arizona

Consulting Editor:
        Andrew Carnie, U. of Arizona <carnie at linguistlist.org>

Editors (linguist at linguistlist.org):
	Karen Milligan, WSU 		Naomi Ogasawara, EMU
	James Yuells, EMU		Marie Klopfenstein, WSU
	Michael Appleby, EMU		Heather Taylor, EMU
	Ljuba Veselinova, Stockholm U.	Richard John Harvey, EMU
	Dina Kapetangianni, EMU		Renee Galvis, WSU
	Karolina Owczarzak, EMU		Anita Wang, EMU

Software: John Remmers, E. Michigan U. <remmers at emunix.emich.edu>
          Gayathri Sriram, E. Michigan U. <gayatri at linguistlist.org>
          Zhenwei Chen, E. Michigan U. <zhenwei at linguistlist.org>

Home Page:  http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.

Editor for this issue: Karen Milligan <karen at linguistlist.org>

=================================Directory=================================

1)
Date:  Fri, 16 Aug 2002 10:49:09 +0100
From:  "David Horowitz" <dhorowitz at voxgeneration.com>
Subject:  RE: 13.2065, Disc: Accuracy in Speech Recognition: Priorities

-------------------------------- Message 1 -------------------------------

Date:  Fri, 16 Aug 2002 10:49:09 +0100
From:  "David Horowitz" <dhorowitz at voxgeneration.com>
Subject:  RE: 13.2065, Disc: Accuracy in Speech Recognition: Priorities

I agree with Dr. Sproat's comments (Linguist 13.2065), but wanted to
add my own.

Statistical methods do incorporate some linguistic constraints, and
acoustic front ends encompass some of the acoustic phonotactics.  In
my own research, we have considered the detection of prosody for
spoken dialog systems.  Probably the best example is some of the
research of Roberto Pieraccini on the detection of negative emotional
states, to ascertain whether the user is frustrated with the system,
perhaps because of poor performance.  However, as Dr. Sproat points
out, prosody detection is not high on the priority list of methods to
focus on in the hopes of producing improved performance.

I am enthusiastic that there is a move in the stochastic speech
community to begin thinking about prosody.  However, the problem
should not be treated solely as a matter for stochastic algorithms.
If you look at the work of Professor Mari Ostendorf, she begins to
examine the application of distinctive feature theory (posited by
Keyser and Stevens).  Moreover, the Ph.D. thesis of Dr. Mark Johnson
looks at a feature detector for the front end of a speech recogniser.
They posit that, for improved performance, we need to embed more
knowledge of the speech signal.  While statistical approaches have
proven powerful for commercial recognisers, I believe it is fruitful
and timely to begin to re-examine the literature of traditional speech
science and acoustic phonetics (see Stevens' textbook, Acoustic
Phonetics, MIT Press).

The fact that people are worried about improved measures of
performance also indicates that the traditional acoustic modelling
techniques of speech have a role.  A well-known speech synthesis
scientist commented to me that examining a spectrogram does not tell
the scientist how to measure voice quality and naturalness.  However,
Klatt and Klatt (1987) and Helen
Hanson and Ken Stevens have shown reliable acoustic measures that
reflect voice quality and Klatt showed that this model works by
identically resynthesising human speech with a formant synthesiser.
Measures such as spectral tilt, formant bandwidth and glottal open
quotient need to be modelled for any work to be done in prosody.
These parameters change dynamically with time, especially at phrase
boundaries.  It is a little difficult for me to understand how the
prosody problem can be solved using purely statistical approaches when
subtle spectral changes account for the quality of naturalness or
emotional state.  Furthermore, this line of investigation holds much
promise for speech recognition, both in the acoustic front end and in
improved benchmarks of system performance.

At Vox Generation, we have worked on an extension of Abney's work
(phrase chunking) and Hirschberg's work (automatic prosody marking)
towards the end of an improved linguistic model for prosody generation
of synthetic speech.  The new research we are pursuing involves taking
these ToBI marks, which the automatic marking overgenerates, and
selecting the appropriate mark to add intelligence to the unit
selection mechanism.  However, I raise the question of whether
traditional unit selection techniques can be trained to interpret
these marks of prosody.

David Horowitz
Executive Chief Scientist
Vox Generation Ltd, London
www.voxgeneration.com

---------------------------------------------------------------------------
LINGUIST List: Vol-13-2111