13.2065, Disc: Accuracy in Speech Recognition: Priorities

Sat Aug 10 18:51:01 UTC 2002

LINGUIST List:  Vol-13-2065. Sat Aug 10 2002. ISSN: 1068-4875.

Subject: 13.2065, Disc: Accuracy in Speech Recognition: Priorities

Moderators: Anthony Aristar, Wayne State U.<aristar at linguistlist.org>
            Helen Dry, Eastern Michigan U. <hdry at linguistlist.org>

Reviews (reviews at linguistlist.org):
	Simin Karimi, U. of Arizona
	Terence Langendoen, U. of Arizona

Consulting Editor:
        Andrew Carnie, U. of Arizona <carnie at linguistlist.org>

Editors (linguist at linguistlist.org):
	Karen Milligan, WSU 		Naomi Ogasawara, EMU
	James Yuells, EMU		Marie Klopfenstein, WSU
	Michael Appleby, EMU		Heather Taylor, EMU
	Ljuba Veselinova, Stockholm U.	Richard John Harvey, EMU
	Dina Kapetangianni, EMU		Renee Galvis, WSU
	Karolina Owczarzak, EMU		Anita Wang, EMU

Software: John Remmers, E. Michigan U. <remmers at emunix.emich.edu>
          Gayathri Sriram, E. Michigan U. <gayatri at linguistlist.org>
          Zhenwei Chen, E. Michigan U. <zhenwei at linguistlist.org>

Home Page:  http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.

Editor for this issue: Karen Milligan <karen at linguistlist.org>

=================================Directory=================================

1)
Date:  Fri, 9 Aug 2002 10:53:57 -0400
From:  Richard Sproat <rws at research.att.com>
Subject:  Re: 13.2050, Disc: Accuracy in Speech Recognition: Priorities

-------------------------------- Message 1 -------------------------------

Date:  Fri, 9 Aug 2002 10:53:57 -0400
From:  Richard Sproat <rws at research.att.com>
Subject:  Re: 13.2050, Disc: Accuracy in Speech Recognition: Priorities

The system Steven Roberts describes sounds interesting, but I don't
see how it relates to the point I was addressing in my comment on the
NY Times article.

I briefly repeat my point: an uninformed reader of the New York Times
article would come away thinking that the main problem in speech
recognition is things like inference of emotional states, and
detection of phrase boundaries. But in many applications, including a
couple that were mentioned in the article, the bigger problem is
simply getting most of the words right. If you are at a 70% word error
rate, worrying about prosody will not get you to a 30% word error rate
(or at least nobody has, to my knowledge, demonstrated that it will). Thus the
article gives a misleading view of the main issues in the field.

[By the way, I completely agree with Kurt Godden that the standard
word error rate (WER) measure leaves much to be desired, but it
generally correlates reasonably well with performance on a given
task. Still, it is true that for a real application one might want to
report some other measure, like task completion. In speech-based
information retrieval people will generally report standard measures
such as precision and recall: of course these do also correlate with
WER.]
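
[For readers unfamiliar with the measure: WER is conventionally
computed as the word-level edit distance (substitutions, deletions and
insertions) between the recognizer output and a reference transcript,
divided by the number of words in the reference. The sketch below is a
minimal illustration under that standard definition only. The function
name word_error_rate and the sample strings are hypothetical, and it
does not reproduce the scoring of any system mentioned in this
discussion.]

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is a plain
    whitespace split, which real scoring tools refine considerably.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Because insertions count as errors, the value can exceed 1.0 when the
output is much longer than the reference, which is one reason the
measure is often supplemented by task-level measures.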

As far as I can tell from the description, Roberts' system was not a
demonstration that detection of emotion or detection of phrasing
improves recognition. As he says, it seems to demonstrate that there
is "value to be obtained from adding even slightly more sophisticated
constraints in the recognition process". But we knew that already:
speech recognition systems depend upon various kinds of constraints
ranging from the phonotactics of the language, to domain-specific
language models. The fact that these are often trainable statistical
models does not nullify the fact that they incorporate linguistic
knowledge.

Anyway, I fail to see how my argument (even less my attitude, which
one could hardly infer) illustrates the kind of problems that have
been "hampering forward progress in speech recognition". I did not say
that people should not be working on prosodic features: in fact I said
precisely the reverse. I also did not say that people should not be
trying to make use of various kinds of linguistic information in
improving recognition: I am all for that (I am a linguist, not an
engineer, after all).  I was merely trying to make the point that for
many applications, things like detecting emotion do not rank as number
one on the list of things to be solved.

-
Richard Sproat               Information Systems and Analysis Research
rws at research.att.com         AT&T Labs -- Research, Shannon Laboratory
       180 Park Avenue, Room B207, P.O.Box 971
       Florham Park, NJ 07932-0000
----------------http://www.research.att.com/~rws/-----------------------

---------------------------------------------------------------------------
LINGUIST List: Vol-13-2065