13.2044, Disc: New: Accuracy in Speech Recognition: Priorities
LINGUIST List
linguist at linguistlist.org
Wed Aug 7 19:28:44 UTC 2002
LINGUIST List: Vol-13-2044. Wed Aug 7 2002. ISSN: 1068-4875.
Subject: 13.2044, Disc: New: Accuracy in Speech Recognition: Priorities
Moderators: Anthony Aristar, Wayne State U.<aristar at linguistlist.org>
Helen Dry, Eastern Michigan U. <hdry at linguistlist.org>
Reviews (reviews at linguistlist.org):
Simin Karimi, U. of Arizona
Terence Langendoen, U. of Arizona
Consulting Editor:
Andrew Carnie, U. of Arizona <carnie at linguistlist.org>
Editors (linguist at linguistlist.org):
Karen Milligan, WSU; Naomi Ogasawara, EMU
James Yuells, EMU; Marie Klopfenstein, WSU
Michael Appleby, EMU; Heather Taylor, EMU
Ljuba Veselinova, Stockholm U.; Richard John Harvey, EMU
Dina Kapetangianni, EMU; Renee Galvis, WSU
Karolina Owczarzak, EMU; Anita Wang, EMU
Software: John Remmers, E. Michigan U. <remmers at emunix.emich.edu>
Gayathri Sriram, E. Michigan U. <gayatri at linguistlist.org>
Zhenwei Chen, E. Michigan U. <zhenwei at linguistlist.org>
Home Page: http://linguistlist.org/
The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.
Editor for this issue: Karen Milligan <karen at linguistlist.org>
=================================Directory=================================
1)
Date: Sat, 3 Aug 2002 10:47:19 -0400
From: Richard Sproat <rws at research.att.com>
Subject: Re: Linguist 13.2025: Media: NYT - Speech recognition
-------------------------------- Message 1 -------------------------------
Date: Sat, 3 Aug 2002 10:47:19 -0400
From: Richard Sproat <rws at research.att.com>
Subject: Re: Linguist 13.2025: Media: NYT - Speech recognition
The NYT article that Karen S. Chung pointed us to is a pretty good
example of the kind of reporting that anyone who works on speech
technology (or at least anyone who is honest) should cringe at.
The article implies that a major problem in speech recognition is
that we can't detect where sentence boundaries are in running speech,
and that we are only just beginning to be able to detect emotional
content.
How about the more basic problem of getting most of the words right?
Speech recognition systems that might score in the low 90% range in
terms of word accuracy on relatively "clean" tasks such as dictation
or broadcast news can easily fall to the 60-70% range on
conversational speech. And if there isn't sufficient training data for
the domain or the acoustic conditions -- a highly realistic scenario
for tasks such as eavesdropping on potential terrorists -- they can
easily drop into the 30% range. When you are getting two out of every
three words wrong, the fact that you are also unable to detect
sentence boundaries, or tell whether the speaker is angry, somehow
doesn't seem all that critical. And all of this assumes that the
speakers are using English, or one of the handful of other languages
for which there is enough data to train large-vocabulary speech
recognizers. One cannot generally assume that terrorists plotting
their next attack will be speaking one of those languages.
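For concreteness, "word accuracy" here is the standard figure derived
from a minimum-edit-distance alignment between a reference transcript
and the recognizer's hypothesis: accuracy = 1 - (word error rate). A
minimal sketch (the toy transcripts in the test are mine, not from any
evaluation cited here):

```python
def word_accuracy(reference, hypothesis):
    """Word accuracy = 1 - WER, via Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    # WER = total errors / reference length
    return 1.0 - d[len(ref)][len(hyp)] / len(ref)
```

Note that insertions count as errors too, so word accuracy can go
negative on a bad enough hypothesis -- one reason 30% is genuinely
dismal.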
Now I happen to think that trying to detect things like prosodic
phrasing or emotion is worthwhile. Certainly detecting whether someone
is angry can have useful applications: one could, for example, use
that information to route a disgruntled caller to an agent specially
trained to deal with unhappy customers. (There has even been a recent
patent on precisely that application, though the "inventors" were a
bit fuzzy on the implementation details; this may have been the
patent referred to in the NYT article.) And detecting prosodic
phrasing can be useful for such things as deciding how to parse a long
string of numbers, and so forth, as the article points out.
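The call-routing idea reduces to a simple decision rule on a
classifier's output. A hypothetical sketch -- the labels, queue names,
and confidence threshold are my assumptions, not anything specified in
the patent or the article:

```python
def route_call(emotion, confidence, threshold=0.65):
    """Pick a queue from a (label, confidence) pair emitted by an
    emotion classifier. Labels and threshold are illustrative only."""
    if emotion == "angry" and confidence >= threshold:
        # Send to agents specially trained for unhappy customers.
        return "escalation_agents"
    return "general_queue"
```

The interesting engineering is of course in producing a reliable
(emotion, confidence) pair in the first place, not in this rule.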
But I also believe it is important to put these kinds of things in
perspective. For many applications, detecting sentence boundaries or
the speaker's emotional state just ain't number one on the list of
problems to be solved. This should be obvious: if you hand a security
analyst a near-perfect transcription of some speech that omits
sentence boundaries, they are likely to get a whole lot more out of
it than out of a transcript with only 30% of the words correct, but
which puts in sentence boundaries (say with 70% accuracy) and tells
them whether the speaker is angry (say with 65% accuracy).
- Richard Sproat

Richard Sproat                 rws at research.att.com
Information Systems and Analysis Research
AT&T Labs -- Research, Shannon Laboratory
180 Park Avenue, Room B207, P.O. Box 971
Florham Park, NJ 07932-0000
http://www.research.att.com/~rws/
---------------------------------------------------------------------------
LINGUIST List: Vol-13-2044