13.2044, Disc: New: Accuracy in Speech Recognition: Priorities

LINGUIST List linguist at linguistlist.org
Wed Aug 7 19:28:44 UTC 2002


LINGUIST List:  Vol-13-2044. Wed Aug 7 2002. ISSN: 1068-4875.

Subject: 13.2044, Disc: New: Accuracy in Speech Recognition: Priorities

Moderators: Anthony Aristar, Wayne State U.<aristar at linguistlist.org>
            Helen Dry, Eastern Michigan U. <hdry at linguistlist.org>

Reviews (reviews at linguistlist.org):
	Simin Karimi, U. of Arizona
	Terence Langendoen, U. of Arizona

Consulting Editor:
        Andrew Carnie, U. of Arizona <carnie at linguistlist.org>

Editors (linguist at linguistlist.org):
	Karen Milligan, WSU 		Naomi Ogasawara, EMU
	James Yuells, EMU		Marie Klopfenstein, WSU
	Michael Appleby, EMU		Heather Taylor, EMU
	Ljuba Veselinova, Stockholm U.	Richard John Harvey, EMU
	Dina Kapetangianni, EMU		Renee Galvis, WSU
	Karolina Owczarzak, EMU		Anita Wang, EMU

Software: John Remmers, E. Michigan U. <remmers at emunix.emich.edu>
          Gayathri Sriram, E. Michigan U. <gayatri at linguistlist.org>
          Zhenwei Chen, E. Michigan U. <zhenwei at linguistlist.org>

Home Page:  http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.



Editor for this issue: Karen Milligan <karen at linguistlist.org>

=================================Directory=================================

1)
Date:  Sat, 3 Aug 2002 10:47:19 -0400
From:  Richard Sproat <rws at research.att.com>
Subject:  Re: Linguist 13.2025: Media: NYT - Speech recognition

-------------------------------- Message 1 -------------------------------

Date:  Sat, 3 Aug 2002 10:47:19 -0400
From:  Richard Sproat <rws at research.att.com>
Subject:  Re: Linguist 13.2025: Media: NYT - Speech recognition


The NYT article that Karen S. Chung pointed us to is a pretty good
example of the kind of reporting that anyone who works on speech
technology (or at least anyone who is honest) should cringe at.

The article implies that a major problem in speech recognition is that
we cannot detect where sentence boundaries fall in running speech, and
that we are only beginning to be able to detect emotional content.

How about the more basic problem of getting most of the words right?
Speech recognition methods that might score in the low 90% range in
terms of word accuracy on relatively "clean" tasks such as dictation
or broadcast news can easily fall to the 60-70% range on
conversational speech. And if there isn't sufficient training data for
the domain or the acoustic conditions -- a highly realistic scenario
for tasks such as eavesdropping on potential terrorists -- then they
easily drop into the 30% range. When you are getting two out of every
three words wrong, the fact that you are also unable to detect
sentence boundaries, or whether the speaker is angry, somehow doesn't
seem to be that critical. And all of this assumes that the people are
speaking English, or one of the handful of other languages for which
there is enough data to train large vocabulary speech recognizers. One
cannot generally assume that terrorists plotting their next attack
will be speaking one of those languages.
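For readers unfamiliar with how word-accuracy figures like these are
produced: the standard practice (not spelled out in the original post)
is to align the recognizer's output against a reference transcript by
minimum edit distance and count substitutions, deletions, and
insertions. A minimal sketch, with illustrative sentences of my own
choosing:

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Word accuracy = 1 - WER, where WER = (subs + dels + ins) / ref length,
    computed by a Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # match/substitution
    return 1.0 - dp[len(ref)][len(hyp)] / len(ref)

# Two deletions and one substitution against a 5-word reference
# give 1 - 3/5 = 40% word accuracy:
print(word_accuracy("we will meet at noon", "we meet soon"))
```

Note that because insertions count as errors, word accuracy defined
this way can even go negative on very bad output, which is why the
"30% range" figures in the text are as damning as they sound.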

Now I happen to think that trying to detect things like prosodic
phrasing or emotion is worthwhile. Certainly detecting if someone is
angry can have useful applications: one could for example use that
information to route a disgruntled caller to an agent specially
trained to deal with unhappy customers. (There has even been a recent
patent on precisely that application, though the "inventors" were a
bit fuzzy on the implementation details; this may have been the
patent referred to in the NYT article.) And detecting prosodic
phrasing can be useful for such things as deciding how to parse a long
string of numbers, and so forth, as the article points out.

But I also believe it is important to put these kinds of things in
perspective. For many applications detecting sentence boundaries or
the speaker's emotional state just ain't number one on the list of
problems to be solved. This should be obvious: if you hand a security
analyst a near-perfect transcription of some speech that omits
sentence boundaries, they are likely to get a whole lot more out of it
than out of a transcript with only 30% of the words correct, but which
marks sentence boundaries (say with 70% accuracy) and indicates
whether the speaker is angry (say with 65% accuracy).

- Richard Sproat

--
Richard Sproat               Information Systems and Analysis Research
rws at research.att.com         AT&T Labs -- Research, Shannon Laboratory
180 Park Avenue, Room B207, P.O. Box 971
Florham Park, NJ 07932-0000
http://www.research.att.com/~rws/

---------------------------------------------------------------------------
LINGUIST List: Vol-13-2044


