LINGUIST List: Vol-17-65. Wed Jan 11 2006. ISSN: 1068-4875.
Subject: 17.65, Review: Computational Ling: Nass & Brave (2005)
Moderators: Anthony Aristar, Wayne State U <aristar at linguistlist.org>
Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>
Reviews (reviews at linguistlist.org)
Sheila Dooley, U of Arizona
Terry Langendoen, U of Arizona
Homepage: http://linguistlist.org/
The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.
Editor for this issue: Lindsay Butler <lindsay at linguistlist.org>
================================================================
What follows is a review or discussion note contributed to our
Book Discussion Forum. We expect discussions to be informal and
interactive; and the author of the book discussed is cordially
invited to join in. If you are interested in leading a book
discussion, look for books announced on LINGUIST as "available
for review." Then contact Sheila Dooley at dooley at linguistlist.org.
===========================Directory==============================
1)
Date: 09-Jan-2006
From: Richard Sproat < rws at uiuc.edu >
Subject: Wired for Speech
-------------------------Message 1 ----------------------------------
Date: Wed, 11 Jan 2006 17:03:21
From: Richard Sproat < rws at uiuc.edu >
Subject: Wired for Speech
AUTHOR: Nass, Clifford; Brave, Scott
TITLE: Wired for Speech
SUBTITLE: How Voice Activates and Advances the Human-Computer
Relationship
PUBLISHER: MIT Press
YEAR: 2005
Announced at http://linguistlist.org/issues/16/16-2705.html
Richard Sproat, Departments of Linguistics and ECE, University of
Illinois at Urbana-Champaign
OVERVIEW
The topic of this book is voice user interfaces, an example of which is
the automated system that one interacts with when one calls United
Airlines and wishes to check on the arrival or departure time for a
flight. Other examples include systems where one speaks to a
graphical avatar (a ''talking head'') that serves as an automated
information kiosk; or the National Oceanic and Atmospheric
Administration's Weather Radio, which presents the weather using a
text-to-speech (TTS) synthesizer. In short, a voice user interface is
any automated system that allows a user to access information,
possibly with automatic speech recognition (ASR) for voice input, and
with either prerecorded prompts or TTS technology to produce output.
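In schematic terms, the control flow of such a system might look like
the following Python sketch. This is only a schematic illustration:
recognize(), lookup_flight_status(), and speak() are hypothetical
stand-ins for an ASR engine, a back-end query, and a TTS engine (or
prerecorded prompt player), not any particular vendor's API; here they
are stubbed out so the sketch runs.

  # Minimal sketch of a voice user interface loop. The three helpers
  # are hypothetical stand-ins, stubbed for illustration only.
  def recognize():
      return input("(caller says) ")        # stand-in for ASR

  def lookup_flight_status(flight):
      return "on time" if flight else None  # stand-in for a database query

  def speak(text):
      print(text)                           # stand-in for TTS / prompt playback

  def vui_session():
      speak("Welcome. Which flight would you like to check on?")
      while True:
          utterance = recognize()
          if not utterance:
              speak("Sorry, I did not catch that.")
              continue
          status = lookup_flight_status(utterance)
          if status is None:
              speak("I could not find that flight. Please try again.")
              continue
          speak("Flight %s is %s." % (utterance, status))
          return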
The book is not about the technology underlying voice user interfaces.
Rather, it is about how humans react to them and interact with them in
controlled experiments, and how this information should guide the
design of the ''persona'' that the interface presents to the world.
The book is divided into fourteen main chapters and a two-page
summary chapter. After a brief stage-setting first chapter, the authors
turn in chapters 2-3 to their first topic, namely the gender of voices.
How do people react to synthetic male versus female voices? How
does gender stereotyping affect people's perception of the quality or
believability of a voice user interface system? Can one get around
user prejudices by having a ''gender neutral'' voice?
Chapters 4-5 turn to the issue of voice ''personality''. People infer
many aspects of other people's personality from the way they talk and
the kinds of words they use. And, as it turns out, people
impute ''personalities'' to voice user interfaces. As with gender,
preconceived notions about personalities have a strong effect on the
user's perception of a voice user interface.
Chapter 6 deals with the issue of regional or foreign accents and
perceived ethnicity. Once again people's prejudices about accent and
race carry over to machines, even though the notion that a machine
has a geographical or ethnic background is obviously absurd.
Chapters 7-8 discuss emotion and how that should be expressed, or
not expressed, in voice user interfaces. One of the clear suggestions
of this section of the book is that, where possible, it is important for a
voice user interface to match its emotion to the (expected) emotional
state of the user.
Chapter 9 asks when and how a voice user interface should use
multiple voices. A couple of conclusions are drawn: first, if multiple
voices are used, they should be matched to the tasks being
performed. For example, the authors suggest using an officious-sounding
voice to guide users through a complex menu system and a warm,
friendly-sounding voice to reassure users that they are being
guided to the right place. Second, despite the common notion
of ''voice fonts'' (e.g. Raman 2004), users do not treat different voices
the same way as they treat different textual fonts since a change of
voice has social implications that a change of font does not.
Chapter 10 deals with the question of whether voice interfaces should
say ''I'', and thereby make perceived claims to being human. From the
authors' experiments, it seems that systems that use synthetic (TTS)
voices should not say ''I''.
Chapter 11 deals with recorded speech versus TTS, and real faces
versus synthetic faces, and concludes that people react better to a
system that has either a synthetic face speaking with a synthetic
voice, or a real face speaking with a real voice; users do not like it
when the conditions are crossed.
Chapter 12 argues that it is generally bad to mix obviously recorded
speech with obviously synthetic speech: for example, it would be a
bad design choice to have a system produce a canned phrase
like ''Good Morning Ms.'' using a prerecorded voice, and then finish
the utterance with an obviously synthetic voice saying the name of the
user. This, at least, is one section of the book that will seem obvious
to anyone who has worked on the technology of speech synthesis: we
have known for a long time that it is not a good idea to mix high-quality
prerecorded speech with poor-quality synthesis. The chapter also
contains a discussion of humor, though it is not obvious how this
relates to the main topic of the chapter.
The final two chapters, 13-14, shift the ground from voice (and video)
output to voice input. Chapter 13 deals with the issue of how
comfortably people will interact when they know, or are constantly
reminded, that they are being recorded. A set of experiments
comparing various kinds of attached microphones with unobtrusive
array microphones showed that users with the less obtrusive
microphones were more creative in their responses and more willing
to disclose sensitive information. Finally, chapter 14 discusses what
systems should do when they misrecognize a user: what are the
relative costs and benefits of the system accepting blame (''I'm sorry, I
did not understand you'') versus placing the blame on the user (''You
are speaking too quickly, please slow down.'').
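The design choice discussed in chapter 14 is easy to state in code.
The sketch below, whose prompt wordings are modeled on the authors'
examples, shows how a dialog manager might parameterize the blame
framing of its re-prompts; the strategy parameter and the
fall-back-to-an-operator step are invented illustrations, not rules
from the book.

  # Sketch of the two blame framings for recognition errors.
  PROMPTS = {
      "system_blame": "I'm sorry, I did not understand you.",
      "user_blame": "You are speaking too quickly. Please slow down.",
  }

  def reprompt(strategy, error_count):
      """Return a re-prompt after the ASR fails to understand the user."""
      if error_count > 2:
          # After repeated failures, many deployed systems route the
          # caller to a human agent (an assumption, not the book's advice).
          return "Let me transfer you to an operator."
      return PROMPTS[strategy] + " Please repeat your request."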
Three main themes run through this book.
The first is simply this: we are ''wired'' for speech. Even though users
know they are dealing with an automated system, if the system takes
speech as input, or produces speech as output, users cannot help but
treat the system as if it were another human, and will apply the same
beliefs and prejudices to the automated system as they would to a
human who behaved the same way. That is, if a person feels more
comfortable with a male human explaining how to operate a complex
piece of equipment than with a female, then that prejudice will carry
over to an automated assistant that has a female voice. This first
point is consistent with previous work from Nass's lab: in general,
people seem to treat computers as if they were people, even though
they know full well that they are not (Nass, Steuer & Tauber, 1994).
The second theme is that there is no way around the first theme: for
instance, we cannot solve the problem of speakers' inherent gender
biases by making a system with a voice of ambiguous gender. Users
will just think that the system is weird and will react to it more negatively than if
the system clearly indicates that it is ''female'' or ''male''.
Finally, user perceptions of voice user interfaces have direct
implications for users' views of whatever service or product the system
is trying to sell. Just as a skilled salesman can make a product seem
more desirable than it otherwise would, so a well-designed voice
interface can make claims of a product's value seem more believable.
DETAILED CRITIQUE
To place the current research in some historical perspective it is worth
noting that Nass's research on ''Computers as Social Actors'' was the
inspiration for Microsoft ''Bob'', which, after its demise, led eventually
to ''Clippy'', the Microsoft Office automated assistant. Neither of these
products was well received, and there has been much
discussion of why (e.g., Schwartz, 2003), a topic that would take us
beyond the scope of this review. The authors are evidently very
proud of their long experience providing user-interface design
advice to corporations: the preface to the book is highly self-laudatory,
and contains a fairly long list of consulting contracts that Nass's lab
has had with various companies over the years, including such varied
companies as BMW, Charles Schwab, General Magic, Macromedia,
NTT, Philips and US West.
In the overview above, reference was made to experiments conducted
by the authors to validate their claims about design issues for voice
user interfaces, and it is worth summarizing one of those experiments
just to give a flavor of the kind of research the authors performed. In
assessing the importance of gender stereotyping, the authors
conducted an experiment in which participants were directed to an
online auction site offering stereotypically male and stereotypically
female merchandise, with descriptions taken from eBay. The
descriptions were read to the listeners with either a female or a male
voice generated with the Festival TTS system (Taylor, Black &
Caley, 1998). Subjects were then asked to rate how credible the
description they heard was. The results of this experiment (reported
on pages 25-27) were that subjects rated the product descriptions as
more credible if the gender of the voice matched the ''gender'' of the
product.
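For readers who want to try something similar, the same manipulation
is easy to reproduce with Festival's text2wave utility. The sketch
below is an assumption about how such stimuli could be generated, not
a description of the authors' actual procedure; the voice names and
the product description are examples only, and the voices must
actually be installed on the local system.

  # Hedged sketch: synthesizing one product description twice, with a
  # male and a female Festival voice. Voice names are illustrative.
  import subprocess

  def synthesize(text, voice, outfile):
      subprocess.run(
          ["text2wave", "-o", outfile, "-eval", "(voice_%s)" % voice],
          input=text.encode(), check=True)

  description = "A rugged, full-grain leather tool belt."  # invented stimulus
  synthesize(description, "kal_diphone", "male.wav")    # male US English voice
  synthesize(description, "us1_mbrola", "female.wav")   # female US English voice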
While the focus of the book is on the use of the technology rather
than on the technology itself, one cannot forget that using a
technology well presumes some understanding of it.
From the technological perspective there are a couple of points of
interest about this book. First, I personally found it noteworthy that
the majority of the discussion focuses on synthesis rather than
recognition. This is the opposite of the emphasis in the speech
technology community, where synthesis has long taken a back seat to
recognition, and where synthesis has traditionally been regarded as
much easier than recognition. But the focus of the current book on
synthesis is, after all, natural: although speech recognition is an
important part of many voice user interfaces, it is the voice with which
the system speaks that gives it its ''personality'' and its apparent
human-like qualities.
Second, it is unfortunately the case that the authors do not always
seem to understand the technology that they are evaluating. On
several occasions they imply that changing a system's voice, its
emotion, or its gender is a straightforward process. This is
misleading on a number of levels.
First, consider emotion. While Nass and Brave are correct that many
of the acoustic correlates of emotion are known, and while it is true
that rendering emotion in synthetic speech has been a research topic
since Cahn's work (Cahn, 1989), it is still not possible to produce
convincing renditions of all emotions. Second, while it might seem
easy in general to change the voice or the gender used by the
system, in practice there are limitations. To understand why, it is
necessary to briefly remind the reader of the various methods used to
produce speech output in TTS systems. The oldest approach,
exemplified by the Klatt synthesizer (Klatt 1980) and its commercial
offspring DECTalk, is a fully parametric system where all parameters
of the voice, including pitch, formant values, spectral tilt, and many
others, are controllable. In such a system it is indeed in principle easy
to produce new voices --- but at a cost: the quality of the resulting
speech sounds distinctly mechanical, largely because we do not have
good models of how to control the parameters over time. Such
limitations in our understanding have been sidestepped in much of the
recent work on ''unit selection'' based methods. These methods,
pioneered in work on the CHATR system (Hunt & Black, 1996), and
exemplified in commercial systems such as AT&T's ''Natural Voices'',
depend upon a huge database of speech from a single speaker. During
synthesis, a set of units matching the intended utterance as closely
as possible is selected on the fly from the database. The resulting
speech can sound very good in the best case --- and downright silly in
the worst. But one of the practical discoveries of this work is that the
less one fiddles with the speech, the better and more natural the
resulting synthesis sounds. This means that modifications of the speech,
such as changing the pitch, are to be eschewed. The result: if you want
a different voice, you have to record a different speaker, and analyze
their speech. If you want a different emotion, you have to record your
speaker performing speech with that emotion. This is certainly a lot
easier to do than it used to be, but at a minimum one is looking at
recording an hour's worth of speech. This clearly involves more than
turning a few knobs, which is all that Nass and Brave seem to imply is
needed.
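To make the unit-selection search itself concrete: in the Hunt &
Black (1996) formulation, each candidate unit pays a target cost for
mismatching its specification and a join cost for concatenating
poorly with its predecessor, and a Viterbi-style dynamic program
picks the cheapest sequence. The Python sketch below shows that
search in miniature; the candidate lookup and cost functions in the
toy usage are stand-ins for the real acoustically motivated measures.

  # Miniature unit-selection search (after Hunt & Black 1996).
  # candidates(spec) returns the database units matching a target
  # specification; target_cost and join_cost are caller-supplied.
  def select_units(targets, candidates, target_cost, join_cost):
      """Viterbi-style search for the cheapest sequence of units."""
      # best maps each candidate for the current position to
      # (cumulative cost, path of units ending in that candidate)
      best = {u: (target_cost(targets[0], u), [u])
              for u in candidates(targets[0])}
      for spec in targets[1:]:
          new_best = {}
          for u in candidates(spec):
              # cheapest way to extend any surviving path with unit u
              cost, prev_path = min(
                  ((c + join_cost(p[-1], u), p) for c, p in best.values()),
                  key=lambda cp: cp[0])
              new_best[u] = (cost + target_cost(spec, u), prev_path + [u])
          best = new_best
      return min(best.values(), key=lambda cp: cp[0])[1]

  # Toy usage: units are just labels, and the costs are made up.
  units = {"a": ["a1", "a2"], "b": ["b1"]}
  path = select_units(
      ["a", "b"],
      lambda spec: units[spec],
      lambda spec, u: 0 if u.startswith(spec) else 1,
      lambda u1, u2: 0.5)
  # path == ["a1", "b1"]

Real systems prune this search heavily, since the candidate sets can
run into the thousands, but the structure of the computation is the
same.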
Turning away from technological issues, there are problems with the
design of the book itself. As the authors state at the outset, they use
endnotes extensively for background information that can be freely
skipped by the casual reader. For example, the data and statistical
analyses of all the experiments are presented in endnotes, not in the
body of the text, which merely summarizes the results. Also,
bibliographic references are all given in the endnotes. This design
choice has both good and bad aspects. It surely helps the non-
specialist reader, who will not necessarily be inclined to look at the
authors' data in detail, but will be satisfied with the authors' synopsis
of the results. But it is annoying for someone who has a technical
background in the field, since following up any single point
necessitates thumbing to the back of the book. The lack of a standard
bibliography is also an extremely bothersome feature of the book.
Negatives aside, this is a book worth reading by anyone interested in
speech technology. Those of us who have worked on developing the
technology underlying voice user interfaces have traditionally not
thought much about the actual design of the end product. Nass and
Brave have clearly thought about these issues more than anyone
else.
Still, while it is useful to understand what features of speech work best
for which applications, we should not lose sight of the fact that the
underlying technology is itself immature, and that just building a
system that can communicate effectively with inexperienced users is
still a challenge. In ''The Restaurant at the End of the Universe'', the
second book in Douglas Adams' ''Hitchhiker'' series, Ford Prefect (the
Betelgeusian companion of the hero, Arthur Dent) berates the
Golgafrincham colonizers of prehistoric Earth for not having made
much progress on the invention of the wheel. A marketing consultant
fires back at Ford and asks him, if he is so smart, what color it should
be.
REFERENCES
Cahn, J. 1989. ''Generating Expression in Synthesized Speech.''
Master's thesis, Massachusetts Institute of Technology.
Hunt, A. and Black, A. 1996. ''Unit selection in a concatenative speech
synthesis system using a large speech database.'' Proceedings of
ICASSP 96, vol. 1, pp. 373-376, Atlanta, Georgia.
Klatt, D. 1980. ''Software for a cascade/parallel formant synthesizer.''
Journal of the Acoustical Society of America, 67.3, 971-995.
Nass, C., Steuer, J. and Tauber, E. 1994. ''Computers are social
actors.'' Proceedings of the CHI Conference, pp. 72-77, Boston, MA.
Raman, T. V. 2004. ''Emacspeak -- The Complete Audio Desktop.''
http://emacspeak.sourceforge.net/
Schwartz, L. 2003. ''Why people hate the paperclip: Labels,
appearance, behavior and social responses to user interface agents.''
Master's thesis, Stanford University.
Taylor, P., Black, A. and Caley, R. 1998. ''The architecture of the
Festival Speech Synthesis System.'' 3rd ESCA Workshop on Speech
Synthesis, pp. 147-151, Jenolan Caves, Australia.
ABOUT THE REVIEWER
Richard Sproat is professor in the departments of Linguistics and
Electrical & Computer Engineering at the University of Illinois at
Urbana-Champaign. His interests include multilingual text processing
and speech technology. Prior to coming to the University of Illinois,
Sproat worked in industrial research at AT&T Bell Laboratories, with
his primary area of research being text-to-speech synthesis. Sproat
was one of the main architects of the Bell Labs multilingual text-to-
speech synthesizer. He was also involved in the design of the SABLE
text-to-speech markup language, a precursor to the W3C's SSML.
-----------------------------------------------------------
LINGUIST List: Vol-17-65