13.1146, Sum: Vowel Normalization Procedures
linguist at linguistlist.org
Thu Apr 25 01:46:21 UTC 2002
LINGUIST List: Vol-13-1146. Wed Apr 24 2002. ISSN: 1068-4875.
Subject: 13.1146, Sum: Vowel Normalization Procedures
Moderators: Anthony Aristar, Wayne State U.<aristar at linguistlist.org>
Helen Dry, Eastern Michigan U. <hdry at linguistlist.org>
Reviews (reviews at linguistlist.org):
Simin Karimi, U. of Arizona
Terence Langendoen, U. of Arizona
Andrew Carnie, U. of Arizona <carnie at linguistlist.org>
Editors (linguist at linguistlist.org):
Karen Milligan, WSU
Naomi Ogasawara, EMU
James Yuells, EMU
Marie Klopfenstein, WSU
Michael Appleby, EMU
Heather Taylor-Loring, EMU
Ljuba Veselinova, Stockholm U.
Richard John Harvey, EMU
Dina Kapetangianni, EMU
Renee Galvis, WSU
Karolina Owczarzak, EMU
Software: John Remmers, E. Michigan U. <remmers at emunix.emich.edu>
Gayathri Sriram, E. Michigan U. <gayatri at linguistlist.org>
Home Page: http://linguistlist.org/
The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.
Editor for this issue: Marie Klopfenstein <marie at linguistlist.org>
Date: Tue, 23 Apr 2002 10:51:53 +0100
From: Dom Watt <djlw1 at york.ac.uk>
Subject: Vowel Normalisation Procedures
-------------------------------- Message 1 -------------------------------
Some weeks ago I posted a query about vowel normalisation procedures, to which
I had a good number of excellent responses. I've pasted these in below (NB:
some are responses to an identical query I put out on the 'phonet' mailing
list). Many thanks to all who contributed!
Here's my original query in full:
>There is a bewildering choice of vowel formant normalisation
>algorithms available which aim to eliminate, among other things, vocal
>tract length-related effects between speakers that result from age and
>gender differences. Some are based on warpings of F1/F2 space using
>frequencies of higher formants, relationships between formants and F0,
>or on logarithmic transforms of values in linear Hz. Others (Barks,
>mels, critical bandwidths, etc.) use psychoperceptual criteria
>deriving from the non-linear response characteristics of the auditory
>system.
>What, in the opinion of anyone who's had to choose one (or has
>developed one themselves), would be the criterion/criteria by which
>the successfulness, utility and reliability of a normalisation routine
>for formant frequency measurements could be estimated? Do Disner's
>(1980) conclusions about the value of a procedure lying in the degree
>to which it can trade off scatter reduction against 'linguistic
>realism' still hold?
>Disner, S.F. (1980) Evaluation of vowel normalization procedures.
>Journal of the Acoustical Society of America 67(1): 253-261.
And here are the responses, in no particular order - apologies if I've left
anyone out.
>>From Sylvia Moosmüller <sylvia.moosmueller at oeaw.ac.at>:
There has been a paper on vowel normalization strategies at Eurospeech 2001,
Aalborg, held by Patti Adank, Roeland van Hout, and Roel Smits:
"A comparison between human vowel normalization strategies and acoustic
vowel transformation techniques." In: Proceedings of the 7th International
Conference on Speech Communication and Technology, Eurospeech 2001,
Aalborg, Vol. 1, 481-484.
Perceptual and acoustic representations of vowel data were
compared directly to evaluate the perceptual relevance of
several speaker normalization transformations. The acoustic
representations consisted of raw F0 and formant data. The
perceptual representations were obtained through an
experimental procedure, with phonetically trained listeners
as subjects. The raw acoustic data were transformed
according to several normalization schemes. The perceptual
and the acoustic representations were compared using
regression techniques. A z-score transformation of the raw
data appeared to resemble the perceptual data.
Hope this will be of help for you,
>>From Patti Adank <P.Adank at let.kun.nl>:
My guess would be that your criterion should depend on what you want to do
with the transformation's results. Some people are interested in vowel
classification (automatically or by human listeners), while others are more
interested in describing within vowel category variation (allophonic
variation). So for the first task you need a procedure that maximizes
between-category variance, while in the latter case the within-category
variance should primarily be maintained. I think it might be the
case that you need to use different procedures for both tasks. I work in
sociophonetics and am therefore more interested in maintaining the
variation within vowel categories. However, most studies that evaluate
vowel normalization procedures focus only on the classification performance
of the procedures (e.g. Nearey 1978, Syrdal 1984, Deterding 1990); there
are only two studies that evaluate how well within-category variation is
maintained (Hindle 1978 and Disner 1980), but these do not provide
I am writing my PhD thesis on vowel normalization and I am comparing
several 'formant-based' (i.e. some of the ones mentioned by Disner 1980,
and by Terry Nearey in his 1989 JASA article) procedures, like Lobanov's
z-transformation, Gerstman's range transformation, and Syrdal & Gopal's
bark-difference model. I am evaluating how well the 13 procedures I
selected perform on both vowel classification and maintaining within
category variation. I have not finished all of my research but I can give
some preliminary indications if you like.
Overall, I would say that Lobanov's z-score transformation works best for
both classification and maintaining variation, followed by Nearey's logmean
(CLIH4) transformation. It might be the case that Nearey's is better at
maintaining variance, but I will have to find that out in the next few
months. Again, these are preliminary results. I still have to deal with the
fact that, usually, not all vowel categories for a certain speaker will be
available, while Lobanov's procedure requires values for all these categories.
Nearey's might be the best option, since it needs a minimum of two
categories per speaker. So, we're not there yet.
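[For concreteness, here is a minimal sketch of the two front-runners Adank mentions: Lobanov's per-speaker z-scoring of each formant, and a shared log-mean normalization in the spirit of Nearey's CLIH procedure (the CLIH4 variant she evaluates differs in detail). The data layout is illustrative only, not from her study.]

```python
from math import log
from statistics import mean, stdev

def lobanov(formants):
    """Lobanov-style normalization: z-score each formant within a speaker.
    `formants` maps a formant name ('F1', 'F2', ...) to that speaker's
    raw Hz measurements over the full set of vowel tokens."""
    out = {}
    for name, values in formants.items():
        m, s = mean(values), stdev(values)
        out[name] = [(v - m) / s for v in values]
    return out

def shared_log_mean(formants):
    """Log-mean normalization in the spirit of Nearey's CLIH: subtract
    the speaker's grand mean of log-frequencies from every log(F)."""
    logs = {n: [log(v) for v in vals] for n, vals in formants.items()}
    grand = mean(v for vals in logs.values() for v in vals)
    return {n: [v - grand for v in vals] for n, vals in logs.items()}

# Toy data for one speaker: three tokens' F1 and F2 in Hz.
speaker = {"F1": [300.0, 700.0, 500.0], "F2": [2300.0, 1200.0, 1500.0]}
z = lobanov(speaker)          # each formant now has mean 0, unit variance
n = shared_log_mean(speaker)  # log values centred on the speaker's log-mean
```

The sketch also makes the practical problem above visible: the statistics are computed over the speaker's full vowel set, so missing vowel categories bias the means and standard deviations.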
Regarding Disner's study: her conclusions seem to be still valid; it is
still not advisable to compare markedly different phonological systems
directly to each other, especially with procedures that put a lot of
emphasis on the mean and standard deviations (like Nearey's and Lobanov's).
It might be a better idea to compare communities only if you have enough
values to calculate the means; differences between communities in vowel
targets might be interpretable even without normalization.
I have presented a paper at the last Eurospeech conference on this issue.
Would you be interested in reading this paper?
I hope this answers your question,
Dept. of Linguistics
University of Nijmegen
Deterding, D. "Speaker normalization for automatic speech recognition." PhD
thesis, University of Cambridge, 1990
Disner, S.F. "Evaluation of Vowel Normalization Procedures", J. Acoust.
Soc. Amer., Vol. 67, 1980, p 253-261.
Hindle, D. "Approaches to Vowel Normalization in the Study of Natural
Speech", In: Linguistic Variation: Models and Methods. Ed. D. Sankoff, New
York. Academic Press. 1978.
Lobanov, B.M. "Classification of Russian Vowels Spoken by Different
Speakers", J. Acoust. Soc. Amer., Vol. 49, 1971, p 606-608.
Nearey, T.M "Applications of Generalized Linear Modeling to Vowel Data",
Proceedings ICSLP 92, p 583-586.
Nearey, T.M. "Phonetic feature systems of vowels." PhD thesis, Indiana
University Linguistics Club, 1978
Syrdal, A. "Aspects of a model of the auditory representation of American
English vowels." Speech Communication 4, 121-135 1984.
Syrdal, A. and Gopal, H. A. "Perceptual Model of Vowel Recognition based on
the Auditory Representation of American English Vowels", J. Acoust. Soc.
Amer., Vol. 79, 1986, p 1086-1100.
>>From Rob Hagiwara <robh at cc.umanitoba.ca>:
Re your normalization question, I've been working on expanding the
autonormalization procedure I developed in my dissertation, where you take a
full suite of vowels and calculate average F1, F2, F3 etc. frequencies (and
cross-correlate them back to a particular resonating length, i.e. they
should be 1:3:5 multiples of each other, or close), and then express the
deviation (in a token or a class of tokens) from these averages in either
%Hz or Bark-distance. The idea has caught on in a few circles, but it's
difficult to operationalize in anything but a formal experimental context.
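[The core of the idea, expressing a token's deviation from the speaker's average formant frequencies in %Hz, can be sketched as below. The function name and data layout are illustrative, not Hagiwara's actual procedure, and the Bark-distance variant is omitted.]

```python
from statistics import mean

def percent_hz_deviation(tokens, averages):
    """Express each token's formant frequencies as percent deviation
    from the speaker's average formant frequencies (a sketch of the
    general idea, not Hagiwara's exact procedure)."""
    return [
        {f: 100.0 * (tok[f] - averages[f]) / averages[f] for f in tok}
        for tok in tokens
    ]

# Toy data: two vowel tokens from one speaker, F1/F2 in Hz.
tokens = [{"F1": 300.0, "F2": 2300.0}, {"F1": 700.0, "F2": 1100.0}]
averages = {f: mean(t[f] for t in tokens) for f in tokens[0]}
deviations = percent_hz_deviation(tokens, averages)
# deviations[0]["F1"] is -40.0: this token's F1 sits 40% below average
```

As with the 1:3:5 check, the averages only make sense over a full, balanced suite of vowels, which is one reason the method is hard to apply outside a controlled experimental context.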
>>From David Deterding <dhdeter at nie.edu.sg>:
>My PhD thesis was:
>David Deterding, Speaker Normalisation for Automatic Speech Recognition,
>Unpublished PhD Thesis, Cambridge University, 1990.
>and it concentrated largely on normalisation procedures for vowels. Although
>I tried out all kinds of methods, including formant mappings and
>whole-spectrum shifts, I really can't offer an answer. Perhaps the only
>contribution I can make is some help in understanding the problem -- but
>maybe you already understand the problem well enough!
>If you think it might be useful, you could obtain a copy from Cambridge
>University Library. But maybe someone will provide you with a clear-cut
>answer, so you won't need to access my work.
>>From H.M. Hubey <hubeyh at mail.montclair.edu>:
>There are very good reasons why topics dealing with the complexity
>of fluid dynamics should use the one very successful method.
>That method is "dimensional analysis" and is used in fluid dynamics
>extensively. It is the one thing that allowed Prandtl to connect
>the theoretical results of fluid dynamics (the Navier-Stokes equations)
>with experimental results. And since then many other dimensionless
>groups have been developed and experimentally fitted to data.
>I am the only one to have used this method to derive results in
>this field. It can be found in my book, Mathematical and Computational
>Linguistics (Lincom Europa) and in my paper in the Journal of
>Quantitative Linguistics, "Vector Phase Space for Speech Analysis
>via Dimensional Analysis", Volume 6, Number 2, August 1999.
>>From Dylan Herrick <herrick at ling.ucsc.edu>:
>I'm afraid that I don't have any answers for your question about vowel
>normalization procedures. However, I wanted to let you know that I am
>deeply interested in the responses you might get. You see, I am working on
>a phonetic study of the Catalan vowel system at the moment, and I am
>wondering how to best combine data from various speakers.
>The one (relatively) recent paper that I have seen which takes a fairly
>opinionated view of vowel normalization was:
>Yang, Byunggon. 1996. A comparative study of American English and Korean
>vowels produced by male and female speakers. Journal of Phonetics
>Judging from your message, you have already seen this paper. As I recall,
>the author mentions that he argued for the value of a psychoperceptual
>approach to vowel normalization in an earlier paper
>(which I have been unable to locate). If nothing else, his paper offers a
>model for how vowel normalization could be done - normalizing for vocal
>tract length & using mel scale (or bark... I forget) instead of Hz. I have
>no idea how this relates to Disner's paper (which I have not read).
>>From Bill Labov <labov at earthlink.net> ('Plotnik' is Labov's dedicated vowel
formant frequency plotting program, available for download at
http://www.ling.upenn.edu/~labov/Plotnik.html I had asked him earlier
whether the normalization algorithm he had programmed into Plotnik was
Nearey's (1977) routine, since this is the method favoured in his (Labov's)
recent 'Principles of Linguistic Change, vol II: Social Factors', 2001,
Blackwell - see pp157-164):
> Yes, the Nearey log mean normalization is available in Plotnik.
>The documentation gives general information about it as well as the
>instructions. In the second volume of Principles of Linguistic Change,
>Chapter 5, there is an account of the empirical justification for the
>use of that algorithm. It's worked out very well in the Atlas of North
>American English, where I've superimposed 440 speakers in a single view. At the
>same time, it's not an answer to the question of how speakers actually
>do normalize, which operates efficiently with one or two utterances, and
>doesn't need hundreds.
>>From Bill Idsardi <idsardi at UDel.Edu>:
The best discussion I know of for this is Rosner and Pickering, _Vowel
Perception and Production_
Oxford UP 1994, chapter 5.
>>From Mark Huckvale <M.Huckvale at ucl.ac.uk>:
>I think Rosner & Pickering's book "Vowel Perception and Production" has
>more recent data on the evaluation of normalisation metrics. You
>may have looked there already.
Department of Language & Linguistic Science
University of York
York YO10 5DD
Tel 01904 432665
Fax 01904 432673