Conf: Workshop Advances in Speech Technologies, June 24, 2011, IRCAM, Paris

Thierry Hamon thierry.hamon at UNIV-PARIS13.FR
Fri May 6 14:11:28 UTC 2011

Date: Thu, 5 May 2011 16:01:53 +0200
From: Nicolas OBIN <Nicolas.Obin at>
Message-Id: <4766B302-13DA-47A2-8B48-89CC964FD4CE at>

I am pleased to announce the seminar "Advances in Speech
Technologies", which will take place at IRCAM on Friday, June 24, 2011,
starting at 9:30am.

The seminar will be held in English.

Nicolas OBIN: PhD Student at IRCAM Analysis/Synthesis Team - 
contact | nobin at -
☎ : 33 (0) 1 44 78 48 90, Fax : 33 (0)1 44 78 15 40


Friday, June 24, 2011
in the Stravinsky conference room, IRCAM, Paris.

IRCAM, Music and ... Speech.

"Through its expressive power, through its permanence relative to the
instrumental world, through its power to fuse with a text, through its
capacity to reproduce sounds that cannot be classified by grammars
(the grammar of language as well as the grammar of music), the voice
can submit to the hierarchy, integrate into it, or break free of it
entirely. An immediate medium, not inescapably subject to cultural
constraint in order to communicate or to express, the voice can be, as
much as a cultivated instrument, a 'wild', irreducible tool."

Pierre Boulez, Automatisme et décision, Jalons (pour une décennie):
dix ans d'enseignement au Collège de France (1978-1988), Paris,
Christian Bourgois, 1989.

On Friday, June 24, 2011, this workshop will feature leading figures
in speech processing, who will present work in progress in speech
technologies, from recognition and synthesis to interaction.

Free admission

                              * - * - *

9:30am - 10:00am
Axel Roebel and Xavier Rodet, IRCAM - Analysis and Synthesis Team.

"Speech analysis, synthesis and transformation in the
Analysis/Synthesis team at IRCAM"

Over the past seven years or so, the interest of composers and musical
assistants at IRCAM in speech synthesis and transformation techniques
has grown steadily. As a result, speech processing has become one of
the central research objectives of the Analysis/Synthesis team at
IRCAM. This introduction presents some of the key results of this
research, with examples notably related to the estimation of the
spectral envelope, the estimation of the LF glottal pulse model
parameters, text-to-speech synthesis, shape-invariant signal
transformation in the phase vocoder, speaker transformation, voice
conversion, and the transformation of emotional states.

                              * - * - *

10:00am - 11:00am
Jean-François Bonastre, Laboratoire d'Informatique d'Avignon -
Université d'Avignon.

"Speaker Recognition: a New Binary Representation"

The main approaches to speaker recognition are based on statistical
modeling of the acoustic space. This modeling usually relies on a
Gaussian Mixture Model (GMM), called the Universal Background Model
(UBM), with a large number of components, trained on a large set of
speech data gathered from hundreds of speakers. Each target model is
derived from the UBM by MAP adaptation of the Gaussian mean parameters
only. An important evolution of the UBM/GMM paradigm was to consider
the UBM as defining a new data representation space, formed by
concatenating the Gaussian mean parameters. This space, called the
"supervector" space, made it possible to use Support Vector Machine
(SVM) classifiers fed with supervectors. A second evolutionary step
was the direct modeling of session variability in the supervector
space using the Joint Factor Analysis (JFA) approach. More recently,
the Total Variability Space was introduced as an evolution of JFA. It
consists in modeling the total variability in the supervector space in
order to build a smaller space that concentrates the information and
in which it is easier to jointly model session and speaker
variability. Looking at this evolution, three remarks can be made.
First, the evolution is always linked to large models with thousands
of parameters. Second, the new approaches are largely unable to work
at the frame-by-frame level. Finally, these approaches rely on the
general statistical paradigm in which a piece of information is
considered strong when it occurs very often.

This talk analyzes the consequences of these remarks and presents a
new paradigm for speaker recognition, based on a discrete binary
representation, which is able to overcome the limitations of the
previous approaches.

                              * - * - *

11:00am - 12:00pm
Nick Campbell, Centre for Language & Communications Studies - Trinity
College, Dublin.

"Talking with Robots"

This talk describes a robot interface for gathering conversational
data currently on exhibition in the Science Gallery of Trinity College.
We use a small LEGO NXT Mindstorms device as a platform for a
high-definition webcam and microphones, in conjunction with a
finite-state dialogue machine and recordings of several human
utterances that are played back through a sound-warping device to
sound as if the robot is speaking them. Visual processing using OpenCV
forms the core of the
device, interacting with the discourse model to engage passers-by in a
brief conversation so that we can record the exchange in order to
learn more about such discourse strategies for advanced human-computer
interaction.

                              * - * - *

12:00pm - 1:00pm
Simon King, Centre for Speech Technology Research - The University of
Edinburgh.

"Synthetic Speech: Beyond Mere Intelligibility"

Some text-to-speech synthesisers are now as intelligible as human
speech. This is a remarkable achievement, but the next big challenge
is to approach human-like naturalness, which will be even harder. I
will describe several lines of research which are attempting to imbue
speech synthesisers with the properties they need to sound more
"natural" - whatever that means.

The starting point is personalised speech synthesis, which allows the
synthesiser to sound like an individual person without requiring
substantial amounts of their recorded speech. I will then describe how
we can work from imperfect recordings or achieve personalised speech
synthesis across languages, with a few diversions to consider what it
means to sound like the same person in two different languages and how
vocal attractiveness plays a role.

Since the voice is not only our preferred means of communication but
also a central part of our identity, losing it can be
distressing. Current voice-output communication aids offer a very poor
selection of voices, but recent research means that soon it will be
possible to provide people who are losing the ability to speak,
perhaps due to conditions such as Motor Neurone Disease, with
personalised communication aids that sound just like they used to,
even if we do not have a recording of their original voice.

There will be plenty of examples, including synthetic child speech,
personalised synthesis across the language barrier, and the
reconstruction of voices from recordings of disordered speech.

This work was done with Junichi Yamagishi, Sandra Andraszewicz, Oliver
Watts, Mirjam Wester and many others.
