Call: INTERSPEECH 2014, Special Sessions

Thierry Hamon hamon at LIMSI.FR
Tue Feb 11 20:39:46 UTC 2014

Date: Mon, 10 Feb 2014 12:20:47 +0800
From: "Organization @ Interspeech 2014" <organization at>
Message-ID: <52F8539F.8050905 at>

--- September 14-18, 2014

INTERSPEECH is the world's largest and most comprehensive conference on
issues surrounding the science and technology of spoken language
processing, both in humans and in machines.
The theme of INTERSPEECH 2014 is

--- Celebrating the Diversity of Spoken Languages ---

INTERSPEECH 2014 includes a number of special sessions covering
interdisciplinary topics and/or important new emerging areas of interest
related to the main conference topics.
Special sessions proposed for the forthcoming edition are:

- A Re-evaluation of Robustness
- Deep Neural Networks for Speech Generation and Synthesis
- Exploring the Rich Information of Speech Across Multiple Languages
- INTERSPEECH 2014 Computational Paralinguistics ChallengE (ComParE)
- Multichannel Processing for Distant Speech Recognition
- Open Domain Situated Conversational Interaction
- Phase Importance in Speech Processing Applications
- Speaker Comparison for Forensic and Investigative Applications
- Text-dependent for Short-duration Speaker Verification
- Tutorial Dialogues and Spoken Dialogue Systems
- Visual Speech Decoding

A description of each special session is given below.
For paper submission, please follow the main conference procedure and
choose the Special Session track when selecting your paper area.

Paper submission procedure is described at:

For more information, feel free to contact the Special Session Chair,
Dr. Tomi H. Kinnunen, at email tkinnu [at]

Special Session Description

A Re-evaluation of Robustness

The goal of the session is to facilitate a re-evaluation of robust
speech recognition in the light of recent developments. It’s a
re-evaluation at two levels:

- A re-evaluation in perspective, brought about by breakthroughs in
  performance obtained by Deep Neural Networks, which leads to a fresh
  questioning of the role and contribution of robust feature extraction.

- A literal re-evaluation on common databases to be able to present and
  compare performances of different algorithms and system approaches to

Paper submissions are invited on the theme of noise robust speech
recognition and are required to include results on the Aurora 4 database to
facilitate cross comparison of the performance between different

Recent developments raise interesting research questions that the
session aims to help progress by bringing focus to, and exploration of,
these issues. For example:

1. What role is there for signal processing to create feature
   representations to use as inputs to Deep Learning or can deep
   learning do all the work?
2. What feature representations can be automatically learnt in a deep
   learning architecture?
3. What other techniques can give great improvement in robustness?
4. What techniques don’t work and why?

The session organizers wish to encourage submissions that bring insight
and understanding to the issues highlighted above. Authors are requested
not only to present absolute performance of the whole system but also to
highlight the contribution made by various components in a complex

Papers that are accepted for the session are encouraged to also evaluate
their techniques on new test data sets (available in July) and submit
their results at the end of August.

Session organization
The session will be structured as a combination of
1. Invited talks
2. Oral paper presentations
3. Poster presentations
4. Summary of contributions and results on newly released test sets
5. Discussion

David Pearce, Audience, dpearce [at]
Hans-Guenter Hirsch, Niederrhein University of Applied Sciences, 
hans-guenter.hirsch [at]
Reinhold Haeb-Umbach, University of Paderborn, haeb [at]
Michael Seltzer, Microsoft, mseltzer [at]
Keikichi Hirose, The University of Tokyo, hirose [at]
Steve Renals, University of Edinburgh, s.renals [at]
Sim Khe Chai, National University of Singapore, simkc [at]
Niko Moritz, Fraunhofer IDMT, Oldenburg, niko.moritz [at]
K K Chin, Google, kkchin [at]

Deep Neural Networks for Speech Generation and Synthesis

This special session aims to bring together researchers who work
actively on deep neural networks for speech research, particularly in
generation and synthesis, to promote and better understand
state-of-the-art DNN research in statistical learning and to compare
results with the parametric HMM-GMM model, which is well established for
speech synthesis, generation, and conversion. A DNN, with its neuron-like
structure, can simulate the human speech production system in a layered,
hierarchical, nonlinear and self-organized network.  It can transform
linguistic text information into intermediate semantic, phonetic and
prosodic content and finally generate speech waveforms. Many possible
neural network architectures or topologies exist, e.g. feed-forward NNs
with multiple hidden layers, stacked RBMs or CRBMs, and Recurrent Neural
Nets (RNNs), which have been used for speech/image recognition and other
applications.  We would like to use this special session as a forum to
present updated results at the research frontiers, in algorithm
development and in application scenarios. Particular focus areas will be
parametric TTS synthesis, voice conversion, speech compression,
de-noising and speech enhancement.

Yao Qian, Microsoft Research Asia, yaoqian [at]
Frank K. Soong, Microsoft Research Asia, frankkps [at]

Exploring the Rich Information of Speech Across Multiple Languages

Spoken language is the most direct means of communication between human
beings. However, speech communication often demonstrates its
language-specific characteristics because of, for instance, the
linguistic difference (e.g., tonal vs. non-tonal, monosyllabic
vs. multisyllabic) across languages. Our knowledge of the diversity of
speech science across languages, including speech perception and
linguistic and non-linguistic (e.g., emotion) information, is still
limited.  This knowledge is of great significance for the design of
language-specific applications of speech techniques (e.g., automatic
speech recognition, assistive hearing devices) in the future.  This
special session will provide an opportunity for researchers from various
communities (including speech science, medicine, linguistics and signal
processing) to stimulate further discussion and new research in the
broad cross-language area, and present their latest research on
understanding the language-specific features of speech science and their
applications in the speech communication of machines and human
beings. This special session encourages contributions from all fields of
speech science, e.g., production and perception, but with a focus on
presenting language-specific characteristics and discussing their
implications, to improve our knowledge of the diversity of speech
science across multiple languages. Topics of interest include, but are
not limited to: 1. characteristics of acoustic, linguistic and language
information in speech communication across multiple languages;
2. diversity of linguistic and non-linguistic (e.g., emotion)
information among multiple spoken languages; 3. language-specific speech
intelligibility enhancement and automatic speech recognition techniques;
and 4. comparative cross-language assessment of speech perception in
challenging environments.

Junfeng Li, Institute of Acoustics, Chinese Academy of Sciences, [at]
Fei Chen, The University of Hong Kong, feichen1 [at]

INTERSPEECH 2014 Computational Paralinguistics ChallengE (ComParE)

The INTERSPEECH 2014 Computational Paralinguistics ChallengE (ComParE)
is an open Challenge dealing with speaker characteristics as manifested
in their speech signal's acoustic properties.  This year, it introduces
new tasks by the Cognitive Load Sub-Challenge, the Physical Load
Sub-Challenge, and a Multitask Sub-Challenge: For these Challenge tasks,
audio corpus (ADES) with high diversity of speakers and different
languages covered (Australian English and German) are provided by the
organizers.  All corpora provide fully realistic data in challenging
acoustic conditions and feature rich annotation such as speaker
meta-data. They are given with distinct definitions of test,
development, and training partitions, incorporating speaker independence
as needed in most real-life settings. Benchmark results of the most
popular approaches are provided as in the years before.  Transcription
of the train and development sets will be known. All Sub-Challenges
allow contributors to find their own features with their own machine
learning algorithm.  However, a standard feature set will be provided
per corpus that may be used. Participants will have to stick to the
definition of training, development, and test sets. They may report on
results obtained on the development set, but have only five trials to
upload their results on the test sets, whose labels are unknown to them.
Each participation will be accompanied by a paper presenting the results
that undergoes peer-review and has to be accepted for the conference in
order to participate in the Challenge.
The results of the Challenge will be presented in a Special Session at
INTERSPEECH 2014 in Singapore.  Further, contributions using the
Challenge data or related to the Challenge but not competing within the
Challenge are also welcome.

More information is given also on the Challenge homepage:

Björn Schuller, Imperial College London / Technische Universität 
München, schuller [at]
Stefan Steidl, Friedrich-Alexander-University, stefan.steidl [at]
Anton Batliner, Technische Universität München / 
batliner [at]
Jarek Krajewski, Bergische Universität Wuppertal, krajewsk 
Julien Epps, The University of New South Wales / National ICT Australia, 
j.epps [at]

Multichannel Processing for Distant Speech Recognition

Distant speech recognition in real-world environments is still a
challenging problem: reverberation and dynamic background noise
represent major sources of acoustic mismatch that heavily decrease ASR
performance, which, by contrast, can be very good in close-talking
microphone setups.  In this context, a particularly interesting topic is
the adoption of distributed microphones for the development of
voice-enabled automated home environments based on distant-speech
interaction: microphones are installed in different rooms and the
resulting multichannel audio recordings capture multiple audio events,
including voice commands or spontaneous speech, generated in various
locations and characterized by a variable amount of reverberation as
well as possible background noise.

The focus of the proposed special session will be on multichannel
processing for automatic speech recognition (ASR) in such a
setting. Unlike other robust ASR tasks, where static adaptation or
training with noisy data appreciably improves performance, the
distributed microphone scenario requires full exploitation of
multichannel information to reduce the highly variable dynamic
mismatch. To facilitate better evaluation of the proposed algorithms the
organizers will provide a set of multichannel recordings in a domestic
environment.  The recordings will include spoken commands mixed with
other acoustic events occurring in different rooms of a real apartment.
The data is being created within the EC project DIRHA (Distant
speech Interaction for Robust Home Applications) which addresses the
challenges of speech interaction for home automation.

The organizers will release the evaluation package (datasets and
scripts) on February 17; the participants are asked to submit a regular
paper reporting speech recognition results on the evaluation set and
comparing their performance with the provided reference baseline.

Further details are available at:

Marco Matassoni, Fondazione Bruno Kessler, matasso [at]
Ramon Fernandez Astudillo, Instituto de Engenharia de Sistemas e 
Computadores, ramon.astudillo [at]
Athanasios Katsamanis, National Technical University of Athens, nkatsam

Open Domain Situated Conversational Interaction

Robust conversational systems have the potential to revolutionize our
interactions with computers.  Building on decades of academic and
industrial research, we now talk to our computers, phones, and
entertainment systems on a daily basis. However, current technology
typically limits conversational interactions to a few narrow
domains/topics (e.g., weather, traffic, restaurants). Users increasingly
want the ability to converse with their devices over broad web-scale
content.  Finding something on your PC or the web should be as simple as
having a conversation.  A promising approach to address this problem is
situated conversational interaction. The approach leverages the
situation and/or context of the conversation to improve system accuracy
and effectiveness.  Sources of context include visual content being
displayed to the user, Geo-location, prior interactions, multi-modal
interactions (e.g., gesture, eye gaze), and the conversation itself. For
example, while a user is reading a news article on their tablet PC, they
initiate a conversation to dig deeper on a particular topic.  Or a user
is reading a map and wants to learn more about the history of events at
mile marker 121.  Or a gamer wants to interact with a game’s characters
to find the next clue in their quest.
All of these interactions are situated – rich context is available to
the system as a source of priors/constraints on what the user is likely
to say.
This special session will provide a forum to discuss research progress
in open domain situated conversational interactions.
Topics of the session will include:
- Situated context in spoken dialog systems
- Visual/dialog/personal/geo situated context
- Inferred context through interpretation and reasoning
- Open domain spoken dialog systems
- Open domain spoken/natural language understanding and generation
- Open domain semantic interpretation
- Open domain dialog management (large-scale belief state/policy)
- Conversational Interactions
- Multi-modal inputs in situated open domains (speech/text + gesture,
  touch, eye gaze)
- Multi-human situated interactions

Larry Heck, Microsoft Research, larry [at]
Dilek Hakkani-Tür, Microsoft Research, dilek [at]
Gokhan Tur, Microsoft Research, gokhan [at]
Steve Young, Cambridge University, sjy [at]

Phase Importance in Speech Processing Applications

Over the past decades, the amplitude of the speech spectrum has been
considered the most important feature in different speech processing
applications, and the phase of the speech signal has received less
attention. Recently, several findings have demonstrated the importance
of phase in the speech and audio processing communities.  The importance of phase
estimation along with amplitude estimation in speech enhancement,
complementary phase-based features in speech and speaker recognition and
phase-aware acoustic modeling of environment are the most prominent
reported works scattered in different communities of speech and audio
processing. These examples suggest that incorporating the phase
information can push the limits of state-of-the-art phase-independent
solutions employed for long in different aspects of audio and speech
signal processing. This Special Session aims to explore the recent
advances and methodologies to exploit the knowledge of signal phase
information in different aspects of speech processing. Without a
dedicated effort to bring together researchers from different
communities, rapid progress in investigating the usefulness of phase in
speech processing applications is difficult to achieve. Therefore, as a
first step in this direction, we aim to promote "phase-aware speech and
audio signal processing" and to form a community of researchers to
organize the next steps.  Our initiative is to unify these efforts to
better understand the pros and cons of using phase and the degree of
feasibility of phase estimation/enhancement in different areas of
speech processing, including speech enhancement, speech separation,
speech quality estimation, speech and speaker recognition, voice
transformation, and speech analysis and synthesis. The goal is to
promote phase-based signal processing, study its importance, and share
interesting findings from different speech processing applications.

Pejman Mowlaee, Graz University of Technology, pejman.mowlaee [at]
Rahim Saeidi, University of Eastern Finland, rahim.saeidi [at]
Yannis Stylianou, Toshiba Labs Cambridge UK / University of Crete, 
yannis [at]

Speaker Comparison for Forensic and Investigative Applications

In speaker comparison, speech/voice samples are compared by humans
and/or machines for use in investigation or in court to address
questions that are of interest to the legal system.  Speaker comparison
is a high-stakes application that can change people’s lives and it
demands the best that science has to offer; however, methods, processes,
and practices vary widely.  These variations are not necessarily for the
better and though recognized, are not generally appreciated and acted
upon. Methods, processes, and practices grounded in science are critical
for the proper application (and non-application) of speaker comparison
to a variety of international investigative and forensic applications.
This special session will contribute to scientific progress through 1)
understanding speaker comparison for investigative and forensic
application (e.g., describe what is currently being done and critically
analyze performance and lessons learned); 2) improving speaker
comparison for investigative and forensic applications (e.g., propose
new approaches/techniques, understand the limitations, and identify
challenges and opportunities); 3) improving communications between
communities of researchers, legal scholars, and practitioners
internationally (e.g., directly address some central legal, policy, and
societal questions such as allowing speaker comparisons in court,
requirements for expert witnesses, and requirements for specific
automatic or human-based methods to be considered scientific); 4) using
best practices (e.g., reduction of bias and presentation of evidence);
5) developing a roadmap for progress in this session and future
sessions; and 6) producing a documented contribution to the field. Some
of these objectives will need multiple sessions to fully achieve and
some are complicated due to differing legal systems and cultures.  This
special session builds on previous successful special sessions and
tutorials in forensic applications of speaker comparison at INTERSPEECH
beginning in 2003. Wide international participation is planned,
including researchers from the ISCA SIGs for the Association Francophone
de la Communication Parlée (AFCP) and the Speaker and Language
Characterization (SpLC).

Joseph P. Campbell, PhD, MIT Lincoln Laboratory, jpc [at]
Jean-François Bonastre, l'Université d'Avignon, jean-francois.bonastre 

Text-dependent for Short-duration Speaker Verification

In recent years, speaker verification engines have reached maturity and
have been deployed in commercial applications. Ergonomics of such
applications is especially demanding and imposes a drastic limitation in
terms of speech duration during authentication.  A well known tactic to
address the problem of lack of data, due to short duration, is using
text-dependency. However, recent breakthroughs achieved in the context
of text-independent speaker verification in terms of accuracy and
robustness do not benefit text-dependent applications. Indeed, large
development data required by the recent approaches is not available in
the text-dependent context. The purpose of this special session is to
gather the research efforts from both academia and industry toward a
common goal of establishing a new baseline and exploring new directions
for text-dependent speaker verification.  The focus of the session is on
robustness with respect to duration and modeling of lexical information.
To support the development and evaluation of text-dependent speaker
verification technologies, the Institute for Infocomm Research (I2R) has
recently released the RSR2015 database, including 150 hours of data
recorded from 300 speakers. Papers submitted to the special session
are encouraged to provide results based on, but not limited to, the
RSR2015 database in order to enable comparison of algorithms and methods. For
this purpose, the organizers strongly encourage the participants to
report performance on the protocol delivered with the database in terms
of EER and minimum cost (in the sense of NIST 2008 Speaker Recognition
To get the database, please contact the organizers.

Further details are available at:

Anthony LARCHER (alarcher [at] Institute for Infocomm 
Hagai ARONOWITZ (hagaia [at] IBM Research – Haifa
Kong Aik LEE (kalee [at] Institute for Infocomm Research
Patrick KENNY (patrick.kenny [at] CRIM – Montréal

Tutorial Dialogues and Spoken Dialogue Systems

The growing interest in educational applications that use spoken
interaction and dialogue technology has boosted research and development
of interactive tutorial systems, and over recent years, advances
have been achieved in both the spoken dialogue and education
research communities, with sophisticated speech and multi-modal technology
which allows functionally suitable and reasonably robust applications to
be built.

The special session combines spoken dialogue research, interaction
modeling, and educational applications, and brings together the two
INTERSPEECH SIG communities: SLaTE and SIGdial. The session focuses on
methods, problems and challenges that are shared by both communities,
such as sophistication of speech processing and dialogue management for
educational interaction, integration of the models with theories of
emotion, rapport, and mutual understanding, as well as application of
the techniques to novel learning environments, robot interaction,
etc. The session aims to survey issues related to the processing of
spoken language in various learning situations, modeling of the
teacher-student interaction in MOOC-like environments, as well as
evaluating tutorial dialogue systems from the point of view of natural
interaction, technological robustness, and learning outcome.  The
session encourages interdisciplinary research and submissions related to
the special focus of the conference, "Celebrating the Diversity of
Spoken Languages".

For further information click

Maxine Eskenazi, max+ [at]
Kristiina Jokinen, kristiina.jokinen [at]
Diane Litman, litman [at]
Martin Russell, M.J.RUSSELL [at]

Visual Speech Decoding

Speech perception is a bi-modal process that takes into account both the
acoustic (what we hear) and visual (what we see) speech information. It
has been widely acknowledged that visual cues play a critical role in
automatic speech recognition (ASR), especially when audio is corrupted
by, for example, background noise or voices from untargeted speakers, or
is even inaccessible.  Decoding visual speech is critically important for
ASR technologies to be widely deployed and to realize truly natural
human-computer interactions. Despite the advances in acoustic ASR,
visual speech decoding remains a challenging problem.  The special
session aims to attract more effort to tackle this important problem. In
particular, we would like to encourage researchers to focus on some
critical questions in the area.
As a starting point, we propose the following four questions:
1. How to deal with the speaker dependency in visual speech data?
2. How to cope with the head-pose variation?
3. How to encode temporal information in visual features?
4. How to automatically adapt the fusion rule when the quality of the
   two individual (audio and visual) modalities varies?

Researchers and participants are encouraged to raise more questions
related to visual speech decoding.
We expect the session to draw a wide range of attention from both the
speech recognition and machine vision communities to the problem of
visual speech decoding.

Ziheng Zhou, University of Oulu, ziheng.zhou [at]
Matti Pietikäinen, University of Oulu, matti.pietikainen [at]
Guoying Zhao, University of Oulu, gyzhao [at]
