Conf: INTERSPEECH 2014 Tutorials, Singapore, September 14-18, 2014

Thierry Hamon hamon at LIMSI.FR
Thu Jun 5 10:03:15 UTC 2014

Date: Mon, 02 Jun 2014 17:55:53 +0800
From: "Organization @ Interspeech 2014" <organization at>

The INTERSPEECH 2014 Organising Committee is pleased to announce the
following 8 tutorials, presented by distinguished speakers at the
conference. They will be offered on Sunday, 14 September 2014.  All
tutorials will be three (3) hours long and require an additional
registration fee (separate from the conference registration fee).

  - Non-speech acoustic event detection and classification
  - Contribution of MRI to Exploring and Modeling Speech Production
  - Computational Models for Audiovisual Emotion Perception
  - The Art and Science of Speech Feature Engineering
  - Recent Advances in Speaker Diarization
  - Multimodal Speech Recognition with the AusTalk 3D Audio-Visual
  - Semantic Web and Linked Big Data Resources for Spoken Language
  - Speech and Audio for Multimedia Semantics


Additionally, the ISCSLP 2014 Organising Committee welcomes the
INTERSPEECH 2014 delegates to join the 4 ISCSLP tutorials which will be
offered on Saturday, 13 September 2014.

  - Adaptation Techniques for Statistical Speech Recognition
  - Emotion and Mental State Recognition: Features, Models, System
    Applications and Beyond
  - Unsupervised Speech and Language Processing via Topic Models
  - Deep Learning for Speech Generation and Synthesis

More information available at:

Tutorials Description

T1: Non-speech acoustic event detection and classification

     Research in audio signal processing has been dominated by
     speech research, but most of the sounds in our real-life
     environments are actually non-speech events such as cars passing
     by, wind, warning beeps, and animal sounds. These acoustic events
     contain much information about the environment and physical events
     that take place in it, enabling novel application areas such as
     safety, health monitoring and investigation of biodiversity.  But
     while recent years have seen widespread adoption of applications
     such as speech recognition and song recognition, generic computer
     audition is still in its infancy.

     Non-speech acoustic events have several fundamental differences to
     speech, but many of the core algorithms used by speech researchers
     can be leveraged for generic audio analysis. The tutorial is a
     comprehensive review of the field of acoustic event detection as it
     currently stands. The goal of the tutorial is to foster interest in
     the community, highlight the challenges and opportunities and
     provide a starting point for new researchers. We will discuss what
     acoustic event detection entails and its commonalities and differences
     with speech processing, such as the large variation in sounds and
     the possible overlap with other sounds. We will then discuss basic
     experimental and algorithm design, including descriptions of
     available databases and machine learning methods. We will then
     discuss more advanced topics such as methods to deal with
     temporally overlapping sounds and modelling the relations between
     sounds. We will finish with a discussion of avenues for future
     research.

     Organizers: Tuomas Virtanen and Jort F. Gemmeke

T2: Contribution of MRI to Exploring and Modeling Speech Production

     Magnetic resonance imaging (MRI) gives us a remarkable view into
     the human body in various ways, with not only static imaging but
     also motion imaging. MRI has been a powerful technique in speech
     research for studying the finer anatomy of the speech organs or
     for visualizing true vocal tracts in three dimensions. Inherent
     problems of slow image acquisition for speech tasks and
     insufficient signal-to-noise ratio for microscopic observation
     have forced researchers to search for task-specific imaging
     techniques.  Recent advances in 3-Tesla technology suggest more
     practical solutions for broader applications of MRI by overcoming
     previous technical limitations. In this joint tutorial in two parts, we
     summarize our previous effort to accumulate scientific knowledge
     with MRI and to advance speech modeling studies for future
     development. Part 1, given by Kiyoshi Honda, introduces how to
     visualize the speech organs and vocal tracts by presenting
     techniques and data for finer static imaging, synchronized motion
     imaging, surface marker tracking, real-time imaging, and
     vocal-tract mechanical modeling. Part 2, presented by Jianwu Dang,
     focuses on applications of MRI for phonetics of Mandarin vowels,
     acoustics of the vocal tracts with side branches, analysis and
     simulation in search of talker characteristics, physiological
     modeling of the articulatory system, and motor control paradigm for
     speech articulation.

     Organizers: Kiyoshi HONDA and Jianwu DANG

T3: Computational Models for Audiovisual Emotion Perception

     In this tutorial we will explore engineering approaches to
     understanding human emotion perception, focusing both on modeling
     and application. We will highlight both current and historical
     trends in emotion perception modeling, focusing on both
     psychological and engineering-driven theories of perception
     (statistical analyses, data-driven computational modeling, and
     implicit sensing). The importance of this topic can be appreciated
     from both an engineering and a psychological viewpoint. From the
     engineering viewpoint, any system that either models human behavior
     or interacts with human partners must understand emotion
     perception, as it fundamentally underlies and modulates our
     communication. From the psychological perspective, emotion
     perception is also used in the diagnosis of many mental health
     conditions and is tracked in therapeutic interventions. Research in
     emotion perception seeks to identify models that describe the felt
     sense of ‘typical’ emotion expression – i.e., an
     observer/evaluator’s attribution of the emotional state of the
     speaker. This felt sense is a function of the methods through which
     individuals integrate the presented multimodal emotional
     information.  We will cover psychological theories of emotion,
     engineering models of emotion, and experimental approaches to
     measure emotion. We will demonstrate how these modeling strategies
     can be used as a component of emotion classification frameworks and
     how they can be used to inform the design of emotional behaviors.

     Organizers: Emily Mower Provost and Carlos Busso

T4: The Art and Science of Speech Feature Engineering

     With significant advances in mobile technology and audio sensing
     devices, there is a fundamental need to describe vast amounts of
     audio data in terms of representative lower-dimensional
     descriptors for efficient automatic processing. The extraction of
     these signal representations, also called features, constitutes the
     first step in processing a speech signal. The art and science of
     feature engineering relates to addressing the two inherent
     challenges - extracting sufficient information from the speech
     signal for the task at hand and suppressing the unwanted
     redundancies for computational efficiency and robustness. The area
     of speech feature extraction combines a wide variety of disciplines
     like signal processing, machine learning, psychophysics,
     information theory, linguistics and physiology.  It has a rich
     history spanning more than five decades and has seen tremendous
     advances in the last few years. This has propelled the transition
     of speech technology from controlled environments to millions
     of end-user applications.

     In this tutorial, we review the evolution of speech feature
     processing methods, summarize the recent advances of the last two
     decades and provide insights into the future of feature
     engineering. This will include discussions of the spectral
     representation methods developed in the past, human auditory
     motivated techniques for robust speech processing, data-driven
     unsupervised features like i-vectors and recent advances in deep
     neural network based techniques. With experimental results, we will
     also illustrate the impact of these features on various
     state-of-the-art speech processing systems. The future of speech
     signal processing will need to address various robustness issues in
     complex acoustic environments while being able to derive useful
     information from big data.

     Organizers: Sriram Ganapathy and Samuel Thomas

T5: Recent Advances in Speaker Diarization

     The tutorial will start with an introduction to speaker diarization
     giving a general overview of the subject. Afterwards, we will cover
     the basic background including feature extraction, and common
     modeling techniques such as GMMs and HMMs. Then, we will discuss
     the first processing step usually done in speaker diarization, which
     is voice activity detection. We will subsequently describe the
     classic approaches for speaker diarization which are widely used
     today. We will then introduce state-of-the-art techniques in
     speaker recognition required to understand modern speaker
     diarization techniques.  Following this, we will describe approaches for
     speaker diarization using advanced representation methods
     (supervectors, speaker factors, i-vectors) and we will describe
     supervised and unsupervised learning techniques used for speaker
     diarization. We will also discuss issues such as coping with an
     unknown number of speakers, detecting and dealing with overlapping
     speech, diarization confidence estimation, and online speaker
     diarization.  Finally, we will discuss two recent works. The first
     is exploiting a priori acoustic information (such as processing a
     meeting when some of the participants are known in advance to the
     system, and training data is available for them). The second is
     modeling speaker-turn dynamics. If time permits, we will also
     discuss concepts such as multi-modal diarization and using TDOA
     (time difference of arrival) for diarization of meetings.

     Organizers: Hagai Aronowitz

T6: Multimodal Speech Recognition with the AusTalk 3D Audio-Visual

     This tutorial will give attendees a brief overview of 3D-based
     audio-visual speech recognition (AVSR) research. Attendees will
     learn how to use the newly developed 3D-based audio-visual data
     corpus we derived from the AusTalk corpus for audio-visual
     speech/speaker recognition. In addition, we plan to introduce
     some results using this newly developed 3D audio-visual data
     corpus, which show a significant increase in speech recognition
     accuracy from integrating both depth-level and grey-level visual
     features. In the first part of the tutorial, we will review some
     recent works published in the last decade, so that attendees can
     obtain an overview of the fundamental concepts and challenges in
     this field. In the second part of the tutorial, we will briefly
     describe the recording protocol and contents of the 3D data corpus,
     and show attendees how to use this corpus for their own
     research. In the third part of this tutorial, we will present our
     results using the 3D data corpus. The experimental results show
     that, compared with the conventional AVSR based on the audio and
     grey-level visual features, the integration of grey and depth
     visual information can boost the AVSR accuracy
     significantly. Moreover, we will also experimentally explain why
     adding depth information can benefit the standard AVSR
     systems. Ultimately, through our tutorial, we hope to inspire
     more researchers in the community to contribute to this exciting
     field.
     Organizers: Roberto Togneri, Mohammed Bennamoun and Chao (Luke) Sui

T7: Semantic Web and Linked Big Data Resources for Spoken Language

     State-of-the-art statistical spoken language processing typically
     requires significant manual effort to construct domain-specific
     schemas (ontologies) as well as manual effort to annotate training
     data against these schemas. At the same time, a recent surge of
     activity and progress on semantic web-related concepts from the
     large search-engine companies represents a potential alternative to
     the manually intensive design of spoken language processing
     systems. Standards have been established for
     schemas (ontologies) that webmasters can use to semantically and
     uniformly mark up their web pages.  Search engines like Bing,
     Google, and Yandex have adopted these standards and are leveraging
     them to create semantic search engines at the scale of the web. As
     a result, the open linked data resources and semantic graphs
     covering various domains (such as Freebase [3]) have grown
     massively every year and contain far more information than any
     single resource anywhere on the Web.  Furthermore, these resources
     contain links to text data (such as Wikipedia pages) related to the
     knowledge in the graph.

     Recently, several studies on speech and language processing have started
     exploiting these massive linked data resources for language
     modeling and spoken language understanding. This tutorial will
     include a brief introduction to the semantic web and the linked
     data structure, available resources, and query languages.  An
     overview of related work on information extraction and language
     processing will be presented, where the main focus will be on
     methods for learning spoken language understanding models from
     these resources.

     Organizers: Dilek Hakkani-Tür and Larry Heck

T8: Speech and Audio for Multimedia Semantics

     Internet media sharing sites and the one-click upload capability of
     smartphones are producing a deluge of multimedia content. While
     visual features are often dominant in such material, acoustic and
     speech information in particular often complements it.  By
     facilitating access to large amounts of data, the text-based
     Internet gave a huge boost to the field of natural language
     processing. The vast amount of consumer-produced video becoming
     available now will do the same for video processing, eventually
     enabling semantic understanding of multimedia material, with
     implications for human computer interaction, robotics, etc.

     Large-scale multi-modal analysis of audio-visual material is now
     central to a number of multi-site research projects around the
     world. While each of these has slightly different targets, they
     are facing largely the same challenges: how to robustly and
     efficiently process large amounts of data, how to represent and
     then fuse information across modalities, how to train classifiers
     and segmenters on unlabeled data, and how to include human feedback.

     In this tutorial, we will present the state of the art in
     large-scale video, speech, and non-speech audio processing, and
     show how these approaches are being applied to tasks such as
     content-based video retrieval (CBVR) and multimedia event detection
     (MED). We will introduce the most important tools and techniques,
     and show how the combination of information across modalities can
     be used to induce semantics on multimedia material through ranking
     of information and fusion.  Finally, we will discuss opportunities
     for research that the INTERSPEECH community specifically will find
     interesting and fertile.

     Organizers: Florian Metze and Koichi Shinoda

ISCSLP Tutorials @ INTERSPEECH 2014 Description

ISCSLP-T1: Adaptation Techniques for Statistical Speech Recognition

     Adaptation is a technique to make better use of existing models for
     test data from new acoustic or linguistic conditions. It is an
     important and challenging research area of statistical speech
     recognition. This tutorial gives a systematic review of fundamental
     theories as well as an introduction to state-of-the-art adaptation
     techniques. It includes both acoustic and language model
     adaptation. Following a simple example of acoustic model
     adaptation, basic concepts, procedures and categories of adaptation
     will be introduced. Then, a number of advanced adaptation
     techniques will be discussed, such as discriminative adaptation,
     Deep Neural Network adaptation, adaptive training, relationship to
     noise robustness, etc. After the detailed review of acoustic model
     adaptation, an introduction to language model adaptation, such as
     topic adaptation, will also be given. The whole tutorial is then
     summarised and future research directions will be discussed.

     Organizers: Kai Yu

ISCSLP-T2: Emotion and Mental State Recognition: Features, Models,
           System Applications and Beyond

     Emotion recognition is the ability to identify what you are feeling
     from moment to moment and to understand the connection between your
     feelings and your expressions. In today’s world, the human-computer
     interaction (HCI) interface undoubtedly plays an important role in
     our daily life. Toward harmonious HCI interfaces, automated
     analysis and recognition of human emotion has attracted increasing
     attention from researchers in multidisciplinary research fields. A
     specific area of current interest that also has key implications
     for HCI is the estimation of cognitive load (mental workload),
     research into which is still at an early stage. Technologies for
     processing daily activities including speech, text and music have
     expanded the interaction modalities between humans and computer-
     supported communicational artifacts.

     In this tutorial, we will present theoretical and practical work
     offering new and broad views of the latest research in emotional
     awareness from audio and speech. We discuss several topics spanning
     a variety of theoretical backgrounds and applications, ranging from
     salient emotional features, emotional-cognitive models,
     compensation methods for variability due to speaker and linguistic
     content, to machine learning approaches applicable to emotion
     recognition. For each topic, we will review the state of the art by
     introducing current methods and presenting several applications. In
     particular, the application to cognitive load estimation will be
     discussed, from its psychophysiological origins to system design
     considerations.  Eventually, technologies developed in different
     areas will be combined for future applications, so in addition to a
     survey of future research challenges, we will envision a few
     scenarios in which affective computing can make a difference.

     Organizers: Chung-Hsien Wu, Hsin-Min Wang, Julien Epps and
                 Vidhyasaharan Sethu

ISCSLP-T3: Unsupervised Speech and Language Processing via Topic Models

     In this tutorial, we will present state-of-the-art machine learning
     approaches for speech and language processing, with a focus on
     unsupervised methods for structural learning from unlabeled
     sequential patterns. In general, speech and language processing
     involves extensive knowledge of statistical models. We need to
     design a flexible, scalable and robust system to cope with
     heterogeneous and non-stationary environments in the era of big
     data. This tutorial starts with an introduction to unsupervised
     speech and language processing based on factor analysis and
     independent component analysis. The unsupervised learning is
     generalized to a latent variable model which is known as the topic
     model. The evolution of topic models from latent semantic analysis
     to hierarchical Dirichlet process, from non-Bayesian parametric
     models to Bayesian nonparametric models, and from single-layer
     model to hierarchical tree model shall be surveyed in an organized
     fashion. Inference approaches based on variational Bayesian methods
     and Gibbs sampling are introduced. We will also present several case
     studies on topic modeling for speech and language applications
     including language model, document model, retrieval model,
     segmentation model and summarization model. Finally, we will point
     out new trends in topic models for speech and language processing.

     Organizers: Jen-Tzung Chien

ISCSLP-T4: Deep Learning for Speech Generation and Synthesis

     Deep learning, which can represent high-level abstractions in data
     with an architecture of multiple non-linear transformations, has
     made a huge impact on automatic speech recognition (ASR) research,
     products and services. However, deep learning for speech generation
     and synthesis (i.e., text-to-speech), which is an inverse process
     of speech recognition (i.e., speech-to-text), has not yet generated
     momentum similar to that in ASR.  Recently, motivated by the
     success of Deep Neural Networks in speech recognition, some neural
     network based approaches have been applied successfully to
     improving the performance of statistical parametric speech
     generation/synthesis. In this tutorial, we focus on deep learning
     approaches to the problems in speech generation and synthesis,
     especially on Text-to-Speech (TTS) synthesis and voice conversion.

     First, we review the current mainstream of statistical parametric
     speech generation and synthesis, namely GMM-HMM based speech
     synthesis and GMM-based voice conversion, with emphasis
     on analyzing the major factors responsible for the quality problems
     in GMM-based voice synthesis/conversion and the intrinsic
     limitations of decision-tree based contextual state clustering
     and state-based statistical distribution modeling. We then present
     the latest deep learning algorithms for feature parameter
     trajectory generation, in contrast to deep learning for recognition
     or classification. We cover common technologies in Deep Neural
     Network (DNN) and improved DNN: Mixture Density Networks (MDN),
     Recurrent Neural Networks (RNN) with Bidirectional Long Short Term
     Memory (BLSTM) and Conditional RBM (CRBM). Finally, we share our
     research insights and hands-on experience in building speech
     generation and synthesis systems based upon deep learning.

     Organizers: Yao Qian and Frank K. Soong
