« Crossroads of Speech and Language »


The 20th edition of the INTERSPEECH conference introduces a range of new presentation formats. Given the complexity of speech communication science and technology, the need for detailed technical reviews of research sub-areas has become more critical than ever.

We invited proposals for innovative and engaging Research Survey Presentations. The talks are scheduled at the start of suitable oral presentation sessions in the Main Hall (without plenary status, since the usual parallel tracks run continuously), and each is allocated a 40-minute time slot covering presentation and discussion. Each presentation aims to give an overview of the state of the art for a specific topic covered by one or more of the main technical areas of INTERSPEECH 2019. We are proud to present a series of 10 excellent surveys in our program!

Modeling in automatic speech recognition: beyond Hidden Markov Models
Ralf Schlüter (RWTH Aachen University, Aachen, Germany)
Monday, 16 September, 1100–1140, Main Hall

The general architecture and modeling of the state-of-the-art statistical approach to automatic speech recognition (ASR) have not been challenged significantly for decades. The classical statistical approach to ASR is based on the Bayes decision rule, a separation of acoustic and language modeling, hidden Markov modeling (HMM), and a search organization based on dynamic programming and hypothesis pruning methods. Even when artificial neural networks for acoustic modeling and language modeling started to boost ASR performance considerably, the general architecture of state-of-the-art ASR systems did not change much. The hybrid deep neural network (DNN)/HMM approach, together with recurrent long short-term memory (LSTM) neural network language modeling, currently marks the state of the art on many tasks, covering a wide range of training set sizes. However, more and more alternative approaches are now emerging, moving gradually towards so-called end-to-end approaches. These novel end-to-end approaches replace explicit time alignment modeling and dedicated search space organization with more implicit, integrated neural-network-based representations, while also dropping the separation between acoustic and language modeling. They show promising results, especially with large training sets. In this presentation, an overview of current modeling approaches to ASR will be given, including variations of both HMM-based and end-to-end modeling.
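As a brief illustration of the classical decomposition mentioned in this abstract, the Bayes decision rule can be written as follows. The notation is an assumption chosen for this sketch and is not taken from the talk: X denotes the sequence of acoustic observations x_1, ..., x_T, W a candidate word sequence, and s_1, ..., s_T a hidden HMM state sequence.

```latex
% Bayes decision rule: choose the word sequence with the highest posterior.
% p(X) does not depend on W, so it can be dropped from the maximization.
\hat{W} = \operatorname*{arg\,max}_{W} \; p(W \mid X)
        = \operatorname*{arg\,max}_{W} \;
          \underbrace{p(X \mid W)}_{\text{acoustic model}} \;
          \underbrace{p(W)}_{\text{language model}}

% First-order HMM form of the acoustic model: sum over hidden state sequences.
p(X \mid W) = \sum_{s_1^T} \prod_{t=1}^{T}
              p(x_t \mid s_t, W) \; p(s_t \mid s_{t-1}, W)
```

The separate factors p(X | W) and p(W) correspond to the separation of acoustic and language modeling, while end-to-end approaches, as described in the abstract, model p(W | X) more directly without this explicit factorization.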

Ralf Schlüter works as academic director and senior lecturer in the Computer Science Department at RWTH Aachen University. He leads the automatic speech recognition group at the Lehrstuhl Informatik 6: Human Language Technology and Pattern Recognition. He studied physics at RWTH Aachen University, Germany, and Edinburgh University, UK. He received the Diplom degree in physics and the Dr. rer. nat. degree in computer science, and completed his habilitation in computer science, all at RWTH Aachen University. His research interests cover speech recognition in general, discriminative training, neural network modeling, information theory, stochastic modeling, and speech signal analysis.

When attention meets speech applications: speech & speaker recognition perspective
Kyu Han (ASAPP, Inc.)
Co-contributors: Ramon Prieto, Tao Ma
Monday, 16 September, 1430–1510, Main Hall

Attention lets neural layers focus on what is relevant to a given task while giving less weight to what is less important. Since its introduction in 2015 for machine translation, it has been successfully applied to speech applications in a number of different forms. This survey presents how attention mechanisms have been applied to speech and speaker recognition tasks. The attention mechanism was first applied to sequence-to-sequence speech recognition and later became a critical part of Google's well-known Listen, Attend and Spell ASR system. In the framework of hybrid DNN/HMM approaches or CTC-based ASR systems, attention mechanisms have recently started to gain traction in the form of self-attention. From a speaker recognition perspective, attention mechanisms have been used to improve the capability of representing speaker characteristics in neural outputs, mostly in the form of attentive pooling. In this survey we detail the attentive strategies that have been successful in both speech and speaker recognition tasks, and discuss challenging issues in practice.
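To make the idea of attentive pooling mentioned in this abstract concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the method of any particular system covered in the talk: the function names, the dot-product scoring, and the single `query` vector (which would be a learned parameter in practice) are all hypothetical choices made for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_pooling(frames, query):
    """Pool T frame-level vectors into one utterance-level vector.

    frames: list of T vectors, each of dimension D (frame-level features)
    query:  vector of dimension D (hypothetical; learned in a real model)
    Returns (pooled_vector, attention_weights).
    """
    # Score each frame by its dot product with the query vector.
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    # Weighted average of the frames: relevant frames contribute more.
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return pooled, weights


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    pooled, weights = attentive_pooling(frames, query=[2.0, 0.0])
    print(weights)  # frames with a large first component get more weight
    print(pooled)
```

With an all-zero query every frame receives the same weight, so the result reduces to plain average pooling; a non-zero query shifts the weight toward the frames it scores highly.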

Kyu Jeong Han received his PhD from USC in 2009 and is currently working for ASAPP, Inc. as a Principal Speech Scientist, focusing on deep learning technologies for speech applications. Dr. Han held research positions at IBM, Ford, Capio.ai (acquired by Twilio) and JD.com. He is also actively involved in the speech community, serving as a reviewer for IEEE and ISCA journals and conferences and, since 2019, as a member of the Speech and Language Processing Technical Committee of the IEEE SPS. In 2018, he won the ISCA Award for the Best Paper Published in Computer Speech & Language 2013-2017.

Spoken language translation
Jan Niehues (Maastricht University, Maastricht, the Netherlands)
Co-contributors: Sebastian Stüker, Marco Turchi, Matthias Sperber, Matteo Negri
Tuesday, 17 September, 1000–1040, Main Hall

We will start with an overview of the different use cases and difficulties of speech translation. Due to the wide range of possible applications, these systems differ in data, difficulty of the language and spontaneous effects. Furthermore, the interaction with humans has an important influence. In the main part of the talk, we will review state-of-the-art methods for building speech translation systems. We will start by reviewing the cascaded approach to spoken language translation, in which an automatic speech recognition system is followed by a machine translation system. We will highlight the challenges of combining both systems. In particular, techniques to adapt the system to the scenario will be reviewed. With the success of neural models in both areas, we see a rising research interest in end-to-end speech translation. While we see promising results with this approach, international evaluation campaigns like the Shared Task of the International Workshop on Spoken Language Translation (IWSLT) have shown that cascaded systems currently often still achieve better translation performance. We will highlight the main challenges of end-to-end speech translation. In the final part of the talk, we will review techniques that address key challenges of speech translation, e.g. latency, spontaneous effects, sentence segmentation and stream decoding.

Jan Niehues is an assistant professor at Maastricht University. He received his doctoral degree from Karlsruhe Institute of Technology in 2014 on the topic of “Domain Adaptation in Machine Translation”. He has conducted research at Carnegie Mellon University and LIMSI/CNRS, Paris. His research has covered different aspects of Machine Translation and Spoken Language Translation. He has been involved in several international projects on spoken language translation, e.g. the German-French project Quaero, the H2020 EU project QT21, EU-Bridge and Elitr. Currently, he is one of the main organizers of the spoken language track in the IWSLT shared task.

End-to-end deep neural network-based speaker and language recognition
Ming Li (Data Science Research Center, Duke Kunshan University, China)
Co-contributors: Weicheng Cai, Danwei Cai
Tuesday, 17 September, 1330–1410, Main Hall

The speech signal contains not only lexical information but also various kinds of paralinguistic speech attribute information, such as speaker, language, gender, age, emotion, etc. The core technical question behind it is utterance-level supervised learning based on text-independent or text-dependent speech signals of flexible duration. In section 1, we will first formulate the problem of speaker and language recognition. In section 2, we introduce the traditional framework with different modules in a pipeline, namely feature extraction, representation, variability compensation and backend classification. Then we introduce the end-to-end idea and compare it with the traditional framework. We will show the correspondence between feature extraction and CNN layers, representation and the encoding layer, and backend modeling and fully connected layers. Specifically, we will introduce the modules in the end-to-end frameworks in more detail here, e.g. variable-length data loader, frontend convolutional network structure design, encoding (or pooling) layer design, loss function design, data augmentation design, transfer learning and multitask learning, etc. In section 4, we will introduce some robust methods using the end-to-end framework for far-field and noisy conditions. Finally, we will connect the introduced end-to-end frameworks to other related tasks, e.g. speaker diarization, paralinguistic speech attribute recognition, anti-spoofing countermeasures, etc.

Ming Li received his Ph.D. from the University of Southern California in 2013. His research interests are speech processing and multimodal behavior signal analysis. He has served as a member of the APSIPA speech and language processing committee and as an area chair for INTERSPEECH 2016 and 2018. Works co-authored with his colleagues have won first prizes at the Body Computing Slam Contest 2009, the INTERSPEECH 2011 Speaker State Challenge and the INTERSPEECH 2012 Speaker Trait Challenge, and best paper awards at IEEE DCOSS 2009, ISCSLP 2014 and IEEE CPTECE 2018. He received the IBM faculty award in 2016 and the ISCA Computer Speech and Language best journal paper award in 2018.

Preserving privacy in speaker and speech characterization
Andreas Nautsch (EURECOM, Sophia Antipolis, France)
Co-contributors: Abelino Jiménez, Amos Treiber, Jascha Kolberg, Catherine Jasserand, Els Kindt, Héctor Delgado, Massimiliano Todisco, Mohamed Amine Hmani, Aymen Mtibaa, Mohammed Ahmed Abdelraheem, Alberto Abad, Francisco Teixeira, Driss Matrouf, Marta Gomez-Barrero, Dijana Petrovska-Delacrétaz, Gérard Chollet, Nicholas Evans, Thomas Schneider, Jean-François Bonastre, Bhiksha Raj, Isabel Trancoso, Christoph Busch
Tuesday, 17 September, 1600–1640, Main Hall

The survey addresses recent work that has the aim of preserving privacy in speech communication applications. The talk discusses recent privacy legislation in the US and especially the European Union, and focuses upon the GDPR (EU Regulation 2016/679) and the Police Directive (EU Directive 2016/680), covering also ‘Privacy by Design’ and ‘Privacy by Default’ policy concepts. Emphasis is placed on voice biometrics and non-biometric speech technology. Since there is no “one size fits all” solution, specific cryptographic solutions to privacy preservation are highlighted. Among other classification tasks, voice biometrics can intrude on privacy when misused; the talk surveys a number of privacy safeguards. The international standard for biometric information protection is reviewed and figures of merit are proposed regarding, e.g., the extent to which privacy is preserved. More interdisciplinary efforts are necessary to reach a common understanding between speech technology, legislation, and cryptography communities (among many others). Future challenges include the need to not only carry out decision inference securely, but also to preserve privacy, where cryptographic methods need to meet the demands of speech signal processing. In communication, speech is a medium, not a message.

Andreas Nautsch is with the Audio Security and Privacy research group (EURECOM). From 2014 to 2018, he was with the da/sec Biometrics and Internet-Security research group (Hochschule Darmstadt) within the German National Research Center for Applied Cybersecurity. He served as an expert delegate to ISO/IEC and as project editor of the ISO/IEC 19794-13:2018 standard. He is a co-organizer of the ASVspoof 2019 evaluation, the related special sessions at INTERSPEECH 2019 and ASRU 2019, and is a Guest Editor of the related CSL special issue. He is a co-initiator of the emerging ISCA SIG on Security & Privacy.

Prosody research and applications: the state of the art
Nigel G. Ward (University of Texas at El Paso, TX, USA)
Wednesday, 18 September, 1000–1040, Main Hall

Prosody is essential in human interaction and relevant to every area of speech science and technology. Our understanding of prosody, although still fragmentary, is rapidly advancing. This survey will give non-specialists the knowledge needed to decide whether and how to integrate prosodic information into their models and systems. It will start with the basics: the paralinguistic, phonological and pragmatic functions of prosody, its physiology and perception, commonly and less-commonly-used prosodic features, and the three main approaches to modeling prosody. Regarding practical applications, it will give an overview of ways to use prosody in speech recognition, speech synthesis, dialog systems, and the inference of speaker states and traits. Recent trends will then be presented, including modeling pitch as more than a single scalar value, modeling prosody beyond just intonation, representing prosodic knowledge with constructions of multiple prosodic features in specific temporal configurations, modeling observed prosody as the result of the superposition of patterns representing independent intents, modeling multi-speaker phenomena, and the use of unsupervised methods. Finally, we will consider remaining challenges in research and applications.

Nigel G. Ward is the author of Prosodic Patterns in English Conversation, Cambridge University Press, 2019, and the chair of ISCA's Speech Prosody SIG. His research focuses on models of and applications involving prosody. He co-taught the Introduction to Prosody short course at the 2019 Linguistics Institute. Ward received his Ph.D. from the University of California at Berkeley in 1991 and was a member of the Engineering faculty at the University of Tokyo for ten years before taking up his current position at the University of Texas at El Paso. In 2015-2016 he was a Fulbright Scholar and Visiting Professor at Kyoto University.

Recognition of foreign-accented speech: challenges and opportunities for human and computer speech communication
Ann Bradlow (Northwestern University, Evanston, IL, USA)
Wednesday, 18 September, 1330–1410, Main Hall

This presentation will consider the causes, characteristics, and consequences of second-language (L2) speech production through the lens of a talker-listener alignment model. Rather than focusing on L2 speech as deviant from the L1 target, this model views speech communication as a cooperative activity in which interlocutors adjust their speech production and perception in a bi-directional, dynamic manner. Three lines of support will be presented. First, principled accounts of salient acoustic-phonetic markers of L2 speech will be developed with reference to language-general challenges of L2 speech production and to language-specific L1-L2 structural interactions. Next, we will examine recognition of L2 speech by listeners from various language backgrounds, noting in particular that for L2 listeners, L2 speech can be as intelligible as (or sometimes more intelligible than) L1 speech. Finally, we will examine perceptual adaptation to L2 speech by L1 listeners, highlighting studies that focused on interactive, dialogue-based test settings where we can observe the dynamics of talker adaptation to the listener and vice versa. Throughout this survey, I will refer to current methodological and technical developments in corpus-based phonetics and interactive testing paradigms that open new windows on the dynamics of speech communication across a language barrier.

Ann Bradlow received her PhD in Linguistics from Cornell University (1993). She completed postdoctoral fellowships in Psychology at Indiana University (1993-1996) and Hearing Science at Northwestern University (1996-1998). Since 1998, Bradlow has been a faculty member in the Linguistics Department at Northwestern University (USA), where she directs the Speech Communication Research Group (SCRG). The group pursues an interdisciplinary research program in acoustic phonetics and speech perception with a focus on speech intelligibility under conditions of talker-, listener-, and situation-related variability. A central line of current work investigates bilingual speech production and perception, with a focus on perceptual adaptation to foreign-accented speech.

Multi-modal processing of speech and language
Florian Metze (Carnegie Mellon University, Pittsburgh, PA, USA)
Wednesday, 18 September, 1600–1640, Main Hall

Human information processing is inherently multimodal. Speech and language are therefore best processed and generated in a situated context. Future human language technologies must be able to jointly process multimodal data, and not just text, images, acoustics or speech in isolation. Despite advances in Computer Vision, Automatic Speech Recognition, Multimedia Analysis and Natural Language Processing, state-of-the-art computational models do not integrate multiple modalities anywhere near as effectively and efficiently as humans. Researchers are only beginning to tackle these challenges in “vision and language” research. In this talk, I will show the potential of multi-modal processing to (1) improve recognition for challenging conditions (e.g. lip-reading), (2) adapt models to new conditions (e.g. context or personalization), (3) ground semantics across modalities or languages (e.g. translation and language acquisition), (4) train models with weak or non-existent labels (e.g. SoundNet or bootstrapping of recognizers without parallel data), and (5) make models interpretable (e.g. representation learning). I will present and discuss significant recent research results from each of these areas and will highlight the commonalities and differences. I hope to stimulate exchange and cross-fertilization of ideas by presenting not just abstract concepts, but by pointing the audience to new and existing tasks, datasets, and challenges.

Florian Metze is an Associate Research Professor at Carnegie Mellon University, in the School of Computer Science’s Language Technologies Institute. His work covers many areas of speech recognition and multimedia analysis with a focus on end-to-end deep learning. Currently, he focuses on multimodal processing of speech in how-to videos, and information extraction from medical interviews. He has also worked on low resource and multilingual speech processing, speech recognition with articulatory features, large-scale multimedia retrieval and summarization, along with recognition of personality or similar meta-data from speech.

Realistic physics-based computational voice production
Oriol Guasch (La Salle, Universitat Ramon Llull, Barcelona, Spain)
Co-contributors: Marc Arnela, Arnau Pont, Francesc Alías, Marc Freixas, Joan-Claudi Socoró
Thursday, 19 September, 1000–1040, Main Hall

Simulating the very complex physics of voice on realistic vocal tract geometries looked daunting a few years ago but has recently experienced a very significant boom. Earlier works mainly dealt with vowel production. Solving the wave equation in a three-dimensional vocal tract suffices for that purpose. As we depart from vowels, however, things quickly get harder. Simulating a few milliseconds of sibilant /s/ demands high-performance computers to resolve the sound generated by turbulent eddies. Producing a diphthong implies dealing with dynamic geometries. A syllable like /sa/ seems out of reach of current computation capabilities, though some modelling techniques inspired by one-dimensional approaches may lead to more than acceptable results. The shaping of dynamic vocal tracts should be linked to biomechanical models to gain flexibility and achieve a more complete representation of how we humans generate voice. Besides, including phonation in the computations implies resolving the vocal fold self-oscillations and the very demanding coupling of the mechanical, fluid and acoustic fields. Finally, including naturalness in computational voice generation is a newborn and challenging task. In this talk, a general overview of realistic physics-based computational voice production will be given. Current achievements and remaining challenges will be highlighted and discussed.

Oriol Guasch is a full professor at La Salle, Universitat Ramon Llull, in Barcelona, where he heads the research on acoustics. He holds a five-year degree in Physics and a PhD in Computational Mechanics and Applied Mathematics. His research lines involve computational methods for numerical voice production, acoustic black holes in mechanics, graph theory in vibroacoustics, transmission path analysis and parametric arrays. Prof. Guasch has authored about 50 journal articles, 70 conference papers and 2 patents. He was the general chair of the 6th NOVEM congress, in 2018, and currently serves as subject editor for the Journal of Sound and Vibration.

Reaching over the gap: cross- and interdisciplinary research on human and automatic speech processing
Odette Scharenborg (Delft University of Technology, the Netherlands)
Thursday, 19 September, 1330–1410, Main Hall

The fields of human speech recognition (HSR) and automatic speech recognition (ASR) both investigate parts of the speech recognition process and have word recognition as their central issue. Although the research fields appear closely related, their aims and research methods are quite different. Despite these differences, interest in possible cross-fertilization has been growing over the past two decades. Researchers from both ASR and HSR are realizing the potential benefit of looking at the research field on the other side of the ‘gap’. In this survey talk, I will provide an overview of past and present efforts to link human and automatic speech recognition research and present an overview of the literature describing the performance difference between machines and human listeners. The focus of the talk is on the mutual benefits to be derived from establishing closer collaborations and knowledge interchange between ASR and HSR.

Odette Scharenborg is an Associate Professor and Delft Technology Fellow at Delft University of Technology, the Netherlands. Her research interests focus on narrowing the gap between automatic and human spoken-word recognition. In particular, she is interested in the question of where the difference between human and machine recognition performance originates, and whether it is possible to narrow this performance gap. She co-organized the INTERSPEECH 2008 Consonant Challenge, which aimed at promoting comparisons of human and machine speech recognition in noise. In 2017, she was elected to the ISCA board, and in 2018 to the IEEE Speech and Language Processing Technical Committee.