« Crossroads of Speech and Language »


INTERSPEECH conferences are attended by researchers with a long-term track record in speech sciences and technology, as well as by early-stage researchers and researchers interested in a new domain within the Interspeech areas. The tutorials, held on the first day of the conference, September 15, 2019, are an important part of the conference. Presented by speakers with long and deep expertise in speech (although they have changed their looks and methods in the 150 years since the cartoon on the left appeared), they will provide their audience with a rich learning experience and exposure to longstanding research problems, contemporary topics of research, and emerging areas.

Date and Venue of the Tutorials

September 15, 2019; two 3h sessions, in the morning and in the afternoon, in the main conference location (Messecongress Graz).

Morning tutorials (9:00–12:30)

[T1] Generative adversarial network and its applications to speech signal and natural language processing
Sunday, 15 September, 9:00–12:30, Hall 1

The generative adversarial network (GAN) is a recent framework for training models, in which a generator and a discriminator compete against each other to improve the generation quality. Recently, GANs have shown impressive results in image generation, and a wide variety of new ideas, techniques, and applications have been developed based on them. Although there are only a few successful cases so far, GANs have great potential to be applied to text and speech generation to overcome the limitations of conventional methods.

There are three parts in this tutorial. In the first part, we will introduce the generative adversarial network (GAN) and provide a thorough review of this technology. In the second part, we will focus on the applications of GAN to speech signal processing, including speech enhancement, voice conversion, speech synthesis, and the applications of domain adversarial training to speaker recognition and lip reading. In the third part, we will describe the major challenge of sentence generation by GAN and review a series of approaches dealing with the challenge. We will also present algorithms that use GAN to achieve text style transformation, machine translation, and abstractive summarization without paired data.
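For readers new to the idea, the competition between generator and discriminator is commonly written as a minimax objective. The formula below is the standard formulation from the GAN literature, not material taken from the tutorial itself; the symbols G (generator), D (discriminator), p_data (data distribution) and p_z (noise prior) are the usual notation and are assumptions here.

```latex
\min_{G} \max_{D} V(D, G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The discriminator D is trained to assign high probability to real samples x and low probability to generated samples G(z), while the generator G is trained to make D(G(z)) as large as possible.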


  • Hung-yi Lee (Department of Electrical Engineering, National Taiwan University)
  • Yu Tsao (Research Center for Information Technology Innovation, Academia Sinica)

Hung-yi Lee received the M.S. and Ph.D. degrees from National Taiwan University (NTU), Taipei, Taiwan, in 2010 and 2012, respectively. From September 2012 to August 2013, he was a postdoctoral fellow at the Research Center for Information Technology Innovation, Academia Sinica. From September 2013 to July 2014, he was a visiting scientist at the Spoken Language Systems Group of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). He is currently an assistant professor in the Department of Electrical Engineering of National Taiwan University, with a joint appointment at the Department of Computer Science & Information Engineering of the university. His research focuses on machine learning (especially deep learning), spoken language understanding, and speech recognition. He owns a YouTube channel teaching deep learning (in Mandarin) with more than 3M views and 38k subscribers.

Yu Tsao received the B.S. and M.S. degrees in Electrical Engineering from National Taiwan University in 1999 and 2001, respectively, and the Ph.D. degree in Electrical and Computer Engineering from the Georgia Institute of Technology in 2008. From 2009 to 2011, Dr. Tsao was a researcher at the National Institute of Information and Communications Technology (NICT), Japan, where he engaged in research and product development in automatic speech recognition for multilingual speech-to-speech translation. Currently, he is an Associate Research Fellow at the Research Center for Information Technology Innovation (CITI), Academia Sinica, Taipei, Taiwan. He received the Academia Sinica Career Development Award in 2017. Dr. Tsao’s research interests include speech and speaker recognition, acoustic and language modeling, audio coding, and bio-signal processing.

[T2] Statistical voice conversion with direct waveform modeling
Sunday, 15 September, 9:00–12:30, Hall 12

Statistical voice conversion (VC) has attracted attention as one of the most popular research topics in speech synthesis, thanks to significant progress in fundamental techniques, the development of freely available resources, and its great potential for various applications. In this tutorial, we will give participants an overview of VC techniques by reviewing the basics of VC and recent progress, with particular emphasis on direct waveform modeling techniques, which were demonstrated to be a promising approach in the Voice Conversion Challenge 2018 (VCC2018). Moreover, we will introduce freely available software, “sprocket” as a statistical VC toolkit and “PytorchWaveNetVocoder” as a neural vocoder toolkit, making it possible for the participants to develop state-of-the-art VC systems as well as the VCC2018 baseline system.


  • Tomoki Toda (Information Technology Center, Nagoya University)
  • Kazuhiro Kobayashi (Information Technology Center, Nagoya University)
  • Tomoki Hayashi (Information Technology Center, Nagoya University)

Tomoki Toda is a Professor of the Information Technology Center at Nagoya University, Japan. His research interests include statistical approaches to speech, music, and environmental sound processing. He has served as an Associate Editor of the IEEE Signal Processing Letters since 2016. He was a member of the Speech and Language Technical Committee of the IEEE SPS from 2007 to 2009, and from 2014 to 2016. He has received more than 10 paper and achievement awards, including the IEEE SPS 2009 Young Author Best Paper Award and the 2013 EURASIP-ISCA Best Paper Award (Speech Communication Journal).

Kazuhiro Kobayashi (kobayashi.kazuhiro@g.sp.m.is.nagoya-u.ac.jp) received his B.E. degree from the Department of Electrical and Electronic Engineering, Faculty of Engineering Science, Kansai University, Japan, in 2012, and his M.E. and Ph.D. degrees from the Graduate School of Information Science, NAIST, Nara, Japan, in 2014 and 2017, respectively. He is currently working as a postdoctoral researcher at the Information Technology Center, Nagoya University, Aichi, Japan, and as chief executive officer of TARVO, Inc. He has received a few awards, including a Best Presentation Award from the Acoustical Society of Japan. He is a developer of “sprocket”, open-source software for statistical voice conversion.

Tomoki Hayashi received the B.E. degree in engineering and the M.E. and Ph.D. degrees in information science from Nagoya University, Aichi, Japan, in 2014, 2016, and 2019, respectively. He received the Acoustical Society of Japan 2014 Student Presentation Award. His research interests include statistical speech and audio signal processing. He is currently working as a postdoctoral researcher at Nagoya University and the chief operating officer of Human Dataware Lab. Co., Ltd.

[T3] Neural machine translation
Sunday, 15 September, 9:00–12:30, Hall 11

Although machine translation technology has evolved over more than half a century, only in recent years has translation quality begun to approach human-level performance, thanks to the emergence of neural machine translation (NMT) techniques. NMT is one of the biggest revolutions in the history of machine translation and led to the launch of the Google NMT system in late 2016. In this tutorial, we give an overview of the history, mainstream techniques, and recent advancements of NMT.


  • Wolfgang Macherey (Google AI)
  • Yuan Cao (Google AI)

Wolfgang Macherey is a staff research scientist at Google AI and the research tech lead for Google Translate. He joined Google Translate as a research intern in 2005 to work on discriminative training for machine translation for the NIST Machine Translation Evaluation competition, which Google won by a large margin. Since 2006, he has been with Google full-time. Wolfgang Macherey received the M.S. degree in Computer Science and the PhD degree in Computer Science from RWTH Aachen University, Germany, where he worked with Prof. Dr.-Ing. Hermann Ney and Dr. Ralf Schlüter on discriminative training and acoustic modeling for automatic speech recognition. Wolfgang Macherey has authored over 40 publications. His research interests include machine learning, neural modeling, deep learning, machine translation, natural language processing, and automatic speech recognition.

Yuan Cao is a research software engineer at Google AI. He first joined Google in 2015, working on neural machine translation. From 2016 to 2017, he worked at the startup KITT.AI, which focused on speech and language technologies and was later acquired by Baidu Inc. He rejoined Google in late 2017 and has continued working on neural sequence prediction models. Yuan Cao received his PhD from the Center for Language and Speech Processing (CLSP) at Johns Hopkins University in 2015, working with advisor Prof. Sanjeev Khudanpur. Before that, he earned Master's and Bachelor's degrees in Electrical Engineering from Shanghai Jiaotong University in 2008 and 2005, respectively. His research interests include machine learning, natural language processing, and automatic speech recognition.

[T4] Biosignal-based speech processing: from silent speech to brain-computer interfaces
Sunday, 15 September, 9:00–12:30, Hall 2

Speech production is a complex motor process involving several physiological phenomena, such as the brain, nervous, and muscular activities that drive our respiratory, laryngeal, and articulatory systems. In the last 10 years, an increasing number of studies have proposed to exploit measurements of these activities, called biosignals, to build prosthetic devices, from silent speech interfaces converting (silent) articulation into text or sound, to brain-computer interfaces decoding speech-related brain activity. A third line of research focuses on the use of speech biosignals to provide valuable feedback for speech therapy and language learning.

In this tutorial, we will give an extensive overview of these research domains. We will show that they face common challenges and can thus share common methodological frameworks. Both data and code will be shared with the participants to enable a swift start in this exciting research field.


  • Thomas Hueber (GIPSA-lab/CNRS, Université Grenoble Alpes)
  • Christian Herff (School for Mental Health and Neuroscience, Maastricht University)

Thomas Hueber's research focuses on multimodal speech processing, with a special interest in modeling the relationships between articulatory movements and the acoustic signal using machine learning. He develops assistive technologies aimed at restoring oral communication (e.g. silent speech interfaces) or at facilitating the treatment of speech sound disorders (biofeedback systems). After his Ph.D. in Computer Science at UPMC-Sorbonne University (Paris) in 2009, he was appointed a tenured CNRS researcher at GIPSA-lab (Grenoble, France) in 2011. He has co-authored 14 articles in peer-reviewed international journals, more than 30 articles in peer-reviewed international conferences, 3 book chapters, and one patent, and co-edited the IEEE/ACM TASLP special issue on biosignal-based speech processing.

Christian Herff's research focuses on decoding higher-order cognition from neurophysiological data through the application of machine learning. He is particularly interested in the representation of speech processes in various layers of the brain. After receiving his Diploma in Computer Science from Karlsruhe Institute of Technology in 2011 and his PhD in Computer Science from the University of Bremen in 2016, he joined the Department of Neurosurgery at Maastricht University in 2018 to record invasive brain signals and advance the field towards his goal of a speech neuroprosthesis.

Afternoon tutorials (14:00–17:30)

[T5] Generating adversarial examples for speech and speaker recognition and other systems
Sunday, 15 September, 14:00–17:30, Hall 12

As neural network classifiers become increasingly successful at various tasks ranging from speech recognition and image classification to various natural language processing tasks and even recognizing malware, a second, somewhat disturbing discovery has also been made. It is possible to fool these systems with carefully crafted inputs that appear to the lay observer to be natural data, but cause the neural network to misclassify in random or even targeted ways.

In this tutorial we will discuss the problem of designing, identifying, and avoiding attacks by such crafted "adversarial" inputs. In the first part, we will explain how the basic training algorithms for neural networks may be turned around to learn adversarial examples, and explain why such learning is nearly always possible. Subsequently, we will explain several approaches to producing adversarial examples to fool systems such as image classifiers, speech recognition and speaker verification systems, and malware detection systems. We will describe both "glass box" approaches, where one has access to the internals of the classifier, and "black box" approaches where one does not. We will subsequently move on to discuss current approaches to identifying such adversarial examples when they are presented to a classifier. Finally, we will discuss recent work on introducing "backdoors" into systems through poisoned training examples, such that the system can be triggered into false behaviors when provided specific types of inputs, but not otherwise.
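As a rough illustration of how training "may be turned around", the sketch below applies one signed-gradient step (in the style of the well-known fast gradient sign method) to a tiny logistic-regression classifier. It is a minimal "glass box" example under stated assumptions, not the presenters' method: the weights, the input, and the step size `eps` are hypothetical values chosen only for illustration.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def predict(w, b, x):
    """Probability of class 1 under a logistic-regression classifier."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def loss_gradient_wrt_input(w, b, x, y):
    """Gradient of the cross-entropy loss with respect to the input x.

    Training differentiates the loss with respect to the weights; the attack
    keeps the weights fixed and differentiates with respect to the input.
    """
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def signed_gradient_attack(w, b, x, y, eps):
    """Move each input coordinate by eps in the direction that raises the loss."""
    grad = loss_gradient_wrt_input(w, b, x, y)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


# Hypothetical toy classifier and input (assumptions for illustration only).
w, b = [2.0, -1.0, 0.5], 0.0
x, y = [0.2, 0.1, 0.2], 1

x_adv = signed_gradient_attack(w, b, x, y, eps=0.2)
p_clean = predict(w, b, x)      # above 0.5: classified as class 1
p_adv = predict(w, b, x_adv)    # below 0.5: classified as class 0
```

Each coordinate of the input changes by at most `eps`, yet the predicted class flips. Real attacks on speech or image systems work with far larger models and perceptual constraints, but rely on the same reuse of the training gradient.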


  • Bhiksha Raj (School of Computer Science, Carnegie Mellon University)
  • Joseph Keshet (Department of Computer Science, Bar-Ilan University)

Bhiksha Raj is a professor in the School of Computer Science at Carnegie Mellon University. His areas of interest are automatic speech recognition, audio processing, machine learning, and privacy. Dr. Raj is a fellow of the IEEE.

Joseph Keshet is an associate professor in the Dept. of Computer Science at Bar-Ilan University. His areas of interest are machine learning and the computational study of human speech and language. In machine learning, his research has focused on deep learning and structured prediction, while his research on speech and language has focused on speech processing, speech recognition, acoustic phonetics, and pathological speech.

[T6] Advanced methods for neural end-to-end speech processing – unification, integration, and implementation
Sunday, 15 September, 14:00–17:30, Hall 1

An end-to-end neural approach has become a popular alternative to conventional modular approaches in various speech applications including speech recognition and synthesis. One of the benefits of this end-to-end neural framework is that we can use a unified framework for different speech processing problems based on sequence-to-sequence modeling, and can also tightly integrate these problems in a joint training manner.

This tutorial aims to introduce various end-to-end speech processing applications by focusing on the above unified framework and several integrated systems (e.g., speech recognition and synthesis, speech separation and recognition, speech recognition and translation) as implemented within a new open source toolkit named ESPnet (end-to-end speech processing toolkit https://github.com/espnet/espnet).


  • Takaaki Hori (Mitsubishi Electric Research Laboratories)
  • Tomoki Hayashi (Department of Information Science, Nagoya University)
  • Shigeki Karita (NTT Communication Science Laboratories)
  • Shinji Watanabe (Center for Language and Speech Processing, Johns Hopkins University)

Takaaki Hori is a Senior Principal Research Scientist at Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA. He received the B.E., M.E., and Ph.D. degrees in system and information engineering from Yamagata University, Yonezawa, Japan, in 1994, 1996, and 1999, respectively. From 1999 to 2015, he was engaged in research on speech and language technology at NTT Communication Science Laboratories, Kyoto, Japan. He was a Visiting Scientist at the Massachusetts Institute of Technology from 2006 to 2007. In 2015, he joined MERL. He has authored more than 100 peer-reviewed papers in speech and language research fields.

Tomoki Hayashi received the B.E. degree in engineering and the M.E. and Ph.D. degrees in information science from Nagoya University, Aichi, Japan, in 2014, 2016, and 2019, respectively. He received the Acoustical Society of Japan 2014 Student Presentation Award. His research interests include statistical speech and audio signal processing. He is currently working as a postdoctoral researcher at Nagoya University and the chief operating officer of Human Dataware Lab. Co., Ltd.

Shigeki Karita is a research scientist at NTT Communication Science Laboratories, Kyoto, Japan. He received the B.E. and M.E. degrees from Osaka University, Japan, in 2014 and 2016, respectively. He received the Young Researcher's Award from IEICE, Japan, in 2014. His research interests include speech recognition, speech translation, and speech enhancement. Recently, he has been working on semi-supervised training and sequence discriminative training using reinforcement learning for end-to-end ASR. He is also a main developer of ESPnet, where he mainly contributed the PyTorch backend and the Transformer implementation.

Shinji Watanabe is an Associate Research Professor at Johns Hopkins University, Baltimore, MD, USA. He received his B.S., M.S., and PhD (Dr. Eng.) degrees in 1999, 2001, and 2006, respectively, from Waseda University, Japan. He was a research scientist at NTT Communication Science Laboratories, Kyoto, Japan, from 2001 to 2011 and a Senior Principal Research Scientist at Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, from 2012 to 2017. His research interests include automatic speech recognition, speech enhancement, and machine learning for speech and language processing. He has published more than 200 papers in peer-reviewed journals and conferences.

[T7] Modeling and deploying dialog systems from scratch using open-source tools
Sunday, 15 September, 14:00–17:30, Hall 2

Advancements in machine learning, NLP, and far-field speech recognition in recent years have led to a new breed of commercial conversational agents that are capable of assisting millions of users with a variety of tasks. It is the combination of several underlying technologies, such as Speech Recognition, Language Understanding, Dialog Management and State Tracking, and Response Generation, that makes it possible to build such assistants at scale. Despite active research and development in the field of Spoken Dialog Systems (SDS), there is hardly any library that supports modeling for each of the technologies mentioned above in a coherent and unified experimental setting.

In this tutorial, we will show how to build each component of an SDS and how to orchestrate them to quickly obtain both research prototypes and scalable production-ready systems, i.e. from speech to understanding to dialogue management and response generation. This is made possible by the use of recently open-sourced tools that simplify the building of such systems from scratch. We will use publicly available datasets to train the models and to provide hands-on experience during the tutorial.


  • Alexandros Papangelis (Uber AI)
  • Piero Molino (Uber AI)
  • Chandra Khatri (Uber AI)

Alexandros Papangelis is currently with Uber AI, on the Conversational AI team; his interests include statistical dialogue management, natural language processing, and human-machine social interactions. Prior to Uber, he was with Toshiba Research Europe, leading the Cambridge Research Lab team on Statistical Spoken Dialogue. Before joining Toshiba, he was a post-doctoral fellow at CMU's Articulab, working with Justine Cassell on designing and developing the next generation of socially-skilled virtual agents. He received his PhD from the University of Texas at Arlington, MSc from University College London, and BSc from the University of Athens.

Piero Molino is a research scientist focusing on Natural Language Processing. He received a PhD in computer science from the University of Bari in Italy with a thesis on distributional semantics in question answering. He worked for Yahoo Labs in Barcelona on learning to rank for community question answering, mixing language, user modeling, and social network analysis techniques. He then joined IBM Watson in New York and worked on question answering, query autocomplete, and misspelling correction using deep learning. Before joining Uber, he worked at Geometric Intelligence, where his main focus was on grounded language understanding, mixing computer vision and language using deep learning approaches.

Chandra Khatri is a Senior Research Scientist interested in Conversational AI and multi-modal efforts at Uber. Currently, he is interested in making AI systems smarter and more scalable while addressing the fundamental challenges pertaining to understanding and reasoning. Prior to Uber, he was the Lead Scientist at Alexa, driving the science for the Alexa Prize Competition, a university competition for advancing the state of Conversational AI. Prior to Alexa, he was a Research Scientist at eBay, where he led various Deep Learning and NLP initiatives within the eCommerce domain, which led to significant gains for eBay.

[T8] Microphone array signal processing and deep learning for speech enhancement – strong together
Sunday, 15 September, 14:00–17:30, Hall 11

While multi-channel speech enhancement was traditionally approached with linear or non-linear time-variant filtering techniques, in recent years neural network-based solutions have achieved remarkable performance through data-driven learning. Even more recently, hybrid techniques, which blend traditional signal processing with deep learning, have been shown to combine the best of both worlds: achieving excellent enhancement performance, while at the same time being resource efficient and amenable to human interpretation due to the underlying physical model.

In this tutorial we discuss recent advances in signal-processing-based and neural-network-based methods, as well as hybrid techniques, to process multi-channel speech input for enhancement, with a focus on robust automatic speech recognition. This will include, but is not limited to, acoustic beamforming, speech dereverberation, and source separation.


  • Reinhold Haeb-Umbach (Department of Communications Engineering, Paderborn University)
  • Tomohiro Nakatani (NTT Communication Science Laboratories)

Reinhold Haeb-Umbach (haeb@nt.uni-paderborn.de) is a professor of Communications Engineering at Paderborn University, Germany. His main research interests are in the fields of statistical signal processing and pattern recognition, with applications to speech enhancement, acoustic beamforming and source separation, as well as automatic speech recognition and unsupervised learning from speech and audio. He has (co-)authored more than 200 scientific publications, and recently co-authored the book Robust Automatic Speech Recognition: A Bridge to Practical Applications (Academic Press, 2015). He is a fellow of ISCA.

Tomohiro Nakatani (tnak@ieee.org) is a Senior Distinguished Researcher of NTT Communication Science Laboratories, NTT Corporation, Kyoto, Japan. He received the B.E., M.E., and PhD degrees from Kyoto University, Kyoto, Japan, in 1989, 1991, and 2002, respectively. His research interests are in audio signal processing for intelligent human-machine interfaces, including dereverberation, denoising, source separation, and robust ASR. Currently, he is an associate member of the IEEE SPS AASP-TC, and a member of the IEEE SPS SL-TC. He was a co-chair of the 2014 REVERB Challenge Workshop, and a General co-chair of the 2017 IEEE ASRU Workshop.

Questions? Please contact tutorials@interspeech2019.org

The tutorial chairs: