« Crossroads of Speech and Language »


INTERSPEECH conferences are attended by researchers with a long-term track record in speech sciences and technology, as well as by early-stage researchers and researchers entering a new domain within the Interspeech areas. An important part of the conference is the tutorials held on the first day of the conference, September 15, 2019. Presented by speakers with long and deep expertise in speech (though they have changed their looks and methods in the 150 years since the cartoon on the left appeared), they will provide their audience with a rich learning experience and an exposure to longstanding research problems, contemporary topics of research, and emerging areas.

Date and Venue of the Tutorials

September 15, 2019; two 3-hour sessions, one in the morning and one in the afternoon, at the main conference venue (Messecongress Graz).

Morning tutorials (09:00–12:30)

[T1] Generative adversarial network and its applications to speech signal and natural language processing

The generative adversarial network (GAN) is a new approach to training generative models, in which a generator and a discriminator compete against each other to improve generation quality. Recently, GANs have shown amazing results in image generation, and a wide variety of new ideas, techniques, and applications have been developed on top of them. Although successful cases are still few, GANs have great potential to be applied to text and speech generation to overcome limitations of conventional methods.

This tutorial has three parts. In the first part, we will introduce the generative adversarial network (GAN) and provide a thorough review of the technology. In the second part, we will focus on applications of GANs to speech signal processing, including speech enhancement, voice conversion, speech synthesis, and the application of domain adversarial training to speaker recognition and lip reading. In the third part, we will describe the major challenge of sentence generation with GANs and review a series of approaches that address it. We will also present algorithms that use GANs to achieve text style transfer, machine translation, and abstractive summarization without paired data.
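The generator-discriminator competition described above can be sketched on toy one-dimensional data. This is an illustrative assumption, not material from the tutorial: both networks are reduced to single linear/logistic units so the alternating gradient updates can be written out by hand.

```python
import math
import random

random.seed(0)

# Toy 1-D GAN: the generator maps noise z ~ N(0,1) to a*z + b and must
# imitate "real" data drawn from N(4, 0.5); the discriminator is a
# logistic classifier D(x) = sigmoid(w*x + c).

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

a, b = 1.0, 0.0        # generator parameters
w, c = 0.0, 0.0        # discriminator parameters
lr = 0.02

for step in range(5000):
    x_real = random.gauss(4.0, 0.5)
    z = random.gauss(0.0, 1.0)
    x_fake = a * z + b

    # --- discriminator step: push D(x_real) -> 1, D(x_fake) -> 0 ---
    d_real = sigmoid(w * x_real + c)
    d_fake = sigmoid(w * x_fake + c)
    g_real = d_real - 1.0          # grad of cross-entropy w.r.t. logit, label 1
    g_fake = d_fake                # grad of cross-entropy w.r.t. logit, label 0
    w -= lr * (g_real * x_real + g_fake * x_fake)
    c -= lr * (g_real + g_fake)

    # --- generator step: push D(G(z)) -> 1 (non-saturating loss) ---
    d_fake = sigmoid(w * x_fake + c)
    g = (d_fake - 1.0) * w         # dL/dx_fake via the chain rule
    a -= lr * g * z                # dx_fake/da = z
    b -= lr * g                    # dx_fake/db = 1

print(b)  # the generator's offset; it should drift toward the real mean of 4
```

In practice both players are deep networks and the same alternating scheme is implemented with automatic differentiation, but the adversarial dynamics are the same as in this sketch.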


  • Hung-yi Lee (Department of Electrical Engineering, National Taiwan University)
  • Yu Tsao (Research Center for Information Technology Innovation, Academia Sinica)
[T2] Statistical voice conversion with direct waveform modeling

Statistical voice conversion (VC) has attracted attention as one of the most popular research topics in speech synthesis, thanks to significant progress in fundamental techniques, the development of freely available resources, and its great potential for a wide range of applications. In this tutorial, we will give participants an overview of VC techniques by reviewing the basics of VC and recent progress, especially highlighting direct waveform modeling techniques, which were demonstrated as a promising approach in the Voice Conversion Challenge 2018 (VCC2018). Moreover, we will introduce the freely available software “sprocket”, a statistical VC toolkit, and “PytorchWaveNetVocoder”, a neural vocoder toolkit, making it possible for participants to develop state-of-the-art VC systems as well as the VCC2018 baseline system.


  • Tomoki Toda (Information Technology Center, Nagoya University)
  • Kazuhiro Kobayashi (Information Technology Center, Nagoya University)
  • Tomoki Hayashi (Information Technology Center, Nagoya University)
[T3] Neural machine translation

Although machine translation technology has evolved over more than half a century, it is only in recent years that translation quality has begun approaching human-level performance, thanks to the emergence of neural machine translation (NMT) techniques. NMT is one of the biggest revolutions in the history of machine translation and led to the launch of the Google NMT system in late 2016. In this tutorial, we give an overview of the history, mainstream techniques, and recent advances of NMT.


  • Wolfgang Macherey (Google AI)
  • Yuan Cao (Google AI)
[T4] Biosignal-based speech processing: from silent speech to brain-computer interfaces

Speech production is a complex motor process involving several physiological phenomena, such as the brain, nervous, and muscular activities that drive our respiratory, laryngeal, and articulatory systems. In the last 10 years, an increasing number of studies have proposed to exploit measurements of these activities, called biosignals, to build prosthetic devices, from silent speech interfaces converting (silent) articulation into text or sound, to brain-computer interfaces decoding speech-related brain activity. A third line of research focuses on the use of speech biosignals to provide valuable feedback for speech therapy and language learning.

In this tutorial, we will give an extensive overview of these research domains. We will show that they face common challenges and can thus share common methodological frameworks. Both data and code will be shared with the participants to enable a swift start in this exciting research field.


  • Thomas Hueber (GIPSA-lab/CNRS, Université Grenoble Alpes)
  • Christian Herff (School for Mental Health and Neuroscience, Maastricht University)

Afternoon tutorials (14:00–17:30)

[T5] Generating adversarial examples for speech and speaker recognition and other systems

As neural network classifiers become increasingly successful at various tasks ranging from speech recognition and image classification to various natural language processing tasks and even recognizing malware, a second, somewhat disturbing discovery has also been made. It is possible to fool these systems with carefully crafted inputs that appear to the lay observer to be natural data, but cause the neural network to misclassify in random or even targeted ways.

In this tutorial we will discuss the problem of designing, identifying, and avoiding attacks by such crafted "adversarial" inputs. In the first part, we will explain how the basic training algorithms for neural networks may be turned around to learn adversarial examples, and explain why such learning is nearly always possible. Subsequently, we will explain several approaches to producing adversarial examples to fool systems such as image classifiers, speech recognition and speaker verification systems, and malware detection systems. We will describe both "glass box" approaches, where one has access to the internals of the classifier, and "black box" approaches where one does not. We will subsequently move on to discuss current approaches to identifying such adversarial examples when they are presented to a classifier. Finally, we will discuss recent work on introducing "backdoors" into systems through poisoned training examples, such that the system can be triggered into false behaviors when provided specific types of inputs, but not otherwise.
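The idea of "turning the training algorithm around" can be sketched with a fast-gradient-sign attack on a toy logistic classifier. The weights, input, and epsilon below are illustrative assumptions, not examples from the tutorial: the gradient normally used to update the weights is instead taken with respect to the input, and the input is stepped in the direction that increases the loss.

```python
import math

# FGSM-style attack on a toy logistic classifier D(x) = sigmoid(w.x + b).

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def logit(w, x, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def nll(w, x, b, y):
    # negative log-likelihood for label y in {0, 1}
    p = sigmoid(logit(w, x, b))
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

def fgsm(w, x, b, y, eps):
    # for the logistic loss, the gradient w.r.t. the input is (p - y) * w
    p = sigmoid(logit(w, x, b))
    grad_x = [(p - y) * wi for wi in w]
    # step each feature by eps in the sign of its gradient
    return [xi + eps * math.copysign(1.0, gi) if gi != 0 else xi
            for xi, gi in zip(x, grad_x)]

# hypothetical trained weights and a correctly classified input
w, b = [2.0, -1.0, 0.5], 0.1
x, y = [1.0, 0.2, 0.3], 1

x_adv = fgsm(w, x, b, y, eps=0.5)
print(nll(w, x, b, y), nll(w, x_adv, b, y))  # the loss rises after the attack
```

This is a glass-box attack in the tutorial's terminology, since it reads the model's gradients directly; black-box attacks must estimate this direction from input-output queries alone.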


  • Bhiksha Raj (School of Computer Science, Carnegie Mellon University)
  • Joseph Keshet (Department of Computer Science, Bar-Ilan University)
[T6] Advanced methods for neural end-to-end speech processing – unification, integration, and implementation

An end-to-end neural approach has become a popular alternative to conventional modular approaches in various speech applications, including speech recognition and synthesis. One of the benefits of this end-to-end neural framework is that we can use a unified sequence-to-sequence framework for different speech processing problems, and can also tightly integrate these problems through joint training.

This tutorial aims to introduce various end-to-end speech processing applications by focusing on the above unified framework and several integrated systems (e.g., speech recognition and synthesis, speech separation and recognition, speech recognition and translation) as implemented within a new open source toolkit named ESPnet (end-to-end speech processing toolkit, https://github.com/espnet/espnet).


  • Takaaki Hori (Mitsubishi Electric Research Laboratories)
  • Tomoki Hayashi (Department of Information Science, Nagoya University)
  • Shigeki Karita (NTT Communication Science Laboratories)
  • Shinji Watanabe (Center for Language and Speech Processing, Johns Hopkins University)
[T7] Modeling and deploying dialog systems from scratch using open-source tools

Advances in machine learning, NLP, and far-field speech recognition in recent years have led to a new breed of commercial conversational agents capable of assisting millions of users with a variety of tasks. It is the combination of several underlying technologies, such as Speech Recognition, Language Understanding, Dialog Management and State Tracking, and Response Generation, that makes it possible to build such assistants at scale. Despite active research and development in the field of Spoken Dialog Systems (SDS), hardly any library supports modeling of each of the technologies mentioned above in a coherent and unified experimental setting.

In this tutorial, we will show how to build each component of an SDS and how to orchestrate them all together to quickly obtain both research prototypes and scalable production-ready systems, i.e., from speech to understanding to dialogue management and response generation. This is made possible by recently open-sourced tools that simplify the building of such systems from scratch. We will use publicly available datasets to train the models and to provide a hands-on experience during the tutorial.


  • Alexandros Papangelis (Uber AI)
  • Piero Molino (Uber AI)
  • Chandra Khatri (Uber AI)
[T8] Microphone array signal processing and deep learning for speech enhancement – strong together

While multi-channel speech enhancement was traditionally approached with linear or non-linear time-variant filtering techniques, in recent years neural network-based solutions have achieved remarkable performance through data-driven learning. Even more recently, hybrid techniques, which blend traditional signal processing with deep learning, have been shown to combine the best of both worlds: excellent enhancement performance, while at the same time being resource efficient and amenable to human interpretation thanks to the underlying physical model.

In this tutorial we discuss recent advances in signal processing based and neural network based methods, as well as hybrid techniques, to process multi-channel speech input for enhancement, with a focus on robust automatic speech recognition. This will include, but is not limited to, acoustic beamforming, speech dereverberation, and source separation.
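The signal-processing side of this story can be illustrated with the simplest beamformer, delay-and-sum. The array geometry, delays, and noise levels below are illustrative assumptions for a simulated four-microphone array: aligning the channels and averaging reinforces the coherent source signal while averaging down the incoherent noise.

```python
import math
import random

random.seed(1)

# Delay-and-sum beamforming: each simulated mic observes the same source,
# delayed by a known integer number of samples, plus independent noise.

n_mics, n_samples = 4, 2000
delays = [0, 3, 5, 8]              # known per-mic delays in samples
clean = [math.sin(2 * math.pi * 0.01 * t) for t in range(n_samples)]

# simulate the noisy, delayed observation at each microphone
mics = []
for d in delays:
    chan = [(clean[t - d] if t >= d else 0.0) + random.gauss(0.0, 0.8)
            for t in range(n_samples)]
    mics.append(chan)

# delay-and-sum: undo each mic's known delay, then average across the array
max_d = max(delays)
beamformed = [
    sum(mics[m][t + delays[m]] for m in range(n_mics)) / n_mics
    for t in range(n_samples - max_d)
]

def noise_power(est, ref):
    # mean squared error against the clean reference
    return sum((e - r) ** 2 for e, r in zip(est, ref)) / len(est)

single = noise_power(mics[0], clean)
array = noise_power(beamformed, clean[:len(beamformed)])
print(single, array)  # residual noise power drops after beamforming
```

Real acoustic beamformers must estimate the delays (or, more generally, spatial filter coefficients) from the data, which is exactly where the neural and hybrid methods covered in the tutorial come in.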


  • Reinhold Haeb-Umbach (Department of Communications Engineering, Paderborn University)
  • Tomohiro Nakatani (NTT Communication Science Laboratories)

Questions? Please contact tutorials@interspeech2019.org

The tutorial chairs: