Title: Statistical approach to speech synthesis: past, present and future
When and where: Monday, 16 September, 9:30–10:30, Main Hall
We are proud to announce that one of the keynote speeches will be delivered by this year’s ISCA medalist Keiichi Tokuda.
Abstract: The basic problem of statistical speech synthesis is quite simple: we have a speech database for training, i.e., a set of speech waveforms and corresponding texts; given a text not included in the training data, what is the speech waveform corresponding to that text? The whole text-to-speech generation process is decomposed into feasible subproblems, usually text analysis, acoustic modeling, and waveform generation, combined as a statistical generative model. Each submodule can be modeled by a statistical machine learning technique: first, Hidden Markov Models were applied to the acoustic modeling module, and then various types of deep neural networks (DNNs) were applied not only to acoustic modeling but also to the other modules. I will give an overview of such statistical approaches to speech synthesis, looking back on their evolution over the last couple of decades. Recent DNN-based approaches have drastically improved speech quality, causing a paradigm shift from the concatenative speech synthesis approach to the generative model-based statistical approach. However, for realizing human-like talking machines, the goal is not only to generate natural-sounding speech but also to flexibly control variations in speech, such as speaker identity, speaking style, and emotional expression. This talk will also discuss these future challenges and directions in speech synthesis research.
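The problem stated at the start of the abstract can be written compactly; the following is a sketch in notation commonly used in the statistical speech synthesis literature (the symbols are ours, not the speaker's). Given training waveforms $\mathcal{X}$ with corresponding texts $\mathcal{W}$, the waveform $\hat{x}$ for a new text $w$ is

```latex
\hat{x} = \arg\max_{x}\, p(x \mid w, \mathcal{X}, \mathcal{W})
        = \arg\max_{x} \int p(x \mid w, \lambda)\, p(\lambda \mid \mathcal{X}, \mathcal{W})\, d\lambda ,
```

where $\lambda$ denotes the parameters of the generative model (e.g. an HMM or a DNN). In practice the integral is usually approximated with a point estimate $\hat{\lambda}$ trained on $(\mathcal{X}, \mathcal{W})$, and the decomposition into text analysis, acoustic modeling, and waveform generation corresponds to factoring this model into submodules.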
Keiichi Tokuda received the B.E. degree in electrical and electronic engineering from the Nagoya Institute of Technology, Nagoya, Japan, and the M.E. and Dr.Eng. degrees in information processing from the Tokyo Institute of Technology, Tokyo, Japan, in 1984, 1986, and 1989, respectively. From 1989 to 1996 he was a Research Associate at the Department of Electronic and Electric Engineering, Tokyo Institute of Technology. From 1996 to 2004 he was an Associate Professor at the Department of Computer Science, Nagoya Institute of Technology, where he is now a Professor. He is also an Honorary Professor at the University of Edinburgh. He was an Invited Researcher at ATR Spoken Language Translation Research Laboratories, Japan, from 2000 to 2013, and was a Visiting Researcher at Carnegie Mellon University from 2001 to 2002 and at Google from 2014 to 2015. He has published over 90 journal papers and over 200 conference papers, and has received seven paper awards and three achievement awards. He was a member of the Speech Technical Committee of the IEEE Signal Processing Society from 2000 to 2003, a member of the ISCA Advisory Council, and an associate editor of IEEE Transactions on Audio, Speech & Language Processing, and acts as an organizer and reviewer for many major speech conferences, workshops and journals. He is an IEEE Fellow and an ISCA Fellow. His research interests include speech coding, speech synthesis and recognition, and statistical machine learning.
Title: Biosignal processing for human-machine interaction
When and where: Tuesday, 17 September, 8:30–9:30, Main Hall
Abstract: Human interaction is a complex process involving modalities such as speech, gestures, motion, and brain activities, which emit a wide range of biosignals that can be captured by a broad panoply of sensors. The processing and interpretation of these biosignals offer an inside perspective on human physical and mental activities and thus complement the traditional way of observing human interaction from the outside. As recent years have seen major advances in sensor technologies integrated into ubiquitous devices, and in machine learning methods to process and learn from the resulting data, the time is right to use the full range of biosignals to gain further insights into the process of human-machine interaction.
In my talk I will present ongoing research at the Cognitive Systems Lab (CSL), where we explore interaction-related biosignals with the goal of advancing machine-mediated human communication and human-machine interaction. Several applications will be described, such as Silent Speech Interfaces that rely on articulatory muscle movement captured by electromyography to recognize and synthesize silently produced speech, as well as Brain Computer Interfaces that use brain activity captured by electrocorticography to recognize speech (brain-to-text) and directly convert electrocortical signals into audible speech (brain-to-speech). I will also describe the recording, processing and automatic structuring of human everyday activities based on multimodal high-dimensional biosignals within the framework of EASE, a collaborative research center on cognition-enabled robotics. This work aims to establish an open-source biosignals corpus for investigating how humans plan and execute interactions, with the aim of facilitating robotic mastery of everyday activities.
Tanja Schultz received her diploma and doctoral degree in Informatics from the University of Karlsruhe, Germany, in 1995 and 2000, respectively. Prior to these degrees, she completed her Master's degree in Mathematics, Sports, and Physical and Educational Science at Heidelberg University, Germany, in 1989. Dr. Schultz is Professor of Cognitive Systems at the University of Bremen, Germany, and adjunct Research Professor at the Language Technologies Institute of Carnegie Mellon University, Pittsburgh, PA, USA. Since 2007 she has directed the Cognitive Systems Lab, where her research activities include multilingual speech recognition and the processing, recognition, and interpretation of biosignals for human-centered technologies and applications. Since 2019 she has been the spokesperson of the University of Bremen's high-profile area "Minds, Media, Machines". Prior to joining the University of Bremen, she was a Research Scientist at Carnegie Mellon (2000-2007) and a Full Professor at the Karlsruhe Institute of Technology, Germany (2007-2015). Dr. Schultz is an Associate Editor of ACM Transactions on Asian Language Information Processing (since 2010), serves on the Editorial Board of Speech Communication (since 2004), and was an Associate Editor of IEEE Transactions on Speech and Audio Processing (2002-2004). She was President (2014-2015) and an elected Board Member (2006-2013) of ISCA, and a General Co-Chair of Interspeech 2006. She was elevated to Fellow of ISCA (2016) and to membership of the European Academy of Sciences and Arts (2017). Dr. Schultz received the Otto Haxel Award in 2013, the Alcatel-Lucent Award for Technical Communication in 2012, the PLUX Wireless Biosignals Award in 2011, and the Allen Newell Medal for Research Excellence in 2002, as well as the ISCA/EURASIP Speech Communication Best Paper Awards in 2001 and 2015.
Title: Physiology and physics of voice production
When and where: Wednesday, 18 September, 8:30–9:30, Main Hall
Abstract: Our knowledge-based societies in the information age are highly dependent on efficient verbal communication. Today most people hold jobs that rely on their communication competence. Consequently, communication disorders have become a worldwide socio-economic factor. To increase quality of life on the one hand and to keep economic costs under control on the other, new medical strategies are needed to prevent communication disorders, enable early diagnosis, and eventually treat and rehabilitate the people affected.
The key issue for communication is phonation, a complex process of voice production taking place in the larynx. According to aeroacoustic principles, the sound is generated by the pulsating air jet and supra-glottal turbulent structures. The laryngeal sound is further filtered and amplified by the supra-glottal acoustic resonance spaces, radiated at the lips, and perceived as voice. There is no doubt that the ability to produce voice is crucial for human communication, although many people do not realize this until they lose their voice temporarily, e.g. due to common respiratory inflammations.
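The filtering and radiation steps described above are often summarized, in the frequency domain, by the classical source-filter relation (a textbook simplification, not a formula from the talk):

```latex
S(f) \;=\; G(f)\, T(f)\, R(f),
```

where $G(f)$ is the spectrum of the glottal source, $T(f)$ the transfer function of the supra-glottal (vocal tract) resonance spaces, and $R(f)$ the radiation characteristic at the lips; the radiated voice spectrum $S(f)$ is their product.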
The talk will focus on the physiology and physics of voice production and will survey the state of the art in computer simulations for pre-surgical prediction of voice quality, as well as in the development of examination and training methods for voice professionals.
Manfred Kaltenbacher received his Dipl.-Ing. in electrical engineering from Graz University of Technology, Austria, in 1992, his Ph.D. in technical sciences from the Johannes Kepler University of Linz, Austria, in 1996, and his habilitation from Friedrich-Alexander-University Erlangen-Nuremberg, Germany, in 2004. In 2008 he became a full professor of Applied Mechatronics at the Alps-Adriatic University of Klagenfurt, Austria. In 2012 he moved to Vienna University of Technology, Austria, as a full professor of Measurement and Actuator Technology. Manfred Kaltenbacher is the author or co-author of a book, seven book chapters, and more than 80 peer-reviewed journal publications. His main research interests focus on advanced Finite Element (FE) methods for multi-physics problems (vibro- and aero-acoustics, magneto-mechanics, and piezoelectrics) and on combined experimental and simulation-based methods for material parameter determination as well as acoustic source localization. He is a member of the Austrian Academy of Sciences, the American Institute of Aeronautics and Astronautics, the European Acoustics Association, the German Society of Acoustics, the Austrian Acoustic Association, the Institute of Electrical and Electronics Engineers, and the International Association of Applied Mathematics and Mechanics. Currently, he is president of the Austrian National Committee for Theoretical and Applied Mechanics, head of the Aeroacoustics section of DEGA, a member of the Editorial Advisory Board of Acta Mechanica, an editor of the Journal of Theoretical and Computational Acoustics, and an associate editor of Acta Acustica united with Acustica.
Title: Learning natural language interfaces with neural models
When and where: Thursday, 19 September, 8:30–9:30, Main Hall
Abstract: In Spike Jonze’s futuristic film “Her”, Theodore, a lonely writer, forms a strong emotional bond with Samantha, an operating system designed to meet his every need. Samantha can carry on seamless conversations with Theodore, exhibits a perfect command of language, and is able to take on complex tasks. She filters his emails for importance, allowing him to deal with information overload, she proactively arranges the publication of Theodore’s letters, and is able to give advice using common sense and reasoning skills.
In this talk I will present an overview of recent progress on learning natural language interfaces which, while not as clever as Samantha, nevertheless allow users to interact with various devices and services using everyday language. I will address the structured prediction problem of mapping natural language utterances onto machine-interpretable representations and outline the various challenges it poses: the translation of natural language to formal language is highly non-isomorphic, data for model training is scarce, and natural language can express the same information need in many different ways. I will describe a general modeling framework based on neural networks which tackles these challenges and improves the robustness of natural language interfaces.
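The paraphrase challenge mentioned in the abstract can be made concrete with a toy example. This is not Prof. Lapata's system: a hand-written, hypothetical normalizer stands in for the learned neural parser, and the logical-form syntax is invented for illustration. The point is only that several surface forms must collapse to one machine-interpretable representation.

```python
def parse(utterance: str) -> str:
    """Map an English request to a (hypothetical) logical form.

    A real natural language interface would use a trained neural
    structured-prediction model here; this toy version handles a
    single information need via keyword matching.
    """
    u = utterance.lower()
    if "capital" in u and "france" in u:
        return "answer(capital(france))"
    raise ValueError("utterance outside this toy domain")

# Three different surface forms of the same information need.
paraphrases = [
    "What is the capital of France?",
    "Name France's capital city.",
    "France's capital is which city?",
]

# All three collapse to a single logical form.
forms = {parse(p) for p in paraphrases}
print(forms)
```

A learned parser must achieve this many-to-one mapping without enumerating paraphrases by hand, which is one reason scarce training data is such an obstacle.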
Mirella Lapata is a professor of natural language processing in the School of Informatics at the University of Edinburgh. Her research focuses on getting computers to understand, reason with, and generate natural language. She is the first recipient (2009) of the British Computer Society and Information Retrieval Specialist Group (BCS/IRSG) Karen Spärck Jones Award and a Fellow of the Royal Society of Edinburgh. She has also received best paper awards at leading NLP conferences and has served on the editorial boards of the Journal of Artificial Intelligence Research, the Transactions of the ACL, and Computational Linguistics. She was president of SIGDAT (the group that organizes EMNLP) in 2018.