Special Sessions & Challenges
The Organizing Committee of INTERSPEECH 2019 proudly announces the following special sessions and challenges for INTERSPEECH 2019.
Special sessions and challenges focus on relevant ‘special’ topics which may not be covered in regular conference sessions.
Papers have to be submitted following the same schedule and procedure as regular papers; the papers undergo the same review process by anonymous and independent reviewers.
Styrian Dialects, Continuous Sleepiness, Baby Sounds & Orca Activity
Interspeech ComParE is an open Challenge dealing with states and traits of speakers as manifested in their speech signal’s properties. In this 11th edition, we introduce four new tasks and Sub-Challenges:
- Styrian Dialects Recognition in Spoken Language,
- Continuous Sleepiness Estimation in Speech,
- Baby Sound Recognition,
- Orca Activity Detection.
Sub-Challenges allow contributors to use their own features and their own machine learning algorithms; a standard feature set and tools are also provided and may be used. Participants have five trials on the test set per Sub-Challenge. Participation must be accompanied by a paper presenting the results, which undergoes the Interspeech peer review.
Contributions using the provided or equivalent data are sought, including (but not limited to):
- Participation in a Sub-Challenge
- Contributions centered on the Challenge topics
Results of the Challenge and Prizes will be presented at Interspeech 2019 in Graz, Austria.
Please visit: http://www.compare.openaudio.eu/compare2019/
- Björn Schuller (U Augsburg, Germany / Imperial College, UK / audEERING)
- Anton Batliner (U Augsburg, Germany)
- Christian Bergler (FAU, Germany)
- Florian Pokorny (MU Graz, Austria)
- Jarek Krajewski (U Wuppertal / RUAS Cologne, Germany)
- Meg Cychosz (UC Berkeley, USA)
The VOiCES from a distance challenge will focus on benchmarking and further improving state-of-the-art technologies in the area of speaker recognition and automatic speech recognition (ASR) for far-field speech. The challenge is based on the recently released corpus Voices Obscured in Complex Environmental Settings (VOiCES), where noisy speech was recorded in real reverberant rooms with multiple microphones. Noise sources included babble, music, and television. The challenge will have two tracks for speaker recognition and ASR:
- Fixed System - Training data is limited to specific datasets
- Open System - Participants can use any external datasets they have access to (private or public)
The participating teams will get early access to the VOiCES phase II data, which will form the evaluation set for the challenge. The special session will be dedicated to the discussion of applied technology, performance thereof and any issues highlighted as a result of the challenge.
For more information visit: https://voices18.github.io/Interspeech2019-Special-Session/
- Aaron Lawson (SRI International)
- Colleen Richey (SRI International)
- Maria Alejandra Barros (Lab41, In-Q-Tel)
- Mahesh Kumar Nandwana (SRI International)
- Julien van Hout (SRI International)
The INTERSPEECH 2019 special session on Automatic Speaker Verification Spoofing and Countermeasures (ASVspoof) will accelerate anti-spoofing research for automatic speaker verification (ASV).
The first challenge, ASVspoof 2015, focused on speech synthesis and voice conversion spoofing attacks. The second challenge, ASVspoof 2017, focused on replay spoofing attacks. ASVspoof 2019, the third in the series, will be the first challenge with a broad focus on all three types of spoofing attacks. Continuing the 2015 and 2017 editions, ASVspoof 2019 promotes the development of generalised spoofing countermeasures, namely countermeasures that perform reliably in the face of unpredictable variation in attack types and algorithms.
ASVspoof 2019 has two sub-challenges:
- Logical access and speech synthesis/voice conversion attack:
The data used for ASVspoof 2015 included spoofing attacks generated with the state-of-the-art text-to-speech (TTS) and voice conversion (VC) systems of that time. Since then, both the TTS and VC communities have reported considerable progress. Synthetic speech produced with today’s best technology is now perceptually indistinguishable from bona fide speech. Since these technologies can be used to project convincing speech signals over the telephone, they pose substantial threats to the reliability of ASV. This scenario is referred to as logical access. The assessment of countermeasures, namely automatic systems that can detect non-bona fide, spoofed speech produced with the latest TTS and VC technologies, is therefore urgently needed.
- Physical access and replay attack:
The ASVspoof 2017 database included various types of replayed audio files recorded at several places via many different devices. Progress in the development of countermeasures for replay detection has been rapid, with substantial improvements in performance being reported each year. The 2019 edition of ASVspoof features a distinct physical access and replay attack condition in the form of a far more controlled evaluation setup than that of the 2017 condition. The physical access scenario is relevant not just to ASV, but also to the emerging problem of fake audio detection that is faced in a host of additional applications including voice interaction and authentication with smart objects (e.g. smart-speakers and voice-driven assistants).
In addition, ASVspoof 2019 will adopt a new t-DCF evaluation metric that reflects the impact of spoofing and of countermeasures on ASV performance.
For more details, please see the challenge site at http://www.asvspoof.org
- Junichi Yamagishi (NII, Japan & Univ. of Edinburgh, UK)
- Massimiliano Todisco (EURECOM, France)
- Md Sahidullah (Inria, France)
- Héctor Delgado (EURECOM, France)
- Xin Wang (National Institute of Informatics, Japan)
- Nicholas Evans (EURECOM, France)
- Tomi Kinnunen (University of Eastern Finland, Finland)
- Kong Aik Lee (NEC, Japan)
- Ville Vestman (University of Eastern Finland, Finland)
Typical speech synthesis systems are built with an annotated corpus made of audio from a target voice plus text (and/or aligned phonetic labels). Obtaining such an annotated corpus is costly and not scalable considering the thousands of 'low resource' languages lacking in linguistic expertise or without a reliable orthography.
The ZeroSpeech 2019 challenge addresses this problem by proposing to build a speech synthesizer without any text or phonetic labels, hence 'TTS without T' (text-to-speech without text). In this challenge, we provide raw audio for the target voice(s) in an unknown language, but no alignment, text or labels.
Participants will have to rely on automatically discovered subword units and align them to the voice recording in a way that works best for the purpose of synthesizing novel utterances from novel speakers. The task extends previous challenge editions with the requirement to synthesize speech, which provides an additional objective, thereby helping the discovery of acoustic units that are linguistically useful.
For more information please visit: http://www.zerospeech.com/2019/
- Ewan Dunbar (Laboratoire de Linguistique Formelle, Cognitive Machine Learning [CoML])
- Emmanuel Dupoux (Cognitive Machine Learning [CoML], Facebook A.I. Research)
- Robin Algayres (Cognitive Machine Learning [CoML])
- Sakriani Sakti (Nara Institute of Science and Technology, RIKEN Center for Advanced Intelligence Project)
- Xuan-Nga Cao (Cognitive Machine Learning [CoML])
- Mathieu Bernard (Cognitive Machine Learning [CoML])
- Julien Karadayi (Cognitive Machine Learning [CoML])
- Juan Benjumea (Cognitive Machine Learning [CoML])
- Lucas Ondel (Department of Computer Graphics and Multimedia, Brno University of Technology)
- Alan W. Black (Language Technologies Institute, Carnegie Mellon University)
- Laurent Besacier (Laboratoire d’Informatique de Grenoble, équipe GETALP)
This special session aims to bring together researchers and practitioners from academia and industry working on the challenging task of processing spoken language produced by children.
While recent years have seen dramatic advances in the performance of a wide range of speech processing technologies (such as automatic speech recognition, speaker identification, speech-to-speech machine translation, sentiment analysis, etc.), the performance of these systems often degrades substantially when they are applied to spoken language produced by children. This is partly due to a lack of large-scale data sets containing examples of children's spoken language that can be used to train models, and partly because children's speech differs from adult speech at many levels, including acoustic, prosodic, lexical, morphosyntactic, and pragmatic.
We envision that this session will bring together researchers working in the field of processing children's spoken language for a variety of downstream applications to share their experiences about what approaches work best for this challenging population.
For more information please visit: https://sites.google.com/view/wocci/home/interspeech-2019-special-session
- Keelan Evanini (Educational Testing Service)
- Maryam Najafian (MIT)
- Saeid Safavi (University of Surrey)
- Kay Berkling (Duale Hochschule Baden-Württemberg)
The assessment of voice quality is relevant to the clinical care of disordered voices. It contributes to the selection and optimization of clinical treatment as well as to the evaluation of the treatment outcome. Levels of description of voice quality include the biomechanics of the vocal folds and their kinematics, temporal and spectral acoustic features, as well as the auditory scoring of hoarseness, hyper- and hypo-functionality, creakiness, diplophonia, harshness, etc. Broad and fuzzy definitions of terms regarding voice quality are in use, which impede scientific and clinical communication.
The aim of the special session is to contribute to the improvement of the clinical assessment of voice quality via a translational approach, which focuses on quantifying and explaining relationships between several levels of description. The goal is to objectify voice quality via (i) the analysis and simulation of vocal fold vibrations by means of high-speed videolaryngoscopy in combination with kinematic or mechanical modelling, (ii) the synthesis of disordered voices combined with auditory experimentation involving disordered voice stimuli, as well as (iii) the statistical analysis and automatic classification of distinct types of voice quality via video and/or audio features.
- Philipp Aichinger
- Abeer Alwan
- Carlo Drioli
- Jody Kreiman
- Jean Schoentgen
Research devoted to understanding the relationship between verbal and nonverbal communication modes, and investigating the perceptual and cognitive processes involved in the coding/decoding of emotional states is particularly relevant in the fields of Human-Human and Human-Computer Interaction.
When it comes to speech, it is clear that the same linguistic expression may be uttered for teasing, challenging, stressing, supporting, inquiring, answering or expressing genuine doubt. The appropriate continuation of the interaction depends on detecting the addresser’s mood.
To progress towards a better understanding of such interactional facets, more accurate solutions are needed for defining emotional and empathic contents underpinning daily interactional exchanges, developing signal processing algorithms able to capture emotional features from multimodal social signals and building mathematical models integrating emotional behaviour in interaction strategies.
The themes of this special session are multidisciplinary in nature and closely connected in their final aim of identifying features from realistic dynamics of emotional speech exchanges. Of particular interest are analyses of visual, textual and audio information and corresponding computational efforts to automatically detect and interpret their semantic and pragmatic contents.
A special issue of the journal Computer Speech and Language is planned as an outcome of this special session.
Details can be found on the web page: http://www.empathic-project.eu/index.php/ssinterspeech2019/
- Anna Esposito
- Maria Inés Torres
- Olga Gordeeva
- Raquel Justo
- Zoraida Callejas Carrión
- Kristiina Jokinen
- Gennaro Cordasco
- Björn Schuller
- Carl Vogel
- Alessandro Vinciarelli (Alessandro.Vinciarelli@glasgow.ac.uk)
- Gerard Chollet
- Neil Glackin
While the service quality of speech and audio interfaces can be improved using interconnected devices and cloud services, doing so simultaneously increases the likelihood and impact of threats to the users’ privacy. This special session is focused on understanding the privacy issues that appear in speech and audio interfaces, as well as on the methods available for retaining a level of privacy that is appropriate for the user.
Contributions to this session are invited especially for:
- Privacy-preserving processing methods for speech and audio
- De-identification and obfuscation for speech and audio
- User-interface design for privacy in speech and audio
- Studies and resources on the experience and perception of privacy in speech and audio signals
- Detection of attacks on privacy in speech and audio interfaces
More information at http://speechprivacy2019.aalto.fi
- Tom Bäckström
- Stephan Sigg
- Rainer Martin
Speech technologies exist for many high-resource languages, and attempts are being made to reach the next billion users by building resources and systems for many more languages. Multilingual communities pose many challenges for the design and development of speech processing systems. One of these challenges is code-switching, which is the switching between two or more languages at the conversation, utterance and sometimes even word level.
Code-switching is found in text in social media, instant messaging and blogs in multilingual communities in addition to conversational speech. Monolingual natural language and speech systems fail when they encounter code-switched speech and text. There is a lack of data and linguistic resources for code-switched speech and text. Code-switching provides various interesting challenges to the speech community, such as language modeling for mixed languages, acoustic modeling of mixed language speech, pronunciation modeling and language identification from speech.
The third edition of the special session on speech technologies for code-switching will span these topics, in addition to discussions about data and resources for building code-switched systems.
- Kalika Bali (Researcher, Microsoft Research India)
- Alan W Black (Professor, Language Technologies Institute, Carnegie Mellon University, USA)
- Julia Hirschberg (Professor, Computer Science Department, Columbia University, USA)
- Sunayana Sitaram (Senior Applied Scientist, Microsoft Research India)
- Thamar Solorio (Associate Professor, Department of Computer Science, University of Houston, USA)
The Second DIHARD Speech Diarization Challenge (DIHARD II) is an open challenge on speech diarization in difficult acoustic environments including meeting speech, child language acquisition data, speech in restaurants, and web video. Whereas DIHARD I focused exclusively on diarization from single-channel recordings, DIHARD II will, in conjunction with the organizers of the CHiME challenges, also include tracks focusing on diarization from multichannel recordings of dinner parties.
Submissions are invited from both academia and industry and may use any dataset (publicly available or proprietary) subject to the challenge rules. Additionally, a development set, which may be used for training, and a baseline system will be provided. Performance will be evaluated using diarization error rate (DER) and a modified version of the Jaccard index. If you are interested and wish to be kept informed, please send an email to the organizers at email@example.com and visit the website: https://coml.lscp.ens.fr/dihard/.
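For readers unfamiliar with the two metrics named above, the following is a minimal illustrative sketch, not the official challenge scoring tool. It assumes the commonly used definition of DER (false alarm, missed speech and speaker confusion time divided by total reference speech time) and shows only the plain Jaccard index on sets of frame indices; the modified version used by the challenge is defined in the challenge rules and is not reproduced here. The function names and arguments are hypothetical.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_reference):
    """DER under the common definition (an assumption of this sketch):
    (false alarm + missed speech + speaker confusion) / reference speech time.
    All arguments are durations in the same unit, e.g. seconds."""
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    return (false_alarm + missed + confusion) / total_reference


def jaccard_index(reference_frames, system_frames):
    """Plain Jaccard index |A & B| / |A | B| between two sets of frame indices.
    This is the unmodified index, not the challenge's modified variant."""
    ref, hyp = set(reference_frames), set(system_frames)
    union = ref | hyp
    if not union:
        # Two empty sets are treated as identical in this sketch.
        return 1.0
    return len(ref & hyp) / len(union)


if __name__ == "__main__":
    # Hypothetical durations in seconds for one recording.
    print(diarization_error_rate(2.0, 3.0, 5.0, 100.0))
    # Hypothetical frames attributed to one speaker by reference and system.
    print(jaccard_index(range(0, 10), range(5, 15)))
```

A lower DER is better, whereas a higher plain Jaccard index indicates greater overlap between the reference and system segmentations.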
- Neville Ryant (Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA)
- Alejandrina Cristia (Laboratoire de Sciences Cognitives et Psycholinguistique, ENS, Paris, France)
- Kenneth Church (Baidu Research, Sunnyvale, CA, USA)
- Christopher Cieri (Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA)
- Jun Du (University of Science and Technology of China, Hefei, China)
- Sriram Ganapathy (Electrical Engineering Department, Indian Institute of Science, Bangalore, India)
- Mark Liberman (Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA)