International Conference on Speech and Computer


Björn Schuller

University of Passau, Germany and Imperial College London, UK

Big Data, Deep Learning – At the Edge of X-Ray Speaker Analysis


Abstract: By the age of two, one has heard roughly a thousand hours of speech; by the age of ten, around ten thousand. Today, an automatic speech recogniser's data hunger is often fed on a similar scale. In stark contrast, however, only a few databases for training a speaker analysis system contain more than ten hours of speech. Yet, these systems are ideally expected to recognise the states and traits of speakers independent of the person, spoken content, language, cultural background, and acoustic disturbances at human parity or even super-human levels. While this has not yet been reached for many tasks such as speaker emotion recognition, deep learning - often described as leading to "dramatic improvements" - in combination with sufficient learning data satisfying the "deep data cravings" holds the promise to get us there. Luckily, every second, more than two hours of video are uploaded to the web, and several hundreds of hours of audio and video communication take place in most languages of the world. If only a fraction of these data were shared and labelled reliably, "x-ray"-like automatic speaker analysis could be around the corner for next-generation human-computer interaction, mobile health applications, and many further benefits to society. In this light, a solution for the most efficient exploitation of the available "big" (unlabelled) data is presented first. Small-world modelling in combination with unsupervised learning helps to rapidly identify potential target data of interest. Next, gamified dynamic cooperative crowdsourcing turns the labelling of these data into an entertaining experience, while reducing the number of required labels to a minimum by learning the labellers' behaviour and reliability alongside the target task. Then, increasingly autonomous deep holistic end-to-end learning solutions are presented for the task at hand.
Benchmarks are given from the 15 research challenges organised by the speaker over the years at Interspeech, ACM Multimedia, and related venues. The concluding discussion will contain some crystal ball gazing alongside practical hints, without leaving out ethical aspects.

Biography: Björn W. Schuller received his diploma, doctoral degree, habilitation, and Adjunct Teaching Professorship, all in electrical engineering and information technology, from TUM in Munich/Germany. At present, he is Full Professor and head of the Chair of Complex and Intelligent Systems at the University of Passau/Germany and Reader (Associate Professor) in Machine Learning at Imperial College London/UK. Further, he is the co-founding CEO of audEERING. He is also a permanent Visiting Professor at the Harbin Institute of Technology/P.R. China among further Associateships. Previous major stations include Joanneum Research in Graz/Austria, and the CNRS-LIMSI in Orsay/France. Dr. Schuller is an elected member of the IEEE Speech and Language Processing Technical Committee, a Senior Member of the IEEE, and was President of the Association for the Advancement of Affective Computing. He (co-)authored 5 books and more than 600 publications (>12000 citations, h-index > 50). He is the Editor in Chief of the IEEE Transactions on Affective Computing and Associate Editor for Computer Speech and Language, IEEE Signal Processing Letters, IEEE Transactions on Cybernetics, and the IEEE Transactions on Neural Networks and Learning Systems. He is also a General Chair of ACII 2019 and ACM ICMI 2014, and a Program Chair of Interspeech 2019, ACII 2015 and 2011, ACM ICMI 2013, and IEEE SocialCom 2012. He won a range of awards, including being honoured as one of 40 extraordinary scientists under the age of 40 by the World Economic Forum in 2015 and 2016. His research has garnered over 8 million EUR in extramural funding: he served as Coordinator or PI in more than 10 European Projects, and is a consultant to companies such as Huawei.

Mark Gales

University of Cambridge, Engineering Department, UK

Low-Resource Speech Recognition and Keyword-Spotting


Abstract: The IARPA Babel program ran from March 2012 to November 2016. The aim of the program was to develop agile and robust speech technology that can be rapidly applied to any human language in order to provide effective search capability on large quantities of real-world data. This talk will describe developments in speech recognition and keyword spotting during the lifetime of the project. Three distinct technical areas will be discussed: 1) the application of deep learning for low-resource speech recognition; 2) data augmentation approaches for audio and text; and 3) efficient approaches for keyword spotting. The talk will give an overview of the research from all the participating sites, though with a bias towards approaches developed and evaluated at Cambridge University. Finally, a brief analysis of the Babel speech corpora, language characteristics, and language performance will be given.

Biography: Mark Gales studied for the B.A. in Electrical and Information Sciences at the University of Cambridge from 1985 to 1988. Following graduation, he worked as a consultant at Roke Manor Research Ltd. In 1991 he took up a position as a Research Associate in the Speech Vision and Robotics group in the Engineering Department at Cambridge University. In 1995 he completed his doctoral thesis, Model-Based Techniques for Robust Speech Recognition, supervised by Professor Steve Young. From 1995 to 1997 he was a Research Fellow at Emmanuel College Cambridge. He was then a Research Staff Member in the Speech group at the IBM T.J. Watson Research Center until 1999, when he returned to Cambridge University Engineering Department as a University Lecturer. He was appointed Reader in Information Engineering in 2004. He is currently a Professor of Information Engineering and a College Lecturer and Official Fellow of Emmanuel College. Mark Gales is a Fellow of the IEEE, a Senior Area Editor of IEEE/ACM Transactions on Audio Speech and Language Processing for speech recognition and synthesis, and a member of the Speech and Language Processing Technical Committee (2015-2017, previously a member from 2001 to 2004). He was an associate editor for IEEE Signal Processing Letters from 2008 to 2011 and IEEE Transactions on Audio Speech and Language Processing from 2009 to 2013. He is currently on the Editorial Board of Computer Speech and Language.