21st International Conference on Speech and Computer


Hynek Hermansky

Professor, The Johns Hopkins University, USA

If You Can’t Beat Them, Join Them

It is often argued that in the processing of sensory signals such as speech, engineering should apply knowledge of the properties of human perception, since both have the same goal of extracting information from the signal. We show, on examples from speech technology, that perceptual research can also learn from advances in technology. After all, speech evolved to be heard, and the properties of hearing are imprinted on speech. Consequently, engineering optimizations of speech technology often yield human-like processing strategies. Our current focus is on finding support for our model of human speech communication, which suggests that the redundancies introduced in speech production to protect the message during its transmission through a realistic noisy acoustic environment are used by human speech perception for reliable decoding of the message. This leads to a particular architecture of an automatic speech recognition (ASR) system, in which longer temporal segments of spectrally smoothed temporal trajectories of spectral energies in individual frequency bands of speech are used to derive estimates of the posterior probabilities of speech sounds. These estimates from reliable frequency bands are then adaptively fused to yield the final probability vectors that best satisfy the adopted performance-monitoring criteria. Some ASR systems that already use elements of the suggested architecture are mentioned in this paper.
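The adaptive fusion step described above can be sketched in a few lines. This is a minimal illustration, not the system's actual implementation: the posterior vectors are made-up numbers, and inverse entropy is used here as a simple stand-in for the performance-monitoring criterion that weights reliable bands more heavily.

```python
import numpy as np

def posterior_entropy(p):
    """Entropy of a posterior vector; low entropy suggests a confident,
    and thus more reliable, per-band estimate."""
    return -np.sum(p * np.log(p + 1e-12))

def fuse_band_posteriors(band_posteriors):
    """Fuse per-band phoneme posterior estimates, weighting each band
    by inverse entropy (a toy performance-monitoring criterion)."""
    weights = np.array([1.0 / (posterior_entropy(p) + 1e-12)
                        for p in band_posteriors])
    weights /= weights.sum()
    fused = sum(w * p for w, p in zip(weights, band_posteriors))
    return fused / fused.sum()  # renormalize to a probability vector

# Hypothetical estimates over 3 speech sounds from 3 frequency bands:
# one clean (confident) band and two noisy (flat) bands.
confident = np.array([0.90, 0.05, 0.05])
noisy = np.array([0.40, 0.30, 0.30])
fused = fuse_band_posteriors([confident, noisy, noisy])
```

Because the confident band has lower entropy, it receives the largest weight, so the fused vector leans toward its decision rather than averaging all bands equally.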


Hynek Hermansky (F'01, SM'92. M'83, SM'78) received the Dr. Eng. Degree from the University of Tokyo, and Dipl. Ing. Degree from Brno University of Technology, Czech Republic. He is the Julian S. Smith Professor of Electrical Engineering and the Director of the Center for Language and Speech Processing at the Johns Hopkins University in Baltimore, Maryland. He is also a Research Professor at the Brno University of Technology, Czech Republic. He is a Life Fellow of the Institute of Electrical and Electronic Engineers (IEEE) IEEE, and a Fellow of the International Speech Communication Association (ISCA), was twice an elected Member of the Board of ISCA, a Distinguished Lecturer for ISCA and for IEEE, and is the recipient of the 2013 ISCA Medal for Scientific Achievement. He has been working in speech processing for over 30 years, mainly in acoustic processing for speech recognition.

Odette Scharenborg

Associate Professor, Delft University of Technology, the Netherlands

The representation of speech in the human and artificial brain

Speech recognition is the mapping of a continuous, highly variable speech signal onto discrete, abstract representations. In both human and automatic speech processing, the phoneme is considered to play an important role. Abstractionist theories of human speech processing assume the presence of abstract, phoneme-like units which, sequenced together, constitute words, while essentially all best-performing, large-vocabulary automatic speech recognition (ASR) systems use phoneme acoustic models. There is, however, ample evidence that phonemes might not be the unit of speech representation during human speech processing. Moreover, phoneme-based acoustic models are known not to deal well with the high variability of speech due to, e.g., coarticulation, faster speaking rates, or conversational speech. The question of how speech is represented in the human or artificial brain, although crucial to both the field of human speech processing and the field of automatic speech processing, has historically been investigated in the two fields separately. I will argue that comparisons between humans and DNNs, and cross-fertilization of the two research fields, can provide valuable insights into the way humans process speech and can improve ASR technology.

Specifically, I will present the results of several experiments carried out on both human listeners and DNN-based ASR systems on lexically guided perceptual learning, i.e., the ability to adapt a sound category on the basis of new incoming information, resulting in improved processing of subsequent information. Both human listeners and machines need to adapt their sound categories whenever a new speaker is encountered, and human listeners have been found to do this very fast. I will explain how listeners adapt to the speech of new speakers, and I will present the results of a lexically guided perceptual learning study we carried out on a DNN-based ASR system, modeled on the human experiments. To investigate the speech representations and adaptation processes in the DNN-based ASR systems, we visualized the activations in the hidden layers of the DNN. These visualizations revealed that the DNNs adapted their phoneme categories in a way similar to what is assumed to happen in the human brain. The same visualization techniques were also used to investigate which speech representations are inherently learned by a naïve DNN. In this study, the input frames were labeled with different linguistic categories: sounds in the same phoneme class, sounds with the same manner of articulation, and sounds with the same place of articulation. The resulting visualizations showed evidence that the DNN learns structures that humans use to understand speech without being explicitly trained to do so.
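The kind of hidden-layer visualization described above can be sketched as follows. This is an illustrative toy, not the study's actual pipeline: the "activations" are synthetic random vectors standing in for DNN hidden-layer outputs on frames of two hypothetical phoneme classes, projected to two dimensions with PCA (via SVD) so that category structure, if present, becomes visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for hidden-layer activations of an acoustic-model DNN:
# 100 frames each from two hypothetical phoneme classes, 50-dim hidden layer.
class_a = rng.normal(loc=0.0, scale=1.0, size=(100, 50))
class_b = rng.normal(loc=2.0, scale=1.0, size=(100, 50))
activations = np.vstack([class_a, class_b])
labels = np.array([0] * 100 + [1] * 100)

# Project the activations onto the top two principal components so each
# frame can be plotted and inspected for phoneme-category structure.
centered = activations - activations.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
projected = centered @ vt[:2].T  # shape: (200, 2)

# If the hidden layer encodes the phoneme classes, the per-class clusters
# separate in the projection.
mean_a = projected[labels == 0].mean(axis=0)
mean_b = projected[labels == 1].mean(axis=0)
```

In practice one would feed labeled speech frames through the trained network, collect the activations of a chosen hidden layer, and color the projected points by phoneme class, manner, or place of articulation to see which distinctions the layer has learned.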


Odette Scharenborg (PhD) is an associate professor and Delft Technology Fellow in the Multimedia Computing Group at Delft University of Technology, the Netherlands. Previously, she was an associate professor at the Centre for Language Studies, Radboud University Nijmegen, the Netherlands, and a research fellow at the Donders Institute for Brain, Cognition and Behaviour at the same university. Her research focuses on narrowing the gap between automatic and human spoken-word recognition. In particular, she is interested in the question of where the difference between human and machine recognition performance originates, and whether it is possible to narrow this performance gap. She investigates these questions using a combination of computational modelling, machine learning, behavioral experimentation, and EEG. In 2008, she co-organized the Interspeech 2008 Consonant Challenge, which aimed at promoting comparisons of human and machine speech recognition in noise in order to investigate where the human advantage in word recognition originates. She was one of the initiators of the EU Marie Curie Initial Training Network “Investigating Speech Processing In Realistic Environments” (INSPIRE, 2012-2015). In 2017, she co-organized a six-week Frederick Jelinek Memorial Summer Workshop on Speech and Language Technology on the topic of the automatic discovery of grounded linguistic units for languages without orthography. In 2017, she was elected to the board of the International Speech Communication Association (ISCA), and in 2018 to the IEEE Speech and Language Processing Technical Committee.

Erol Şahin

Associate Professor, KOVAN Research Lab, Dept. of Computer Engineering, Middle East Technical University, Turkey

Animating Industrial Robots for Human-Robot Interaction

Assembly operations in the production lines of factories, which require fast and fine manipulation of parts and tools, will remain beyond the capabilities of robotic systems in the near future. Hence, robotic systems are predicted not to replace, but to collaborate with, the humans working on the assembly lines in order to increase their productivity. In this talk, I will briefly summarize the vision and goals of our TUBITAK project, titled CIRAK, which aims to develop a robotic manipulator system that helps a human in an assembly task by handing them the proper tools and parts at the right time and in a proper manner. Towards this end, I will share our recent studies in which we try to make a commercial robotic manipulator platform more life-like by making a minimal extension to its look and modifying certain aspects of its behavior. I will present our experimental methodology and initial results from our human-robot interaction experiments.


Dr. Sahin received his PhD in Cognitive and Neural Systems from Boston University in 2000, after receiving his BS in Electrical and Electronics Engineering from Bilkent University in 1991 and his MS in Computer Engineering from METU in 1995. He worked as a post-doctoral researcher at IRIDIA, Université Libre de Bruxelles, before assuming his faculty position at the Dept. of Computer Engineering of METU in 2002.

He founded the KOVAN Research Lab, which currently hosts four faculty members and twelve graduate students. The lab has received more than €2,000,000 in funding from the EU, TUBITAK, and industry. Dr. Sahin’s research over the last decade has focused on swarm robotics, robotic learning, and manipulation. Besides publishing in major conferences and journals, he has edited three journal special issues, three conference proceedings, and two books (one published in Springer’s State-of-the-Art series as the “first book on swarm robotics”). In 2007, he was awarded a free iCub humanoid platform by the RobotCub consortium for his research in robotic learning. Between 2013 and 2015, he visited the Robotics Institute of Carnegie Mellon University, USA, through a Marie Curie International Outgoing Fellowship project on learning for robotic manipulation. Dr. Sahin has served as an Associate Editor of the Adaptive Behavior journal since 2008 and as an Editorial Board member of the Swarm Intelligence journal since 2007.