21st International Conference on Speech and Computer


Vanessa Evers

Professor, University of Twente, the Netherlands


The classic image in the psychology of Human-Robot Interaction is that of a person who is focused and eager to learn how to work with or control a robot. The job of the roboticist, then, is primarily to avoid errors in detection, manipulation, navigation, decision making, planning, and so on, in order to optimize human-robot collaboration.


In this talk I will argue that social norms embedded in people, robots and the context in which the robots are used make this approach obsolete. Specifically, I will address the following questions:

– How do people understand robot behaviors?

– What do we know about people and robots collaborating?

– Can a robot understand human social behaviors?

– How does knowledge about human social relationships necessitate a change in our thinking about how humans should be modeled?

– How can the design of robots and their behavior improve acceptance of robots in everyday environments such as our homes, airports, museums, schools, roads, and hospitals?

Through examples of practical deployment of robots, I will explore the fundamentally social relationship people have with autonomous robots and offer essential rules for effective human-robot collaboration.


Vanessa Evers is a full Professor of Human Media Interaction at the University of Twente. Her research focuses on the design and development of Socially Intelligent Agents. This concerns human interaction with autonomous agents such as robots or machine learning systems, and cultural aspects of Human-Computer Interaction. She is best known for her work on social robots such as FROG (Fun Robotic Outdoor Guide), SPENCER (the airport service robot), and DE-ENIGMA (a robot for autism education), which can interpret human behavior automatically and respond to people in a socially acceptable way. She is very active in organizing scientific conferences and as an editor of academic journals. She is a speaker on AI and robotics at international events such as the World Economic Forum and a frequent contributor to the media, in newspapers and on TV shows.

Odette Scharenborg

Associate Professor, Delft University of Technology, the Netherlands

The representation of speech in the human and artificial brain

Speech recognition is the mapping of a continuous, highly variable speech signal onto discrete, abstract representations. In both human and automatic speech processing, the phoneme is considered to play an important role. Abstractionist theories of human speech processing assume the presence of abstract, phoneme-like units which, sequenced together, constitute words, while virtually all best-performing, large-vocabulary automatic speech recognition (ASR) systems use phoneme acoustic models. There is, however, ample evidence that phonemes might not be the unit of speech representation during human speech processing. Moreover, phoneme-based acoustic models are known to deal poorly with the high variability of speech due to, e.g., coarticulation, faster speaking rates, or conversational speech. The question of how speech is represented in the human/artificial brain, although crucial in both the field of human speech processing and the field of automatic speech processing, has historically been investigated in the two fields separately. I will argue that comparisons between humans and deep neural networks (DNNs), and cross-fertilization of the two research fields, can provide valuable insights into the way humans process speech and can improve ASR technology.

Specifically, I will present results of several experiments carried out on both human listeners and DNN-based ASR systems on lexically guided perceptual learning, i.e., the ability to adapt a sound category on the basis of new incoming information, resulting in improved processing of subsequent information. Both human listeners and machines need to adapt their sound categories whenever a new speaker is encountered. Human listeners have been found to do this very quickly. I will explain how listeners adapt to the speech of new speakers, and I will present the results of a lexically guided perceptual learning study that we carried out on a DNN-based ASR system, similar to the human experiments. In order to investigate the speech representations and adaptation processes in the DNN-based ASR systems, we visualized the activations in the hidden layers of the DNN. These visualizations revealed that the DNNs showed an adaptation of the phoneme categories similar to what is assumed to happen in the human brain. These visualization techniques were also used to investigate what speech representations are inherently learned by a naïve DNN. In this particular study, the input frames were labeled with different linguistic categories: sounds in the same phoneme class, sounds with the same manner of articulation, and sounds with the same place of articulation. The resulting visualizations showed evidence that the DNN appears to learn structures that humans use to understand speech, without being explicitly trained to do so.


Odette Scharenborg (PhD) is an associate professor and Delft Technology Fellow at the Multimedia Computing Group at Delft University of Technology, the Netherlands. Previously, she was an associate professor at the Centre for Language Studies, Radboud University Nijmegen, the Netherlands, and a research fellow at the Donders Institute for Brain, Cognition and Behavior at the same university. Her research interests focus on narrowing the gap between automatic and human spoken-word recognition. In particular, she is interested in the question of where the difference between human and machine recognition performance originates, and whether it is possible to narrow this performance gap. She investigates these questions using a combination of computational modelling, machine learning, behavioral experimentation, and EEG. In 2008, she co-organized the Interspeech 2008 Consonant Challenge, which aimed at promoting comparisons of human and machine speech recognition in noise in order to investigate where the human advantage in word recognition originates. She was one of the initiators of the EU Marie Curie Initial Training Network “Investigating Speech Processing In Realistic Environments” (INSPIRE, 2012-2015). In 2017, she co-organized a 6-week Frederick Jelinek Memorial Summer Workshop on Speech and Language Technology on the topic of the automatic discovery of grounded linguistic units for languages without orthography. In 2017, she was elected to the board of the International Speech Communication Association (ISCA), and in 2018 to the IEEE Speech and Language Processing Technical Committee.