International Conference
on Speech and Computer
Athens, Greece
September 20-24, 2015


Gerhard Rigoll

Institute for Human-Machine-Communication TU Munich

Multimodal Human-Robot Interaction from the Perspective of a Speech Scientist

Human-robot interaction (HRI) is a research area that has developed steadily over the last decades with the rise of robotics and of multimodality in classical human-computer interaction (HCI). In the industrial environments where robots are mostly deployed today, such as manufacturing in the automotive industry, direct communication between humans and robots has so far not played a major role. The same is true for autonomous robotics, where robots navigate independently through complex environments. The importance of HRI changed dramatically, however, when the new research direction of service robotics became increasingly popular. Service robots are designed to solve problems in cooperation with humans, so effective communication between the two is crucial. In the first part of this talk, the major differences between classical human-machine communication and human-robot interaction will be discussed, and from these the most important challenges in HRI will be derived. The first major difference from classical HCI is that a robot is typically a mobile device, so interaction will mostly be carried out over a certain distance between human and robot. In almost all cases this places strong demands on the robustness of every communication channel, especially the acoustic interaction modality. An even more important difference is the “embodiment effect”: a robot typically has a body that can be more or less directly involved in the interaction process. One consequence is that it is possible not only to interact with robots in much the same way as with, e.g., smartphones, but also to actively cooperate with them physically, e.g. by carrying something together or jointly mounting a heavy piece. Embodiment also has other implications, e.g. concerning social behavior and acceptance issues.
In the remaining part, this talk will cover other important issues of HRI, e.g. the impact of modalities that play a less important role in classical HCI, such as the design of gaze, face or facial expressions as output modalities. Concerning typical input modalities, the talk will then investigate the role of speech communication as well as acoustic scene analysis for HRI, and it will be shown that the acoustic modality has a direct connection to the area of social robotics. Finally, an additional aspect of HRI will be discussed that applies even to the aforementioned industrial manufacturing robots, which have so far never employed a sophisticated communication channel between robot and human: the role of HRI during the learning or training phase of any kind of robot, e.g. for the purpose of imitation learning or programming by demonstration.

Yannis Stylianou

Computer Science Dept. Univ. of Crete, and Toshiba, Cambridge Research Lab, Cambridge UK

Speech Intelligibility

Speech output, whether live, recorded or generated from text, is increasingly used in a range of applications, including public address systems, vehicle navigation devices and mobile phones. Maintaining intelligibility in such settings without resorting to increases in output level is a challenge, particularly in the presence of additive and convolutional distortions. Human talkers appear to adapt their speech generation strategies to the immediate context at a number of levels, resulting in changes to the acoustic, phonetic and linguistic content of speech. Recently, speech modification algorithms designed to promote intelligibility have been proposed, and useful gains in intelligibility in noise have been reported.

The purpose of the talk is to present the main current state-of-the-art approaches, as evaluated within a common framework referred to as the Hurricane Challenges. The talk will quantify the effect on intelligibility of modifications to both natural and synthetic speech under energy and durational constraints.

Finally, I will focus on a particularly successful approach for boosting the intelligibility of speech in noise, referred to as SSDRC (Spectral Shaping and Dynamic Range Compression). Applications of SSDRC to natural and synthetic speech will be shown, and real-time demos will be provided.

Murat Saraclar

Bogazici University


This talk will summarize the research on discriminative language modeling, focusing on its application to automatic speech recognition (ASR). Discriminative language modeling is a feature-based approach that complements traditional generative n-gram language modeling. A discriminative language model (DLM) is typically a linear or log-linear model consisting of a weight vector associated with a feature vector representation of a sentence. This flexible representation can include linguistically and statistically motivated features that incorporate morphological and syntactic information. At test time, DLMs are used to rerank the output of an ASR system, represented as an n-best list or lattice. During training, both negative and positive examples are used with the aim of directly optimizing the error rate. Various machine learning methods, including the structured perceptron, large margin methods and maximum regularized conditional log-likelihood, have been used for estimating the parameters of DLMs. Typically, positive examples for DLM training come from the manual transcriptions of acoustic data, while the negative examples are obtained by processing the same acoustic data with an ASR system. Recent extensions to DLMs attempt to generalize their use by either using automatic transcriptions for the positive examples or simulating the negative examples. Discriminative language modeling outperforms the conventional approaches, partly due to the improved parameter estimates with discriminative training and partly due to using features that can reflect complex language characteristics.
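
As a rough illustration of the reranking setup described above, the following minimal sketch trains a linear DLM with a plain structured perceptron on a toy n-best list. Everything specific here is an assumption made for illustration and not taken from the talk: the unigram and bigram features, the baseline weight `a0`, the choice of the lowest-error hypothesis as the positive example, the function names, and the toy data. Practical systems often use an averaged perceptron and much richer features.

```python
from collections import Counter

def features(sentence):
    """Sparse feature vector: unigram and bigram counts (illustrative choice)."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    feats = Counter(words[1:-1])          # unigram counts
    feats.update(zip(words, words[1:]))   # bigram counts, keyed by tuples
    return feats

def score(weights, hyp, a0=1.0):
    """Linear model: a0 * baseline ASR score + dot(weights, features)."""
    text, asr_score = hyp
    dot = sum(weights.get(f, 0.0) * v for f, v in features(text).items())
    return a0 * asr_score + dot

def rerank(weights, nbest):
    """Return the highest-scoring (text, asr_score) hypothesis."""
    return max(nbest, key=lambda h: score(weights, h))

def word_errors(ref, hyp):
    """Word-level Levenshtein distance between reference and hypothesis."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (rw != hw)))  # substitution or match
        prev = cur
    return prev[-1]

def train_perceptron(data, epochs=5):
    """data: list of (reference, nbest); nbest: list of (text, asr_score)."""
    weights = {}
    for _ in range(epochs):
        for ref, nbest in data:
            # Positive example: the hypothesis with the fewest word errors.
            oracle = min(nbest, key=lambda h: word_errors(ref, h[0]))
            # Negative example: the current top-ranked hypothesis, if wrong.
            best = rerank(weights, nbest)
            if best[0] != oracle[0]:
                for f, v in features(oracle[0]).items():
                    weights[f] = weights.get(f, 0.0) + v
                for f, v in features(best[0]).items():
                    weights[f] = weights.get(f, 0.0) - v
    return weights

# Toy data (hypothetical): the baseline ASR score prefers the wrong hypothesis.
nbest = [("wreck a nice beach", -1.0), ("recognize speech", -1.5)]
data = [("recognize speech", nbest)]
weights = train_perceptron(data)
print(rerank(weights, nbest)[0])  # prints "recognize speech"
```

With empty weights the reranker simply returns the baseline's top hypothesis; after one perceptron update the features of the lower-error hypothesis outweigh the baseline score difference, so the ranking flips.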