Invited speakers

Prof. Alexander Petrovsky

Head of the Computer Engineering Department, Belarusian State University of Informatics and Radioelectronics (BSUIR)

Instantaneous Harmonic Analysis: Techniques and Applications to Speech Signal Processing

Parametric speech modeling is a key issue in various processing applications such as text-to-speech synthesis, voice morphing, voice conversion and others. Building an adequate parametric model is a complicated problem given the time-varying nature of speech. In this talk we give an overview of tools for instantaneous harmonic modeling, covering all underlying stages: analysis, morphing and synthesis. We show how instantaneous analysis can be applied to stationary, frequency-modulated and quasiperiodic signals in order to extract and manipulate instantaneous pitch, excitation and spectrum envelope. Some practical results of speech morphing are given to demonstrate the capabilities of the presented techniques.

Dr. Andrew Breen

Director of Speech Synthesis Innovation, Nuance

Creating Expressive TTS Voices for Conversation Agent Applications

Text-to-Speech has traditionally been viewed as a “black box” component, where standard “portfolio” voices are typically offered with a professional but “neutral” speaking style. For commercially important languages, many different portfolio voices may be offered, all with similar speaking styles. A customer wishing to use TTS will typically choose one of these voices. The only alternative is to opt for a “custom voice” solution. In this case, a customer pays for a TTS voice to be created using their preferred voice talent. Such an approach allows for some “tuning” of the scripts used to create the voice. Limited script elements may be added to provide better coverage of the customer’s expected domain, and “glided phrases” can be included to ensure that specific phrase fragments are spoken perfectly. However, even with such an approach, the recording style is strictly controlled, and standard scripts are augmented rather than redesigned from scratch. The “black box” approach means that TTS systems can be produced which satisfy the needs of a large number of customers, even if this means that solutions may be limited in the personas they present. Recent advances in conversational agent applications have changed people’s expectations of how a computer voice should sound and interact. Suddenly, it is much more important for the TTS system to present a persona which matches the goals of the application. Such systems demand a more flamboyant, upbeat and expressive voice. The “black box” approach is no longer sufficient: voices for high-end conversational agents are being explicitly “designed” to meet the needs of such applications. These voices are both expressive and light, and a complete contrast to the more conservative voices available for traditional markets. This presentation will describe how Nuance is addressing this new and challenging market.

Prof. Geza Nemeth

Head of the Speech Communication and Smart Interactions Laboratories, Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics

Gaps to Bridge in Speech Technology

Although there has recently been significant progress in the general usage and acceptance of speech technology in several developed countries, there are still major gaps that prevent the majority of possible users from making daily use of speech technology-based solutions. In the talk I will list some of them and propose some directions for bridging these gaps.

Perhaps the most important gap is the "black box" thinking of software developers. They assume that feeding text into a text-to-speech (TTS) system will produce voice output that suits the given context of the application. In the case of automatic speech recognition (ASR), they expect an accurate text transcription (even punctuation). What is ignored is that even humans are strongly influenced by a priori knowledge of the context, the communication partners, etc. For example, when ASR, machine translation and TTS are combined serially in a speech-to-speech translation system, a male speaker with a slow speaking rate might be represented by a fast female voice at the other end. The science of semantic modelling is still in its infancy. In order to produce successful applications, speech technology researchers should find ways to build a priori knowledge into the application environment and adapt their technologies and interfaces to the given scenario. This leads us to the gap between generic and domain-specific solutions. For example, intelligibility and speaking rate variability are the most important TTS evaluation factors for visually impaired users, while human-like announcements at a standard rate and speaking style are required for railway station information systems. A growing gap is opening between "large" languages/markets and "small" ones. Another gap is the one between closed and open application environments. For example, there is hardly any mobile operating system that allows TTS output to be redirected into a live telephone conversation, which is a basic need for rehabilitation applications for speech-impaired people. Creating an open platform where "smaller" and "bigger" players in the field could equally plug in their engines/solutions, with proper quality assurance and a fair share of income, could help the situation. In the talk, some examples will be given of how our teams at BME TMIT try to bridge the gaps listed.