|
Invited speakers
|
|
Head of the Computer Engineering Department, Belarusian
State University of Informatics and Radioelectronics (BSUIR)
|
Instantaneous Harmonic Analysis: Techniques and
Applications to Speech Signal Processing
Parametric speech modeling is a key issue in various speech processing
applications such as text-to-speech synthesis, voice morphing, voice
conversion and others. Building an adequate parametric model is a
complicated problem given the time-varying nature of speech. In this talk
we give an overview of tools for instantaneous harmonic modeling covering
all underlying stages: analysis, morphing and synthesis. We show how
instantaneous analysis can be applied to stationary, frequency-modulated
and quasiperiodic signals in order to extract and manipulate instantaneous
pitch, excitation and spectrum envelope. Some practical results of speech
morphing are given in order to demonstrate the capabilities of the
presented techniques.
|
|
|
Director of Speech Synthesis Innovation, Nuance
|
Creating Expressive TTS Voices for Conversation
Agent Applications
Text-to-Speech has traditionally been viewed as a “black box” component,
where standard “portfolio” voices are typically offered with a professional
but “neutral” speaking style. For commercially important languages, many
different portfolio voices may be offered, all with similar speaking styles.
A customer wishing to use TTS will typically choose one of these voices.
The only alternative is to opt for a “custom voice” solution. In this case,
a customer pays for a TTS voice to be created using their preferred voice
talent. Such an approach allows for some “tuning” of the scripts used to
create the voice. Limited script elements may be added to provide better
coverage of the customer’s expected domain, and “glided phrases” can be
included to ensure that specific phrase fragments are spoken perfectly.
However, even with such an approach, the recording style is strictly
controlled and standard scripts are augmented rather than redesigned from
scratch. The “black box” approach means that TTS systems can be produced
which satisfy the needs of a large number of customers, even if this means
that solutions may be limited in the personas they present. Recent advances
in conversational agent applications have changed people’s expectations of
how a computer voice should sound and interact. Suddenly, it is much more
important for the TTS system to present a persona which matches the goals
of the application. Such systems demand a more flamboyant, upbeat and
expressive voice. The “black box” approach is no longer sufficient: voices
for high-end conversational agents are being
explicitly “designed” to meet the needs of such applications. These voices
are both expressive and light, and a complete contrast to the more
conservative voices available for traditional markets. This presentation
will describe how Nuance is addressing this new and challenging market.
|
|
|
Head of the Speech Communication and Smart Interactions Laboratories,
Department of Telecommunications and Media Informatics, Budapest University
of Technology and Economics
|
Gaps to Bridge in Speech Technology
Although recently there has been significant progress in the general usage
and acceptance of speech technology in several developed countries, there
are still major gaps that prevent the majority of possible users from daily
use of speech technology-based solutions. In the talk I will list some of
them and propose some directions for bridging these gaps.
Perhaps the most important gap is the "black box" thinking of software
developers. They suppose that inputting text into a text-to-speech (TTS)
system will result in voice output that is relevant to the given context of
the application. In the case of automatic speech recognition (ASR), they
expect an accurate text transcription (even punctuation). This ignores the
fact that even humans are strongly influenced by a priori knowledge of the
context, the communication partners, etc. For example, by serially
combining ASR + machine translation + TTS in a speech-to-speech translation
system, a male speaker at a slow speaking rate might be represented by a
fast female voice at the other end. The science of semantic modelling is
still in its infancy. In order to produce successful applications,
researchers of speech technology should find ways to build a priori
knowledge into the application environment and adapt their technologies and
interfaces to the given scenario. This leads us to the gap between generic
and domain-specific solutions. For example, intelligibility and speaking
rate variability are the most important TTS evaluation factors for visually
impaired users, while human-like announcements at a standard rate and
speaking style are required for railway station information systems. A
growing gap is opening between "large" languages/markets and "small" ones.
Another gap is the one between closed and open application environments.
For example, there is hardly any mobile operating system that allows TTS
output re-direction into a live telephone conversation. That is a basic
need for rehabilitation applications for speech-impaired people. Creating
an open platform where "smaller" and "bigger" players of the field could
equally plug in their engines/solutions with proper quality assurance and a
fair share of income could help the situation. In the talk, some examples
will be given of how our teams at BME TMIT try to bridge the gaps listed.
|
|