ISSUES IN THE DEVELOPMENT OF THE NEXT GENERATION OF CONCATENATIVE SPEECH SYNTHESIS SYSTEMS


Andrew Breen*

* The School of Information Systems, The University of East Anglia, Norwich. (www.sys.uea.ac.uk)

Introduction

Not so long ago, Text-to-Speech (TTS) synthesis was viewed by many organisations as being of only limited commercial use, relegated to the low revenue sectors of specialist devices for the disabled and novelty applications. In contrast, speech recognition was considered to be fundamental to many applications. This somewhat unbalanced view is changing: while speech recognition is still seen as the dominant speech technology, there is a growing appreciation of the significance of Text-to-Speech synthesis in the market place. Firms are now keen to develop a suite of speech technologies which includes both recognition and synthesis. This change in fortune of TTS has a good deal to do with the development of automated information services and the explosion of multi-media applications. Today speech technology is not seen in isolation, but as part of a complete package which may include a wide diversity of technologies, from advanced dialogues and sophisticated information agents to chat-bots. This change in fortune comes as no surprise to many people working in the field, nor does it alter many of the underlying problems facing these researchers. However, this increased commercial attention has encouraged a growing number of researchers to investigate computationally efficient and less expert-intensive methods of building synthesis systems.
The requirements of a product differ from those of a research tool, not simply in the robustness of code, but more significantly in the focus of work. Again, this tension between the need to undertake basic research and the need to investigate commercially relevant areas is not new, but as those researchers working in recognition found, these pressures can lead to an over-emphasis on one approach. Text-to-Speech synthesis faces many of the challenges encountered by speech recognition a decade ago, when expectation greatly outstripped the ability of the technology to deliver, and the general population did not appreciate the difference between speech understanding and speech recognition. Today there is a greater understanding of the limitations of the technology, but it is still all too tempting to exaggerate what is possible with our current level of understanding.

While commercial organisations are as interested in the ability of TTS systems to perform well at text normalisation and pronunciation as they are in the overall quality of the synthetic voice, there is still a general opinion that TTS systems are only usable for limited applications. The vast majority of applications still require more natural speech and a greater variety of styles and emotion. Work on mark-up has attempted to address the limitations of a plain text interface, but has done little to solve the basic problem of how to maintain a high quality synthetic voice while applying different speaking styles. As a consequence, this paper is concerned with one narrow aspect of Text-to-Speech synthesis: the improvement of voice quality through the progressive improvement of existing approaches to unit selection.

Current approaches to unit selection

Unit selection is in fact two distinct processes: the process of database recording and design, and the process of searching and matching candidate units against a target input stream. These processes, as discussed in (Breen and Jackson 1998a, Breen and Jackson 1998b and Breen and Jackson 1999), are not independent: the ability of the unit selection to select the desired sounds is fundamentally limited by the design and coverage of the database. In fact the process of unit selection has two major constraints: those imposed by the richness of information provided by the higher linguistic components of a system, and the expressive power of the annotation strategy adopted in the database and search algorithms. Apart from these external effects, the unit selection process must perform an optimal selection, which maximises the use of information stored within the database while providing information on deficiencies in coverage and, ideally, adopting strategies to minimise the effect of these deficiencies.
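
As a rough sketch of this search-and-match view (all names, costs and data structures below are invented for illustration; this is not the Laureate algorithm or any published system), candidate matching is commonly framed as a dynamic-programming search that minimises the sum of per-unit target costs and pairwise concatenation costs:

```python
# Illustrative only: unit selection as a Viterbi-style search over
# candidate units, minimising target cost + concatenation cost.

def target_cost(target, candidate):
    """Mismatch between a target feature specification and a database unit."""
    return sum(1.0 for f in target if target[f] != candidate["features"].get(f))

def join_cost(left, right):
    """Penalty for concatenating two database units; contiguous units join freely."""
    return 0.0 if left["next_id"] == right["id"] else 1.0

def select_units(targets, candidates_per_target):
    """Pick one candidate id per target, minimising the total path cost."""
    # best[i][cid] = (cheapest cost ending in candidate cid at position i, backpointer)
    best = [{} for _ in targets]
    for c in candidates_per_target[0]:
        best[0][c["id"]] = (target_cost(targets[0], c), None)
    for i in range(1, len(targets)):
        for c in candidates_per_target[i]:
            cost, prev = min(
                (best[i - 1][p["id"]][0] + join_cost(p, c), p["id"])
                for p in candidates_per_target[i - 1]
            )
            best[i][c["id"]] = (cost + target_cost(targets[i], c), prev)
    # Trace back the cheapest path from the final position.
    last = min(best[-1], key=lambda cid: best[-1][cid][0])
    path = [last]
    for i in range(len(targets) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))
```

Because contiguous database units join at zero cost here, the search naturally favours long stretches of stored speech, which is one simple way of maximising the use of information stored within the database.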

In attempting to define a more systematic method of characterising the requirements of a unit selection process, a metric based on the assessment of knowledge representation schemes within AI is proposed. This system, adapted from (Finlay and Dix, 1999), summarises the requirements of a knowledge representation scheme under four headings: expressiveness, effectiveness, efficiency and explicitness. These headings, adapted for use with the process of unit selection, are explained in more detail below and will be referenced throughout the remainder of the paper.

- Expressiveness: An expressive system is capable of representing a variety of types of information. In the context of unit selection, this means that the system of annotation used to record phonological information within the database, and hence the structures manipulated by the unit selection process, should be rich enough to encode all systematic variation of the speech signal. Expressiveness also relates to clarity of representation. Ideally the annotation strategy should be natural and usable by both engineers and phoneticians. In other words an expressive system should be characterised by completeness and clarity of expression.
- Effectiveness: In order to be effective, a system must be able to manipulate information. In other words the metric used to measure the distance between a fragment of speech stored in the database of sounds and the target linguistic representation should be powerful enough to capture the attributes needed to select the correct fragment, particularly when no exact match can be found.
- Efficiency: The efficiency of a system has a number of interpretations depending upon the constraints imposed by the developers of the TTS system. Any unit selection process should be both computationally and theoretically efficient.
- Explicitness: Finally, a good selection process must be able to provide an explanation of why a particular fragment of speech was selected. In other words the selection process should be traceable (a small sketch of this idea follows the list).
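
One way of making the explicitness requirement concrete, sketched here with invented feature names and weights, is for the distance metric to return a trace alongside its score, so that every selection decision can be justified after the fact:

```python
# Illustrative only: a distance metric that records why each feature
# contributed to the score, making the selection traceable.

def traced_distance(target, candidate, weights):
    score, trace = 0.0, []
    for feature, weight in weights.items():
        t, c = target.get(feature), candidate.get(feature)
        penalty = 0.0 if t == c else weight
        score += penalty
        trace.append((feature, t, c, penalty))
    return score, trace

score, trace = traced_distance(
    {"phone": "ae", "stress": 1, "left_ctx": "k"},
    {"phone": "ae", "stress": 0, "left_ctx": "k"},
    {"phone": 10.0, "stress": 2.0, "left_ctx": 1.0},
)
# The trace shows the stress mismatch cost 2.0; everything else matched.
```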

The issue of effectiveness, as described above, has occupied the attention of a number of researchers, especially regarding the features used in the selection process (Donovan and Woodland 1995, Conkie 1999, Taylor and Black 1999). Today, TTS systems fall into two broad categories: those systems which only consider symbolic phonological features in the unit selection feature set, and those which consider both acoustic and phonological features. Acoustic features include segmental duration, intonation and spectral properties of the candidate units.

All acoustic features suffer from the same problems: the accuracy of their determination and their interaction with the symbolic information. The problem of accuracy is particularly acute when spectral properties are used. Determining the significance of a spectral feature is notoriously difficult, as is determining the time frame over which the spectral analysis should be performed. In general, mixing symbolic information with acoustically derived features leads to problems when the requirements of one feature set conflict with another. For example, imagine a system where both the segmental duration of the target phoneme and its phonological environment are used to select a given phone from a database. If during the process of unit selection no match can be found that satisfies both the duration and the phonological requirements of the target, some form of decision mechanism will need to be employed to weight the relative significance of the partial phonological and duration matches. Furthermore, what happens if the two requirements are in conflict? Which of the two should carry the majority weight?
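
The sketch below makes the difficulty concrete (the weights, feature names and cost functions are invented for illustration): once partial phonological and duration matches are combined linearly, a fixed weight ratio silently arbitrates every such conflict:

```python
# Illustrative sketch: combining partial phonological and duration
# matches with fixed weights. The weights are arbitrary, which is
# exactly the difficulty described in the text.

W_PHON, W_DUR = 1.0, 0.5  # who should carry the majority weight?

def combined_cost(target, candidate):
    phon_cost = 0.0 if candidate["context"] == target["context"] else 1.0
    dur_cost = abs(candidate["duration_ms"] - target["duration_ms"]) / target["duration_ms"]
    return W_PHON * phon_cost + W_DUR * dur_cost

# Two candidates in conflict: one matches the context, the other the duration.
target = {"context": "k_ae_t", "duration_ms": 90}
a = {"context": "k_ae_t", "duration_ms": 140}  # right context, wrong duration
b = {"context": "p_ae_n", "duration_ms": 92}   # wrong context, right duration

# Candidate a wins here (~0.28 vs ~1.01); drop W_PHON below about 0.27
# and candidate b wins instead. The ratio W_PHON/W_DUR decides every conflict.
print(combined_cost(target, a), combined_cost(target, b))
```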
This type of conflict need not exist for the case outlined above. If we assume that the selection procedure is sufficiently expressive, then we need not consider the acoustic signal at all, but rely on the abilities of the abstract model to encode all the systematic segmental duration variability in the signal. It is for this reason that the existing Laureate system uses a purely phonologically motivated selection process. In fact, with an appropriate unit selection process, the need to modify the segmental durations of the speech signal is greatly reduced. Indeed, attempting to modify the segmental durations of a phoneme is likely to result in a noticeable degradation in quality. For small or relatively small databases, however, this is not an option, as the database of sounds may not be rich enough to cover all the major combinations of phonological environment, in which case poor phonological choices will be audible in the resultant prosody.

Database design

As mentioned above, the ability of a unit selection process to perform is dependent upon the completeness and quality of the underlying database of speech, and as discussed in (Breen and Jackson 1999) the process of generating an appropriate database for synthesis is not a simple matter. Unit selection methods of synthesis are not generative, but descriptive. Their power lies in being able to match the richness of the linguistic representation of the input string with the annotation strategy adopted within the database of speech (they exhibit good expressiveness and relatively poor effectiveness). As a result, a great deal of the perceived style or emotional content of a concatenative system is inherent in the database and not a property of the prosody generation process. Any attempt to deviate significantly from this style will result in a reduction of voice quality. This is true even if the prosody models employed to modify the speech data are accurate. The situation is further complicated by the fact that many databases, in an attempt to capture a variety of speech, mix a number of different styles within the same database. Unless these different styles are adequately differentiated, the resulting speech will sound unnaturally stilted, with the patchwork of various styles clearly audible within the generated speech stream. Alternatively, if the higher linguistic components of a synthesiser are unable to recognise these styles from text, then no amount of sophisticated unit selection will compensate for this. As a result the speech database will balloon in size with little actual benefit. It is generally the case that simply increasing the size of the speech database without a clear idea of what is expected will not have a significant impact on quality but will drastically affect the computation and data footprint of the system.

The case for large databases is simple and in many ways compelling. It stems from the belief that the more data available to the unit selection process, the greater the likelihood of a near match and consequently the less need for post-selection signal processing. The ability to reduce the need for signal processing is significant, as any signal processing is likely to reduce the quality of the resulting synthetic speech signal. In summary:
- The larger the database, the greater the coverage (only if done with sufficient care) and hence the greater the chance of finding “representative” samples.

The case against large databases is based on pragmatic issues concerning the efficiency of the technique. These include:
- Large databases are difficult to collect, are prone to annotation errors and are difficult to maintain.
- During the recording of large databases it is difficult to ensure consistent voice quality.
- Large databases are inflexible: changing speakers is a time-consuming and costly process.
- Large databases result in a large data footprint for systems.

Possible ways ahead

This section describes one of the approaches currently being investigated at UEA for improving the quality of concatenative TTS systems. In addition to the obvious aim of trying to improve the quality of synthetic speech, this work has two further basic aims:

- To minimise the size of the speech database, while maximising the potential of the concatenation technique.
- To minimise the need for post-selection signal processing.

These secondary aims could also be considered commercial constraints, as the cost effectiveness of practical TTS systems often boils down to the number of TTS channels available per processor. In addition these aims force researchers to examine the basic principles on which concatenative systems are based. Each subsidiary aim is taken in turn below.

Issues in minimising the size of the speech database


As discussed earlier, it is no simple matter to determine what the minimum requirement of a speech database should be. Laureate, for example, imposed the arbitrary minimum database completeness criterion of having all diphones stored within the inventory, even though the unit selection process could select any sized unit up to and including a triphone. In addition, Laureate's matching process did not even recognise the concept of a unit at all. A better approach to defining the expressiveness of a database would be to tie it to the expressiveness of the unit selection process. In this definition, the database would need to contain sufficient material to ensure that all the expressive power of the selection process was used. In addition, the database should attempt to minimise any properties of the speech signal which were not systematically catered for within the linguistic models underpinning the selection process. This last point is specifically aimed at addressing the issues of speaking style and intonation.
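
As a small, hedged illustration of such a completeness criterion (the phone set, data layout and function names are invented), the diphone coverage of a recording script can be measured directly against a phone inventory:

```python
# Illustrative sketch: measuring diphone coverage of a recording script.
from itertools import product

def diphone_coverage(phone_inventory, transcriptions):
    """Fraction of all possible diphones attested in the transcriptions."""
    needed = set(product(phone_inventory, repeat=2))
    seen = set()
    for phones in transcriptions:              # each a list of phone symbols
        seen.update(zip(phones, phones[1:]))   # consecutive pairs = diphones
    return len(seen & needed) / len(needed)

coverage = diphone_coverage(
    ["k", "ae", "t", "s"],
    [["k", "ae", "t"], ["t", "ae", "s"], ["s", "k", "ae", "t", "s"]],
)
print(f"{coverage:.0%} of diphones covered")  # toy inventory, toy script
```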

Figure 1. The speech signal decomposed into three orthogonal properties: a phonological model, a style model and an intonation model.

To further emphasise this point, consider Figure 1. In this picture the speech signal has been decomposed into three orthogonal properties: those associated with phonological models, intonation models and speaking style models. While this is clearly an over-simplification, the author believes that viewing the signal in this way clarifies some of the basic issues facing researchers wanting to improve unit selection. Figure 1 suggests that the process of unit selection must consider three different, weakly interacting, criteria. The term weakly interacting in this situation does not mean that these properties interact weakly in the speech signal, but that the theories used to describe each of these properties have little to say about the others. In other words, three separate models are needed to fully describe the speech signal. In this situation, the aim of unit selection is to accommodate these different models within one process, which must attempt to maximise the similarity of the selected unit to the target input stream in all three dimensions. Furthermore, unless a new approach is used which unifies these three models, the selection process should not tie the selection based on one model to that of the other two. For example, selection based on phonological similarity should not be tied to that of intonation or speaking style. Each should be viewed as a separate step in the selection process, as the sketch below illustrates.
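
A sketch of this staged view follows (hypothetical throughout; the field names and fallback behaviour are invented): candidates are first constrained by phonological context, the survivors are then ranked by intonational closeness, and finally filtered by speaking style, so that no single weighted sum ties the three models together:

```python
# Illustrative sketch of selection as three separate, weakly
# interacting steps, one per model in Figure 1.

def phonological_filter(target, candidates):
    """Step 1: keep only candidates whose phonological context matches."""
    matching = [c for c in candidates if c["context"] == target["context"]]
    return matching or candidates  # fall back to all candidates if none match

def rank_by_intonation(target, candidates):
    """Step 2: order survivors by closeness of f0 to the target."""
    return sorted(candidates, key=lambda c: abs(c["f0"] - target["f0"]))

def filter_by_style(target, candidates):
    """Step 3: prefer units recorded in the target speaking style."""
    matching = [c for c in candidates if c["style"] == target["style"]]
    return matching or candidates

def select(target, candidates):
    survivors = phonological_filter(target, candidates)
    survivors = rank_by_intonation(target, survivors)
    survivors = filter_by_style(target, survivors)
    return survivors[0]
```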

Minimising the need for post-selection signal processing

As already stated, the current generation of signal processing algorithms is incapable of manipulating the speech signal without introducing audible degradation; the greater the modification, the greater the degradation. This problem is particularly acute for comparatively simple time domain algorithms such as PSOLA, where large f0 movements cause significant degradation in the processed speech signal. As a result, there is a trend to include fundamental frequency as part of the selection process. This trend has resulted in speech databases ballooning in size. The question addressed in this section is how best to include pitch as a unit selection feature while minimising the amount of extra speech that needs to be recorded. Figure 1 suggested that the method of selecting units based on phonological context should be viewed as a separate process to the method of selection based on intonation. From this it follows that all the data needed to specify context must be available at every frequency of interest. In other words, for every frequency value represented in the selection process, there should exist a complete set of context data at that frequency. Stated in this way, it is easy to understand why speech databases are growing so large.

Viewing the process of unit selection in this way enables researchers to balance the commercial cost of increasing the size of the speech database against the potential increase in voice quality that accompanies a reduction in the amount of post-selection signal processing and a larger set of candidate sounds. Consider, for example, a system which uses a PSOLA-style signal-processing component and wishes to improve the voice quality over the fundamental frequency range 50 Hz to 300 Hz. What increase in database size would be required? Given that these algorithms work acceptably well with frequency modifications in the range of an octave, two databases would need to be recorded, one at 100 Hz and one at 200 Hz. As a result the database would double in size.
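
On the assumption implied by this example, that each recorded database tolerates roughly plus or minus one octave of f0 modification around its recording pitch, the number of databases needed to cover a target range reduces to back-of-the-envelope arithmetic (a sketch, not a design rule):

```python
import math

def databases_needed(f_min, f_max, octaves_per_db=2.0):
    """Recorded pitches needed to cover [f_min, f_max], assuming each
    database tolerates +/- one octave of modification (2 octaves total)."""
    span = math.log2(f_max / f_min)
    return math.ceil(span / octaves_per_db)

# 50-300 Hz spans log2(6) ~= 2.58 octaves, so two databases suffice:
# e.g. one recorded near 100 Hz (covering 50-200 Hz) and one near
# 200 Hz (covering 100-400 Hz), the doubling described in the text.
print(databases_needed(50, 300))  # -> 2
```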

Clearly, for a given unit selection and signal processing method, work is needed to determine the best trade-off between computation and database size. Ultimately, for many systems, the trade-off may have as much to do with commercial considerations as with purely quality issues.

References

1. Breen, A. P., Jackson, P., “A phonologically motivated method of unit selection”, Proc. ICSLP 98, 1998a.
2. Breen, A. P., Jackson, P., “Non-uniform unit selection and the similarity metric within BT's Laureate TTS system”, Proc. 3rd International Workshop on Speech Synthesis, Jenolan, 1998b.
3. Breen, A. P., Jackson, P., “Issues in the Design of an Advanced Unit Selection Method for Natural Sounding Concatenative Synthesis”, Proc. 13th International Meeting of the Acoustical Society of America, Berlin, 1999.
4. Conkie, A., “Robust unit selection system for speech synthesis”, Proc. 13th International Meeting of the Acoustical Society of America, Berlin, 1999.
5. Donovan, R., Woodland, P., “Improvements in an HMM-based speech synthesiser”, Proc. EuroSpeech 95, Madrid, 1995.
6. Taylor, P., Black, A., “Speech synthesis by phonological structure matching”, Proc. EuroSpeech 99, Budapest, 1999.
7. Finlay, J., Dix, A., An Introduction to Artificial Intelligence, UCL Press, 1999.
