With the recent increase in demand for speech applications, it has become clear that current general-purpose speech synthesis technology is not of a quality that users accept. Many speech applications still use fixed, fully pre-recorded prompts rather than standard text-to-speech (TTS) systems to generate their speech output, because the quality of standard TTS systems is not perceived to be good enough.
Recent improvements in speech synthesis techniques, particularly in the area of so-called ``unit selection synthesis,'' as typified by AT&T's NextGen system [1], have led to higher quality synthesis, but building new voices for such systems remains an expert skill. There is a requirement not simply for high quality speech synthesis, but also for a reliable and efficient means of creating new, customized voices within the system. It is no longer acceptable for all speech technology systems to speak with one of only a few voices or prosodic styles.
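For concreteness, unit selection can be summarized as a search over a database of recorded units: given a target specification $t_1^n$, the synthesizer selects the unit sequence $u_1^n$ that minimizes a combination of target and concatenation costs. This is one widely used formulation; the exact cost components vary from system to system:
\[
\hat{u}_1^n = \arg\min_{u_1^n} \left( \sum_{i=1}^{n} C^{t}(t_i,u_i) + \sum_{i=2}^{n} C^{c}(u_{i-1},u_i) \right)
\]
where $C^{t}$ scores how well a candidate unit matches the desired phonetic and prosodic context, $C^{c}$ scores how smoothly adjacent units join, and the minimization is typically carried out with a Viterbi search.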
In addressing this issue, we at CMU are making the process of building synthetic voices faster and more reliable, while requiring fewer arcane skills. Through the FestVox project [2] we release documentation, tools, scripts, etc. that allow new voices to be built both in the existing, supported languages and in new languages.
In developing techniques for both general diphone synthesis and unit selection, we noted a particular niche where a limited domain can be exploited to greatly improve the reliability of high quality synthesis. In many speech applications, most of the language to be spoken is generated within the system. Despite this, many systems simply pass a raw text string, with no more than perhaps some special punctuation, to a general-purpose TTS system. The result is almost always disappointing: it either sounds quite bored (inappropriate prosodic realization) or the signal quality makes it unattractive. Noting that the quality of unit selection synthesis can be very good, and that the number of bad synthesis examples is much smaller when the sentences are closer to the domain of the recordings, we decided to exploit this by designing recording corpora specifically for each application.
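As a minimal sketch of what designing a corpus for an application can mean in practice, the fragment below expands a set of carrier templates and slot values into a prompt list for recording, so that everything the application can generate lies close to the recorded data. The templates, slot vocabulary, and file name here are illustrative assumptions, not part of any released system; the output loosely follows the FestVox-style ( id "text" ) prompt format.
\begin{verbatim}
import itertools

# Hypothetical templates and slot values for a weather-report
# application; a real system would derive these from the
# application's own output generator.
TEMPLATES = [
    "The weather in {city} is {condition}.",
    "Tomorrow, {city} will be {condition}.",
]
CITIES = ["Boston", "Pittsburgh", "Seattle"]
CONDITIONS = ["sunny", "rainy", "windy"]

def prompts():
    # Expand every template with every slot combination so that
    # all carrier phrases and slot words occur in the recordings.
    for template in TEMPLATES:
        for city, cond in itertools.product(CITIES, CONDITIONS):
            yield template.format(city=city, condition=cond)

# Write the prompt list to be read by the voice talent.
with open("prompts.txt", "w") as out:
    for i, line in enumerate(prompts()):
        out.write('( domain_%04d "%s" )\n' % (i, line))
\end{verbatim}
In a real deployment one would prune this cross-product to a covering subset rather than record every combination, but even this naive expansion guarantees that synthesis requests stay within the domain of the recordings.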