CN101685633A - Voice synthesizing apparatus and method based on rhythm reference - Google Patents
Voice synthesizing apparatus and method based on rhythm reference
- Publication number
- CN101685633A (application CN200810166002A)
- Authority
- CN
- China
- Prior art keywords
- text
- synthesized
- parameter
- prosodic
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Machine Translation (AREA)
- Document Processing Apparatus (AREA)
Abstract
The present invention provides a speech synthesis apparatus and method based on prosodic reference. The speech synthesis apparatus comprises: a prosody parameter acquisition unit, which obtains natural prosody parameters or approximate natural prosody parameters by analyzing either a recording file obtained by having a natural person read the text to be synthesized aloud, or a prosody parameter annotation file obtained by annotating the text to be synthesized with prosody parameters according to a predetermined annotation standard; and a speech generation unit, which, taking the natural or approximate natural prosody parameters as a reference, selects corresponding speech synthesis units from a pre-recorded speech library for the text to be synthesized and concatenates those units to produce a synthesized speech file corresponding to that text. With the speech synthesis apparatus and method of the present invention, highly natural synthesized speech can be generated according to the user's requirements, rich in emotion and with intonation very close to that of natural speech.
Description
Technical Field
The present invention relates to an apparatus and method for speech synthesis based on prosodic reference. More specifically, it relates to an apparatus and method that synthesize speech of high naturalness by using, as a reference, the cadenced prosodic features obtained either from natural speech or from a prosodic feature annotation file produced according to a specific standard.
Background Art
Speech synthesis (Text-To-Speech, TTS) is the technology of converting text into speech; specifically, it converts arbitrary text into standard, fluent speech. Speech synthesis involves cutting-edge technologies such as natural language processing, prosody, speech signal processing, and auditory perception, and spans disciplines including acoustics, linguistics, and digital signal processing; it is a frontier technology in the field of Chinese information processing.
Speech synthesis technology can be widely applied in industries such as telecommunications, finance, electric power, postal services, and government. It allows users to send and receive e-mail, obtain stock quotes, and check weather, traffic, and road conditions more easily, and in the near future it will support more comprehensive and more valuable application services.
A speech synthesis system is used to synthesize speech of high intelligibility and high naturalness.
Generally speaking, a speech synthesis system first selects certain basic synthesis units, such as phonemes in English or semi-syllables and toned syllables in Chinese. Then, guided by the predictions of a prosody model (duration, fundamental frequency, etc.), it searches a pre-recorded, annotated standard speech library for the globally optimal synthesis units, adjusts and modifies the prosodic characteristics of the selected speech segments with a specific waveform generation technique (such as the TD-PSOLA algorithm), and finally concatenates them into speech that meets the requirements.
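The search for globally optimal units under the guidance of predicted prosody can be sketched as a dynamic-programming (Viterbi-style) minimization over candidate units. The cost weights, the features compared, and the tiny candidate inventory below are illustrative assumptions, not the costs used by any particular system:

```python
# Illustrative unit-selection sketch: for each target syllable, choose the
# library unit that minimizes target cost (distance to the predicted
# duration/F0) plus concatenation cost (pitch mismatch at unit joins).
# Weights and features are hypothetical.

def select_units(targets, candidates, w_dur=1.0, w_f0=0.01, w_join=0.02):
    """targets: list of dicts with predicted 'dur' (ms) and 'f0' (Hz).
    candidates: per-target lists of units, each a dict with 'dur',
    'f0_start', 'f0_end'.  Returns indices of the globally optimal path."""
    n = len(targets)
    cost = [[0.0] * len(candidates[i]) for i in range(n)]
    back = [[0] * len(candidates[i]) for i in range(n)]

    def target_cost(t, u):
        f0_mid = 0.5 * (u["f0_start"] + u["f0_end"])
        return w_dur * abs(t["dur"] - u["dur"]) + w_f0 * abs(t["f0"] - f0_mid)

    for j, u in enumerate(candidates[0]):
        cost[0][j] = target_cost(targets[0], u)
    for i in range(1, n):
        for j, u in enumerate(candidates[i]):
            best_k, best = 0, float("inf")
            for k, prev in enumerate(candidates[i - 1]):
                join = w_join * abs(prev["f0_end"] - u["f0_start"])
                if cost[i - 1][k] + join < best:
                    best, best_k = cost[i - 1][k] + join, k
            cost[i][j] = best + target_cost(targets[i], u)
            back[i][j] = best_k

    # Backtrack from the cheapest final state.
    j = min(range(len(candidates[-1])), key=lambda j: cost[-1][j])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return list(reversed(path))
```

A real system would use richer features (energy, spectral continuity) and search a large library, but the shape of the optimization is the same.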
After more than a decade of research, the quality of speech synthesized by current systems has reached a practical level: intelligibility already satisfies the needs of applications, but naturalness is still not high enough, and a large gap remains between synthesized speech and natural human speech.
Most current speech synthesis systems adopt concatenative synthesis based on large-scale speech databases; that is, using probabilistic and statistical methods and guided by predicted prosody parameters, they search a pre-recorded speech library for the synthesis units that are globally optimal for the input text, and then concatenate those units with waveform adjustment according to the predicted prosody parameters.
Generally speaking, a speech synthesis system comprises the following three modules: a text analysis module, a prosody parameter prediction module, and a back-end synthesis module. For Chinese, the text analysis module performs word segmentation, part-of-speech tagging, phonetic notation, prosodic structure prediction, and so on. The prosody parameter prediction module predicts acoustic parameters such as duration, fundamental frequency, and energy on the basis of the text analysis results. The back-end synthesis module generally consists of a unit selection sub-module and a waveform generation sub-module: guided by the prosody parameters, the unit selection sub-module searches the speech library by probabilistic and statistical methods for the synthesis units that are globally optimal for the input text, and the waveform generation sub-module adjusts and modifies the prosodic characteristics of the selected speech segments with a specific waveform generation technique (such as the TD-PSOLA algorithm), finally concatenating them into speech that meets the requirements.
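This three-module decomposition can be summarized in a minimal skeleton; every function body below is a placeholder (naive per-character units, constant prosody values) standing in for a real component, so only the interfaces between modules are meaningful:

```python
# Skeleton of the conventional TTS pipeline: text analysis -> prosody
# parameter prediction -> back-end synthesis.  All bodies are stubs.

def analyze_text(text):
    # Text analysis: word segmentation, POS tagging, phonetic notation,
    # prosodic structure prediction.  Here: naive per-character "syllables".
    return [{"syllable": ch} for ch in text if not ch.isspace()]

def predict_prosody(units):
    # Prosody prediction: duration / F0 / energy per unit.  Here: constants
    # standing in for a statistical model's output.
    for u in units:
        u.update(dur_ms=180, f0_hz=220.0, energy=1.0)
    return units

def synthesize(units):
    # Back-end synthesis: unit selection plus waveform generation
    # (e.g. TD-PSOLA).  Here: a summary string instead of audio.
    return "|".join(f"{u['syllable']}({u['dur_ms']}ms)" for u in units)

def tts(text):
    return synthesize(predict_prosody(analyze_text(text)))
```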
Duration (phoneme length) is one of the most important prosodic features and is significant for the perceived naturalness of synthesized speech. Variation in duration helps listeners recognize the phoneme itself and also helps them locate word and phrase boundaries in continuous speech, thereby improving the naturalness and intelligibility of speech.
Fundamental frequency is likewise one of the most important prosodic features. It is especially important for Chinese, a tonal language, and is significant for both the perceived naturalness and the intelligibility of synthesized speech.
In natural speech, the duration and fundamental frequency of a phoneme are highly correlated with its surrounding context. Many contextual factors, such as the type of the phoneme itself, the types of the preceding and following phonemes, the levels of the preceding and following prosodic boundaries, and whether the phoneme is stressed, constrain both its duration and its fundamental frequency. The basic aim of duration prediction and fundamental frequency prediction research is to describe the influence of these contextual factors on phoneme duration and fundamental frequency, and thereby improve the naturalness of speech synthesis systems.
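As a toy illustration of such contextual constraints, the sketch below scales a base duration by factors for prosodic-boundary position and stress. The phoneme classes and all numeric values are invented for illustration; a real model would learn such effects from data (e.g. with decision trees):

```python
# Toy contextual duration model: base duration per phoneme class, scaled
# by contextual factors (phrase-final lengthening, stress).  All numbers
# are illustrative, not trained values.

BASE_MS = {"vowel": 120, "consonant": 60}
FACTORS = {"phrase_final": 1.4, "stressed": 1.2}

def predict_duration(phone_class, phrase_final=False, stressed=False):
    dur = BASE_MS[phone_class]
    if phrase_final:
        dur *= FACTORS["phrase_final"]   # boundary lengthening
    if stressed:
        dur *= FACTORS["stressed"]       # stress lengthening
    return round(dur)
```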
However, prosody parameter prediction is constrained in many ways: besides the problems of the models themselves, it suffers from limited training data and from the accuracy of front-end text analysis. Although various prosody parameter prediction techniques attempt to account for the phenomena of coarticulation, the prosodic rhythm of synthesized speech still cannot achieve the cadence of natural speech.
The key defect of the current speech synthesis systems described above is that they attend to local detail while neglecting the whole, producing sentences that sound flat and lifeless. This has prevented current systems from being widely adopted in markets such as audio e-books.
The current audio e-book market relies on real people reading aloud. It is difficult to find a reader whose timbre is beautiful and who also reads with expressive, well-controlled cadence, and hiring a professional announcer to do the recording is certain to be expensive.
In summary, a traditional speech synthesis system first analyzes the text (word segmentation, part-of-speech tagging, handling of digits and symbols, phonetic notation, prosodic structure analysis, etc.), then predicts prosody parameters such as duration, fundamental frequency, and energy on that basis, then, guided by these parameters, searches a pre-recorded speech library for the globally optimal synthesis units, and finally concatenates them with waveform adjustment. Because many unavoidable problems remain in text analysis and prosody prediction, traditional systems cannot accurately grasp the content of the text or the prediction of its prosody parameters, cannot properly control the connections between speech synthesis units, and cannot properly control the cadence of the synthesized speech, so users ultimately cannot obtain satisfactory speech. For wide applications such as audio e-book production, which require both low cost and synthesized speech files with beautiful timbre and natural prosody, a speech synthesis system capable of producing highly natural synthesized speech is urgently needed.
Some literature on research in this area is listed below and is hereby incorporated by reference as if fully set forth herein.
[1] Meron, Joram, US Patent publication No. 6,829,581, July 31, 2001, "Method for prosody generation by unit selection from an imitation speech database";
[2] Baraff, David R., US Patent publication No. 6,795,807, August 17, 2000, "Method and means for creating prosody in speech regeneration for laryngectomees";
[3] Holm, Frode, Hata, Kazue, US Patent publication No. 6,260,016, November 25, 1998, "Speech synthesis employing prosody templates";
[4] Holm, Frode, Hata, Kazue, US Patent publication No. 6,185,533, March 15, 1999, "Generation and synthesis of prosody templates";
[5] Shih, C.L., "The Prosodic Domain of Tone Sandhi in Mandarin Chinese", PhD Dissertation, UC San Diego, 1986;
[6] Chu, M. and Qian, Y., "Locating boundaries for prosodic constituents in unrestricted Mandarin texts", Journal of Computational Linguistics and Chinese Language Processing, 6(1), 61-82, 2001;
[7] E. Moulines and F. Charpentier, "Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones", Speech Communication, 9:453-467, 1990;
[8] Qing Guo, Nobuyuki Katae, "Statistical Prosody Generation in Mandarin TTS System", OCOCOSDA 2007;
[9] Qing Guo, Nobuyuki Katae, Hao Yu, Hitoshi Iwamida, "Decision Tree based Duration Prediction in Mandarin TTS System", Journal of Chinese Language and Computing;
[10] Guo Qing, Nobuyuki Katae, Yu Hao, Hitoshi Iwamida, "High Quality Prosody Generation in a Text-to-speech System", Journal of Chinese Information Processing, Vol. 22 No. 2: 110-115, 2008;
[11] Qing Guo, Jie Zhang, Nobuyuki Katae, "Prosodic Word Grouping with Global Probability Estimation Method", Speech Prosody, 2008.
Summary of the Invention
The present invention has been made in view of the above problems of traditional speech synthesis technology. Its object is to provide a speech synthesis system based on prosodic reference that neatly sidesteps problems that are hard to solve in traditional speech synthesis, such as word segmentation, semantic analysis, and prosody parameter prediction in text analysis, and that achieves good synthesized speech whose prosodic rhythm is very close to natural speech while retaining a beautiful timbre. With the present invention, arbitrary speech files whose cadence is very close to natural speech can be generated according to the user's requirements. By combining a standard, sweet-voiced reading with an expressive reading whose timbre or pronunciation may be imperfect, a sweet-voiced and well-cadenced synthesized speech work can be produced from a reading by anyone. The production cost of audio e-books can thus be reduced substantially, and audio e-books sharing the same voice characteristics and rich prosodic rhythm can be mass-produced.
To achieve the above object, according to a first aspect of the present invention, there is provided a speech synthesis apparatus that performs speech synthesis based on prosodic reference, comprising: a prosody parameter acquisition unit, which obtains natural prosody parameters or approximate natural prosody parameters by analyzing either a recording file of the text to be synthesized obtained by having a natural person read that text aloud, or a prosody parameter annotation file obtained by annotating the text to be synthesized with prosody parameters according to a predetermined annotation standard; and a speech generation unit, which, taking the natural or approximate natural prosody parameters as a reference, selects corresponding speech synthesis units from a pre-recorded speech library for the text to be synthesized and concatenates those units to produce a synthesized speech file corresponding to that text.
According to a second aspect of the present invention, there is provided the speech synthesis apparatus of the first aspect, wherein the speech generation unit comprises: a speech unit selection unit, which, taking the natural or approximate natural prosody parameters as a reference, selects from the pre-recorded speech library the speech synthesis units that are globally optimal for the text to be synthesized; and a waveform generation unit, which, based on the natural or approximate natural prosody parameters, concatenates the speech synthesis units selected by the speech unit selection unit and performs waveform adjustment on the resulting speech file, so as to obtain a highly natural synthesized speech file corresponding to the text to be synthesized.
According to a third aspect of the present invention, there is provided the speech synthesis apparatus of the second aspect, wherein the prosody parameter acquisition unit comprises: a recording unit, which obtains a recording file of the text to be synthesized by having a natural person read that text aloud; and a prosody parameter extraction unit, which obtains prosody parameters, including duration, pitch, and energy, from the waveform data of the recording file.
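The analysis performed by such a prosody parameter extraction unit could, for example, estimate pitch by autocorrelation and energy by mean squared amplitude. The single-frame sketch below runs on a synthetic tone rather than a real recording; a practical system would analyze a reader's recording frame by frame and also derive durations from segment boundaries:

```python
import math

# Sketch of prosody-parameter extraction from waveform data:
# F0 via a brute-force autocorrelation peak search, energy via
# mean squared amplitude.  Demonstrated on a synthetic 200 Hz tone.

def estimate_f0(samples, sr, fmin=80.0, fmax=400.0):
    lo, hi = int(sr / fmax), int(sr / fmin)
    best_lag, best_r = lo, float("-inf")
    for lag in range(lo, hi + 1):
        r = sum(samples[i] * samples[i - lag] for i in range(lag, len(samples)))
        if r > best_r:
            best_r, best_lag = r, lag
    return sr / best_lag

def frame_energy(samples):
    return sum(s * s for s in samples) / len(samples)

sr = 8000
tone = [math.sin(2 * math.pi * 200.0 * n / sr) for n in range(800)]
```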
According to a fourth aspect of the present invention, there is provided the speech synthesis apparatus of the second aspect, wherein the prosody parameter acquisition unit comprises: a prosody parameter annotation unit, which annotates the text to be synthesized with prosody parameters according to the predetermined annotation standard, with reference to a knowledge base defining prosody parameter annotation rules, so as to obtain a prosody parameter annotation file; and a prosody parameter generation unit, which analyzes the prosody parameter annotation file to obtain approximate prosody parameters, including duration, pitch, and energy.
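As an illustration of this annotation-file path, the sketch below defines a hypothetical line-oriented annotation format (one syllable per line with duration, pitch, energy, and an optional boundary field) and a parser that recovers approximate prosody parameters from it. The format itself is an assumption; this aspect only requires that annotation follow some predetermined standard:

```python
# Hypothetical prosody-annotation format and a parser that turns it into
# per-syllable approximate prosody parameters (duration, pitch, energy).

SAMPLE = """\
ni3   dur=210 f0=230 energy=0.8
hao3  dur=260 f0=190 energy=0.7 boundary=phrase
"""

def parse_annotation(text):
    units = []
    for line in text.splitlines():
        if not line.strip():
            continue
        syllable, *fields = line.split()
        unit = {"syllable": syllable, "boundary": None}
        for field in fields:
            key, value = field.split("=")
            unit[key] = value if key == "boundary" else float(value)
        units.append(unit)
    return units
```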
According to a fifth aspect of the present invention, there is provided the speech synthesis apparatus of the first aspect, wherein the recording file of the text to be synthesized is obtained by way of electronic cartoon pronunciation.
According to a sixth aspect of the present invention, there is provided a method for speech synthesis based on prosodic reference, comprising the following steps: a prosody parameter acquisition step of obtaining natural prosody parameters or approximate natural prosody parameters by analyzing either a recording file of the text to be synthesized obtained by having a natural person read that text aloud, or a prosody parameter annotation file obtained by annotating the text to be synthesized with prosody parameters according to a predetermined annotation standard; and a speech generation step of, taking the natural or approximate natural prosody parameters as a reference, selecting corresponding speech synthesis units from a pre-recorded speech library for the text to be synthesized and concatenating those units to produce a synthesized speech file corresponding to that text.
According to a seventh aspect of the present invention, there is provided the method of the sixth aspect, wherein the speech generation step comprises: taking the natural or approximate natural prosody parameters as a reference, selecting from the pre-recorded speech library the speech synthesis units that are globally optimal for the text to be synthesized; and, based on the natural or approximate natural prosody parameters, concatenating the selected speech synthesis units and performing waveform adjustment on the resulting speech file, so as to obtain a highly natural synthesized speech file corresponding to the text to be synthesized.
According to an eighth aspect of the present invention, there is provided the method of the seventh aspect, wherein the prosody parameter acquisition step comprises: obtaining a recording file of the text to be synthesized by having a natural person read that text aloud; and obtaining prosody parameters, including duration, pitch, and energy, from the waveform data of the recording file.
According to a ninth aspect of the present invention, there is provided the method of the seventh aspect, wherein the prosody parameter acquisition step comprises: annotating the text to be synthesized with prosody parameters according to the predetermined annotation standard, with reference to a knowledge base defining prosody parameter annotation rules, so as to obtain a prosody parameter annotation file; and analyzing the prosody parameter annotation file to obtain approximate prosody parameters, including duration, pitch, and energy.
According to a tenth aspect of the present invention, there is provided the method of the sixth aspect, wherein the recording file of the text to be synthesized is obtained by way of electronic cartoon pronunciation.
According to an eleventh aspect of the present invention, there is provided a computer program comprising computer instruction code which, when the program is loaded onto a computer and the computer executes the instruction code included in it, implements the method for speech synthesis based on prosodic reference according to any one of the sixth to tenth aspects of the present invention described above.
According to a twelfth aspect of the present invention, there is provided a computer-readable recording medium carrying the computer program of the eleventh aspect, which can be read by a computer so as to load the computer program onto the computer and have the computer execute the instruction code included in the program, thereby implementing the method for speech synthesis based on prosodic reference according to any one of the sixth to tenth aspects of the present invention described above.
With the speech synthesis apparatus and method of the above aspects of the present invention, and with the computer program implementing the method and the computer-readable recording medium carrying that program, arbitrary synthesized speech whose cadence is very close to natural speech can be generated according to the user's requirements, thereby improving the naturalness of synthesized speech.
Both the foregoing summary and the following detailed description are exemplary descriptions of the present invention, not limitations of it. After reading the disclosure of this application, those skilled in the art will be able to conceive of various other embodiments based on the spirit of the present invention. Any such embodiment should clearly fall within the scope of the present invention as long as it includes the following technical features: obtaining natural or approximate natural prosody parameters by analyzing either a recording file produced by an actual person reading the text to be synthesized aloud or obtained in another way (for example, by electronic cartoon pronunciation), or a prosody parameter annotation file produced by annotating the text to be synthesized according to a predetermined annotation standard with reference to a knowledge base of prosody parameter annotation rules; and using those natural or approximate prosody parameters, with reference to a pre-recorded speech library, to synthesize a highly natural synthesized speech file. The protection scope of the present invention is specifically defined by the claims.
Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Brief Description of the Drawings
The accompanying drawings, which are included to provide a further understanding of the present invention and constitute a part of the specification, illustrate preferred embodiments of the present invention and serve, together with the written description, to explain the principles of the present invention. The same elements are always denoted by the same reference numerals. In the drawings:
Fig. 1 is a block diagram illustrating a configuration example of a speech synthesis system according to a first embodiment of the present invention;
Fig. 2 is a block diagram illustrating a configuration example of the natural prosody parameter acquisition unit in the speech synthesis system according to the first embodiment of the present invention;
Fig. 3 is a block diagram illustrating a configuration example of the speech generation unit in the speech synthesis system according to the first embodiment of the present invention;
Fig. 4 is a block diagram illustrating a configuration example of a speech synthesis system according to a second embodiment of the present invention;
Fig. 5 is a block diagram illustrating a configuration example of the approximate natural prosody parameter acquisition unit in the speech synthesis system according to the second embodiment of the present invention; and
Fig. 6 is a diagram showing an example of a prosody parameter annotation rule knowledge base.
Detailed Description
These and other aspects of the invention will become apparent from the following description and drawings, which specifically disclose certain embodiments illustrating some of the ways in which the principles of the invention may be practiced. It should be understood, however, that the scope of the invention is not limited thereby; on the contrary, the invention includes all changes, modifications, and equivalents falling within the spirit and scope of the appended claims.
A feature described and/or illustrated for one embodiment may be used in the same or a similar way in one or more other embodiments, and/or may be combined with or substituted for features of other embodiments.
It should be emphasized that the word "comprising", when used in this specification, indicates the presence of the stated features, integers, steps, or components, but does not exclude the presence or addition of one or more other features, integers, steps, components, or combinations thereof.
Embodiments of the present invention are described in detail by way of example with reference to Figs. 1 to 6.
Fig. 1 is a block diagram illustrating an example configuration of a speech synthesis system according to the first embodiment of the present invention. The figure also shows the workflow of the system, that is, the flow of signals among its components.
The speech synthesis system according to the first embodiment comprises a text-to-be-synthesized providing section 101, a natural prosodic parameter acquisition section 102, a natural prosodic parameter storage section 103, a phoneme list storage section 104, a speech library 105, a voice generation section 106, and a synthesized speech storage section 107.
The text-to-be-synthesized providing section 101 provides arbitrary text for speech synthesis. It may also preprocess the text to remove any defects and errors, making it well-formed and accurate so that a good recording file can be produced from it.
Based on the text provided by the text-to-be-synthesized providing section 101, the natural prosodic parameter acquisition section 102 generates natural prosodic parameters and a phoneme list corresponding to that text and stores them in the natural prosodic parameter storage section 103 and the phoneme list storage section 104, respectively.
Natural prosodic parameters are the acoustic prosodic parameters associated with each phoneme in speech, generally including duration, fundamental frequency (F0), and energy. Duration, for example, describes the length of time a phoneme occupies in the speech flow.
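As an illustration, a per-phoneme record of these parameters might be represented as in the following Python sketch (the class and field names are hypothetical, not taken from the patent):

```python
from dataclasses import dataclass

@dataclass
class PhonemeProsody:
    phoneme: str        # phoneme label, e.g. a pinyin initial or final
    duration_ms: float  # time the phoneme occupies in the speech flow
    f0_hz: float        # fundamental frequency (pitch)
    energy: float       # relative intensity

# one phoneme of an utterance
p = PhonemeProsody(phoneme="sh", duration_ms=85.0, f0_hz=210.0, energy=0.6)
print(p.duration_ms)  # → 85.0
```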
A phoneme list is a sequence obtained through the linguistic and phonetic analysis performed at the front end of a speech synthesis system; it typically corresponds to one text sentence. For Chinese, after front-end processing it includes information such as the Chinese characters, word segmentation, part-of-speech tags, Hanyu Pinyin (syllables and semi-syllables), and prosodic boundary levels.
Below is one possible example of a phoneme list. It contains word segmentation, pinyin, part-of-speech, and prosodic structure information.
有(you3)/v一(yi1)/m次(ci4)/q|||我们(wo3 men5)/r||和(he2)/p|外(wai4)/f校(xiao4)/ng||搞(gao3)/v联谊(lian2 yi4)/v|爬(pa2)/v香山(xiang1 shan1)/ns|||我们(wo3 men5)/r的(de5)/u|学生(xue2 sheng1)/n||没有(mei2 you3)/v一个(yi2 ge4)/m|掉队(diao4 dui4)/v的(de5)/u|||噌噌噌(ceng1 ceng1 ceng1)/o||就(jiu4)/d爬(pa2)/v上(shang4)/v了(le5)/u|山顶(shan1 ding3)/n@。 (The sentence reads, roughly: "Once, we had a get-together with another school and climbed Xiangshan. Not one of our students fell behind; ceng-ceng-ceng, they climbed straight to the top.")
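Such an annotated line can be turned back into (word, pinyin, part-of-speech) triples with a small parser, assuming the format shown above (this parser is illustrative, not part of the patent):

```python
import re

def parse_phoneme_list(line):
    """Extract (word, pinyin syllables, part-of-speech) triples from an
    annotated line, skipping the boundary marks |, ||, ||| and the
    sentence-end mark @."""
    triples = []
    for m in re.finditer(r"([^()|@/]+)\(([^)]+)\)/(\w+)", line):
        word, pinyin, pos = m.groups()
        triples.append((word, pinyin.split(), pos))
    return triples

print(parse_phoneme_list("我们(wo3 men5)/r||和(he2)/p"))
# → [('我们', ['wo3', 'men5'], 'r'), ('和', ['he2'], 'p')]
```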
Because written Chinese places no marks between words, lexical analysis, that is, word segmentation and part-of-speech tagging, is the first problem to be solved before any further processing. Automatic segmentation is generally dictionary-based; the main methods in current use include forward maximum matching, backward maximum matching, language-model methods, hidden Markov models, and maximum entropy models. In the example above, the character or characters before each "/" form a word, and the letter after the "/" gives its part of speech: for instance, 有 ("have") is a verb ("v") and 我们 ("we") is a pronoun ("r").
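The forward maximum matching method mentioned above can be sketched in a few lines (the toy dictionary here is purely illustrative):

```python
def fmm_segment(text, dictionary, max_word_len=4):
    """Forward maximum matching: at each position, take the longest
    dictionary word starting there; fall back to a single character."""
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if n == 1 or candidate in dictionary:  # a lone char is always accepted
                words.append(candidate)
                i += n
                break
    return words

vocab = {"我们", "香山", "山顶"}
print(fmm_segment("我们爬香山", vocab))  # → ['我们', '爬', '香山']
```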
Here "|", "||", and "|||" mark prosodic word, prosodic phrase, and intonation phrase boundaries, respectively, and "@" marks the end of the sentence. Thus 有一次 ("once"), 我们和 ("we and"), 外校 ("another school"), 搞联谊 ("have a get-together"), 爬香山 ("climb Xiangshan"), and so on are prosodic words; further, 我们和外校 and 搞联谊爬香山 are prosodic phrases, and 我们和外校搞联谊爬香山 is an intonation phrase.
The remaining information is pinyin: for example, "you3" is the pinyin of the word 有 and "wo3 men5" is the pinyin of the word 我们.
Based on the natural prosodic parameters stored in the natural prosodic parameter storage section 103 and the phoneme list stored in the phoneme list storage section 104, the voice generation section 106 selects from the pre-recorded speech library 105 a number of speech units corresponding to the text to be synthesized, performs waveform concatenation and adjustment on the selected units to generate the final synthesized speech file, and stores that file in the synthesized speech storage section 107.
The speech library 105 may be one used by a general-purpose speech synthesis system, and the voice generation section 106 may be the back-end module of such a system. The natural prosodic parameter storage section 103, the phoneme list storage section 104, and the synthesized speech storage section 107 may each be any readable and writable storage device commonly used in computer systems, such as RAM, flash memory, a hard disk, a rewritable optical disc, or a magneto-optical disc, or a dedicated storage server. Although these storage sections are described here as separate components, in practice they may share a single physical storage device.
Fig. 2 is a block diagram illustrating a specific example configuration of the natural prosodic parameter acquisition section 102.
The natural prosodic parameter acquisition section 102 comprises a recording section 201, a natural prosodic parameter extraction section 202, and a phoneme list generation section 203.
The recording section 201 generates a recording file from the text provided by the text-to-be-synthesized providing section 101. A real person may be asked to read the text aloud with expressive, emotionally appropriate intonation suited to its content, or the text may be read in a voice styled like a typical electronic cartoon character or in some other special style, and the reading is recorded to form the recording file. Alternatively, recorded material corresponding to or containing the text may be located in an existing recording library and edited as needed to obtain the required file. The recording section 201 can be implemented in any of the various ways fully described in the prior art, which are not repeated here.
The natural prosodic parameter extraction section 202 then applies digital signal processing to the waveform data of the recording file provided by the recording section 201 to extract the prosodic parameters, such as duration, fundamental frequency, and energy, that correspond to the specific content (word or phrase units) of the text.
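As one concrete piece of such signal processing, short-time energy can be computed per analysis frame roughly as follows (a sketch only; extracting duration and fundamental frequency, e.g. by autocorrelation, requires more machinery):

```python
import numpy as np

def frame_energy(samples, frame_len=400, hop=160):
    """Short-time energy of a waveform: the sum of squared samples in
    each (possibly overlapping) analysis frame."""
    n_frames = 1 + max(0, len(samples) - frame_len) // hop
    return np.array([
        np.sum(samples[i * hop : i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])

# 50 ms of a constant-amplitude signal at 16 kHz: every frame has equal energy
e = frame_energy(np.ones(800))
print(e)  # → [400. 400. 400.]
```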
Drawing on the analysis results of the prosodic parameter extraction section 202, the phoneme list generation section 203 analyzes the text to be synthesized to obtain a phoneme list, i.e., the series of phonemes making up the continuous speech flow of the text. Since automatically generated phoneme lists can have problems, for example with polyphonic characters, manual verification may be added to the processing of the phoneme list generation section 203 to ensure that the phoneme list is correct for the text.
The natural prosodic parameters extracted by the natural prosodic parameter extraction section 202 and the phoneme list generated by the phoneme list generation section 203 are stored in the natural prosodic parameter storage section 103 and the phoneme list storage section 104, respectively.
Fig. 3 illustrates an example configuration of the voice generation section 106.
The voice generation section 106 comprises a unit selection section 301 and a waveform generation section 302. Taking the natural prosodic parameters stored in the natural prosodic parameter storage section 103 as a reference, the unit selection section 301 searches the speech units stored in the speech library 105 for a globally optimal list of speech synthesis units matching the phoneme list stored in the phoneme list storage section 104.
Global optimality is explained as follows.
Given a sentence consisting of N phonemes, the speech library can always supply, for each phoneme, the sample whose prosodic parameters are most similar to the target parameters as its synthesis unit. For the sentence as a whole, however, simply concatenating the N units found this way is not optimal: besides making each unit's prosodic parameters as similar as possible to the target parameters, one must also account for the loss of sound quality caused by spectral discontinuities at the joins between adjacent units. A unit selection strategy that takes both factors into account is what we call globally optimal unit selection.
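A minimal dynamic-programming sketch of this globally optimal selection, with scalars standing in for speech units and hypothetical cost functions (a real system would score prosodic distance and spectral join distortion):

```python
def select_units(candidates, target_cost, concat_cost):
    """`candidates[i]` holds the candidate units for phoneme i. Return
    the unit sequence minimizing total target cost (prosody mismatch)
    plus concatenation cost (mismatch at joins), via a Viterbi search."""
    # best paths ending at each candidate of the first phoneme
    best = [(target_cost(u), [u]) for u in candidates[0]]
    for units in candidates[1:]:
        best = [
            min((cost + concat_cost(path[-1], u) + target_cost(u), path + [u])
                for cost, path in best)
            for u in units
        ]
    return min(best)[1]

# units as pitch values; target pitch 5, joins penalize pitch jumps
chosen = select_units([[1, 5], [4, 9]],
                      target_cost=lambda u: abs(u - 5),
                      concat_cost=lambda a, b: abs(a - b))
print(chosen)  # → [5, 4]
```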
The waveform generation section 302 then concatenates the list of speech synthesis units selected by the unit selection section 301 and, with reference to the natural prosodic parameters in the natural prosodic parameter storage section 103, adjusts the waveform of the concatenated result to obtain a synthesized speech file with high naturalness, which it stores in the synthesized speech storage section 107.
According to the first embodiment of the present invention, speech material with standard pronunciation and a pleasing timbre from the speech library can be combined with natural prosodic parameters obtained by analyzing a recording file (made by a real person reading aloud or by other means) to synthesize speech that has both a pleasing timbre and natural prosody.
The speech synthesis system according to the second embodiment of the present invention is described below with reference to Figs. 4 to 6. It differs from the first embodiment in that approximate natural prosodic parameters are obtained by annotating the text to be synthesized with prosodic parameters and then analyzing the annotated file.
As shown in Fig. 4, the speech synthesis system according to the second embodiment comprises the text-to-be-synthesized providing section 101, an approximate natural prosodic parameter acquisition section 401, an approximate natural prosodic parameter storage section 402, the phoneme list storage section 104, the speech library 105, the voice generation section 106, and the synthesized speech storage section 107. Sections 101, 104, 105, 106, and 107 are the same as in the first embodiment and are not described again.
The approximate natural prosodic parameter acquisition section 401 annotates the text provided by the text-to-be-synthesized providing section 101 with reference to predetermined prosodic parameter annotation rules, then analyzes the annotated text to obtain approximate natural prosodic parameters, which are stored in the approximate natural prosodic parameter storage section 402.
Fig. 5 illustrates an example configuration of the approximate natural prosodic parameter acquisition section 401.
The approximate natural prosodic parameter acquisition section 401 comprises a prosodic parameter annotation rule knowledge base 501, a prosodic parameter annotation section 502, a prosodic parameter annotation file storage section 503, a prosodic parameter generation section 504, and a phoneme list generation section 505.
The prosodic parameter annotation rule knowledge base 501 stores data on the various rules for annotating prosodic parameters. Fig. 6 shows one example of the data structure of these rules: at minimum, each rule stores a prosodic parameter name (e.g., duration, pitch, speed), an annotation symbol (e.g., P, D, S), and an annotation range (e.g., 0 to 9).
The prosodic parameter annotation section 502 annotates the text provided by the text-to-be-synthesized providing section 101 with prosodic parameters, following the rules stored in the prosodic parameter annotation rule knowledge base 501, and stores the annotated file in the prosodic parameter annotation file storage section 503.
In practice, prosodic annotation rules can be combined with automatic prosodic parameter prediction. First, an automatic prediction module predicts prosodic parameters for each phoneme in the input phoneme list. On that basis, a human can then revise some of the parameters as the application requires, for example lengthening certain phonemes or stressing them to suit the emotional context; lengthening is controlled with the symbol D and stress with the symbol P. For example, if "D5" means keeping the original duration, then "D6", "D7", "D8", and "D9" might lengthen the duration by 10%, 20%, 30%, and 40%, respectively, while "D4", "D3", "D2", "D1", and "D0" shorten (speed up) it by 10%, 20%, 30%, 40%, and 50%, respectively.
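Under the example convention just described (one convention among many; the patent does not fix these numbers), a D symbol maps to a duration scale factor like this:

```python
def duration_scale(symbol):
    """Map 'D0'..'D9' to a duration scale factor: D5 keeps the predicted
    duration; each step above/below D5 lengthens/shortens it by 10%."""
    if len(symbol) != 2 or symbol[0] != "D" or not symbol[1].isdigit():
        raise ValueError(f"not a duration annotation symbol: {symbol!r}")
    return round(1.0 + (int(symbol[1]) - 5) * 0.1, 1)

print(duration_scale("D5"))  # → 1.0
print(duration_scale("D9"))  # → 1.4
print(duration_scale("D0"))  # → 0.5
```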
The prosodic parameter generation section 504 analyzes the prosodic parameter annotation file stored in the prosodic parameter annotation file storage section 503 to obtain quantified prosodic parameters, which it stores in the approximate natural prosodic parameter storage section 402. Although these parameters are not fully equivalent to the natural prosodic parameters obtained in the first embodiment, they still give good naturalness.
The phoneme list generation section 505 parses the prosodic parameter annotation file stored in the prosodic parameter annotation file storage section 503 to generate the corresponding phoneme list, which it stores in the phoneme list storage section 104.
Finally, as described for the first embodiment, the voice generation section 106 takes the approximate natural prosodic parameters stored in the approximate natural prosodic parameter storage section 402 as a reference, searches the speech units stored in the speech library 105 for a globally optimal list of speech synthesis units matching the phoneme list stored in the phoneme list storage section 104, concatenates those units, and adjusts the waveform of the result with reference to the approximate natural prosodic parameters to obtain a synthesized speech file with improved naturalness.
According to the second embodiment, not only is the naturalness of the synthesized speech improved over prior-art speech synthesis systems, but, unlike in the first embodiment, no reading by a real person is required, which improves efficiency and reduces cost.
It should also be noted that although the natural prosodic parameter storage section 103, the phoneme list storage section 104, the synthesized speech storage section 107, the approximate natural prosodic parameter storage section 402, and so on have been described separately above, these components may in fact share the same physical storage device. The speech library 105 and the prosodic parameter annotation rule knowledge base 501 may use dedicated storage devices, or may occupy part of a general-purpose storage device.
The speech synthesis system of the present invention can be implemented by a general-purpose computer system running suitable computer programs. It is not limited to this, however: the individual components may also be implemented as dedicated electronic devices (for example, firmware) and integrated into a complete speech synthesis system.
The present invention provides a speech synthesis system capable of improving the naturalness of synthesized speech. With it, synthesized speech can be generated to a user's requirements with a pleasing timbre and a cadence very close to that of natural speech.
By combining a standard, sweet-voiced reading with an expressive reading by someone whose timbre or pronunciation is less than ideal, a reading by anyone can be turned into a synthesized speech work that is both sweet-voiced and well-cadenced. This can greatly reduce the production cost of audio electronic books and allow rhythmically expressive audio books with a consistent voice to be produced in volume.
With the speech synthesis system of the present invention, a person can keep producing his or her original voice even after that voice has changed. An announcer with a cold, for example, can use the system to broadcast in a voice little different from before, and a woman can continue to produce her younger voice in old age.
Although only preferred embodiments have been chosen to illustrate the present invention, those skilled in the art can readily make various changes and modifications based on this disclosure without departing from the scope of the invention defined by the appended claims. The description of the above embodiments is illustrative only and does not limit the invention as defined by the appended claims and their equivalents.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200810166002A CN101685633A (en) | 2008-09-28 | 2008-09-28 | Voice synthesizing apparatus and method based on rhythm reference |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101685633A true CN101685633A (en) | 2010-03-31 |
Family
ID=42048753
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C12 | Rejection of a patent application after its publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20100331 |