CN102881285B

CN102881285B - A kind of method of prosodic labeling and prosodic labeling equipment

Info

Publication number: CN102881285B
Application number: CN201110204284.2A
Authority: CN
Inventors: 张波; 孟遥; 于浩
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2011-07-15
Filing date: 2011-07-15
Publication date: 2015-10-21
Anticipated expiration: 2031-07-15
Also published as: CN102881285A

Abstract

The embodiment of the present invention discloses a method of prosodic labeling and a special labeling device, the method includes: a receiving module, used to receive the audio data identification to be labeled, the audio data identification includes the batch and/or data of the audio data The entry number; the playback module, used to play the corresponding audio data to the tagger according to the audio data identification; the record module, used to record the tagging process with the tagger when it is detected that the tagger has triggered the tagging process Information related to the labeling behavior of the labeler; a generating module configured to generate audio labeling information of the audio data according to the information related to the labeling behavior of the labeler. Through the embodiments of the present invention, the labeled audio annotation data can be made more accurate, and then the accuracy or fluency of speech synthesis using the audio annotation data can meet actual requirements. The special labeling device provided by the embodiments of the present invention is also more suitable for blind people.

Description

A prosodic labeling method and prosodic labeling device

技术领域technical field

本发明一般地涉及语音数据处理技术领域，尤其是一种韵律标注的方法及专用标注设备。The present invention generally relates to the technical field of speech data processing, in particular to a prosodic labeling method and special labeling equipment.

背景技术Background technique

带有韵律标注的声音资源库，是语音识别或者从文本到语音(Text ToSpeech，TTS)领域不可或缺的知识训练源。The sound resource library with prosodic annotation is an indispensable source of knowledge training in the field of speech recognition or text to speech (Text To Speech, TTS).

目前现有技术在进行韵律标注时，有一种是利用生语料和标点符号信息生成统计概率模型，再根据该模型进行韵律标注；但是因为生成的统计概率模型不够准确，所以进行韵律标注的结果也不够精确；还有一种现有技术将用户的真实声音作为训练数据，根据发声规则和统计出的音素长度来生成规则韵律信息；但是用户会出现疲累的状态，这样就会使得生成的规则韵律信息不够通用。At present, when performing prosodic labeling in existing technologies, one method is to use raw corpus and punctuation information to generate a statistical probability model, and then perform prosodic labeling based on the model; however, because the generated statistical probability model is not accurate enough, the result of prosodic labeling is also poor. Not accurate enough; there is another existing technology that uses the user's real voice as training data to generate regular prosodic information according to the vocalization rules and the statistical phoneme length; but the user will appear tired, which will make the generated regular prosodic information Not generic enough.

总之，采用现有技术进行韵律标注都不能有效生成准确的音频标注信息，进而也使得语音合成的准确度或者流畅程度都不满足实际需求。In short, prosodic labeling using existing technologies cannot effectively generate accurate audio labeling information, and furthermore, the accuracy or fluency of speech synthesis does not meet actual needs.

发明内容Contents of the invention

有鉴于此，本发明实施例提供了一种韵律标注的方法及专用标注设备，能够方便得生成准确的音频标注信息，进而也使得语音合成的准确度或者流畅程度都能够满足实际需求。In view of this, the embodiment of the present invention provides a prosodic labeling method and a special labeling device, which can conveniently generate accurate audio labeling information, and then make the accuracy or fluency of speech synthesis meet actual needs.

根据本发明实施例的一个方面，提供一种韵律标注设备，包括：接收模块，用于接收待标注的音频数据标识，所述音频数据标识包括音频数据的批次和/或数据条目号；播放模块，用于依据所述音频数据标识向标注者播放相对应的音频数据；记录模块，用于在检测到所述标注者触发了标注过程的情况下，记录与所述标注者的标注行为相关的信息；生成模块，用于根据所述与标注者的标注行为相关的信息生成所述音频数据的音频标注信息，所述与标注者的标注行为相关的信息具体为触发时间点和停顿时长信息。According to an aspect of an embodiment of the present invention, there is provided a prosodic tagging device, including: a receiving module, configured to receive an audio data identifier to be tagged, where the audio data identifier includes a batch and/or data entry number of the audio data; A module for playing corresponding audio data to the tagger according to the audio data identifier; a recording module for recording the tagging behavior related to the tagger when it is detected that the tagger triggers the tagging process information; a generating module, configured to generate the audio annotation information of the audio data according to the information related to the tagging behavior of the tagger, and the information related to the tagging behavior of the tagger is specifically trigger time point and pause duration information .

根据本发明实施例的另一个方面，提供一种韵律标注的方法，包括：接收待标注的音频数据标识，所述音频数据标识包括音频数据的批次和/或数据条目号；依据所述音频数据标识向标注者播放相对应的音频数据；在检测到所述标注者触发了标注过程的情况下，记录与标注者的标注行为相关的信息；根据所述与标注者的标注行为相关的信息生成所述音频数据的音频标注信息，所述与标注者的标注行为相关的信息具体为触发时间点和停顿时长信息。According to another aspect of the embodiments of the present invention, there is provided a method for prosodic labeling, including: receiving an audio data identifier to be labeled, the audio data identifier including the batch and/or data entry number of the audio data; The data identifier plays corresponding audio data to the annotator; when it is detected that the annotator triggers the annotation process, record the information related to the annotator’s annotating behavior; according to the information related to the annotator’s annotating behavior The audio annotation information of the audio data is generated, and the information related to the annotation behavior of the annotator is specifically trigger time point and pause duration information.

另外，根据本发明实施例的另一方面，还提供了一种存储介质。所述存储介质包括机器可读的程序代码，当在信息处理设备上执行所述程序代码时，所述程序代码使得所述信息处理设备执行根据本发明的上述一种韵律标注的方法。In addition, according to another aspect of the embodiments of the present invention, a storage medium is also provided. The storage medium includes machine-readable program code. When the program code is executed on the information processing device, the program code causes the information processing device to execute the above-mentioned prosodic tagging method according to the present invention.

此外，根据本发明实施例的再一方面，还提供了一种程序产品。所述程序产品包括机器可执行的指令，当在信息处理设备上执行所述指令时，所述指令使得所述信息处理设备执行根据本发明的上述一种韵律标注的方法。In addition, according to yet another aspect of the embodiments of the present invention, a program product is also provided. The program product includes machine-executable instructions, and when the instructions are executed on the information processing device, the instructions cause the information processing device to execute the above-mentioned prosodic tagging method according to the present invention.

根据本发明实施例的上述方法，可以通过多次向标注者播放音频数据的方式，能够在标注者熟悉音频数据之后，再触发音频数据的标注过程，并且通过重复标注的方式，可以使得采用本实施例得到的音频标注数据更准确，进而也使得采用音频标注数据进行语音合成的准确度或者流畅程度都能够满足实际需求。并且，还能够对某一个标注者所标注的所有音频标注信息进行可信权重处理，这样就能进一步评估音频标注信息的精确度和准确性，从而为后续的语音合成等应用打下基础。According to the above-mentioned method of the embodiment of the present invention, by playing the audio data to the annotator multiple times, the audio data annotation process can be triggered after the annotator is familiar with the audio data, and by repeating the annotation, it is possible to use this The audio annotation data obtained in the embodiment is more accurate, and thus the accuracy or fluency of speech synthesis using the audio annotation data can meet actual requirements. Moreover, it is also possible to perform credible weight processing on all audio annotation information marked by a certain annotator, so that the accuracy and accuracy of the audio annotation information can be further evaluated, thereby laying the foundation for subsequent applications such as speech synthesis.

在下面的说明书部分中给出本发明实施例的其他方面，其中，详细说明用于充分地公开本发明实施例的优选实施例，而不对其施加限定。Further aspects of the embodiments of the present invention are given in the description section below, wherein the detailed description serves to fully disclose preferred embodiments of the embodiments of the present invention without imposing limitations thereon.

附图说明Description of drawings

下面结合具体的实施例，并参照附图，对本发明实施例的上述和其它目的和优点做进一步的描述。在附图中，相同的或对应的技术特征或部件将采用相同或对应的附图标记来表示。The above and other objectives and advantages of the embodiments of the present invention will be further described below in conjunction with specific embodiments and with reference to the accompanying drawings. In the drawings, the same or corresponding technical features or components will be indicated by the same or corresponding reference numerals.

图1是示出作为本发明实施例提供的第一种韵律标注的方法流程图；Fig. 1 is a flow chart showing the first method of prosodic labeling provided as an embodiment of the present invention;

图2是示出作为第一种方法实施例中步骤S102的流程图；Fig. 2 is a flowchart showing step S102 as the first method embodiment;

图3是示出作为第一种方法实施例中步骤S103的流程图；Fig. 3 is a flowchart showing step S103 as the first method embodiment;

图4是示出作为本发明实施例提供的第二种韵律标注的方法流程图；Fig. 4 is a flow chart showing a second prosodic tagging method provided as an embodiment of the present invention;

图5是示出作为本发明实施例提供的第三种韵律标注的方法流程图；Fig. 5 is a flow chart showing a third prosodic tagging method provided as an embodiment of the present invention;

图6是示出作为第三种方法实施例中步骤S506的流程图；FIG. 6 is a flow chart showing step S506 as a third method embodiment;

图7是示出作为第三种方法实施例中步骤S507的流程图；FIG. 7 is a flow chart showing step S507 as a third method embodiment;

图8是示出作为本发明实施例提供的韵律标注的装置的示意图；Fig. 8 is a schematic diagram showing a prosodic tagging device provided as an embodiment of the present invention;

图9是示出作为韵律装置实施例中专用标注设备的界面的示意图；Fig. 9 is a schematic diagram showing an interface as a dedicated labeling device in the prosody device embodiment;

图10是示出作为韵律装置实施例中记录模块603的示意图；Fig. 10 is a schematic diagram showing the recording module 603 in the prosodic device embodiment;

图11是示出作为本发明的实施例中所采用的信息处理设备的个人计算机的示例性结构的框图。FIG. 11 is a block diagram showing an exemplary structure of a personal computer as an information processing device employed in an embodiment of the present invention.

具体实施方式Detailed ways

下面参照附图来说明本发明的实施例。Embodiments of the present invention will be described below with reference to the drawings.

具体的，参见图1，本发明实施例提供的第一种韵律标注的方法可以包括：Specifically, referring to FIG. 1, the first method of prosodic labeling provided by the embodiment of the present invention may include:

S101：接收待标注的音频数据标识，所述音频数据标识包括音频数据的批次和/或数据条目号。S101: Receive an audio data identifier to be marked, where the audio data identifier includes a batch and/or data entry number of the audio data.

在本发明实施例中，待标注的音频数据标识可以采用音频数据的批次或者数据条目号来表示，这样就可以通过选择需要标注的批次和/或数据条目号来确定出待标注的音频数据。In the embodiment of the present invention, the audio data identification to be marked can be represented by the batch or data entry number of the audio data, so that the audio data to be marked can be determined by selecting the batch and/or data entry number to be marked. data.

其中，待标注的音频数据可以使用无线上网模块从互联网进行下载，或者可以使用USB模块从本地下载，待标注的音频数据的获取方式可以有多种，本发明实施例中对此不做限定。Wherein, the audio data to be marked can be downloaded from the Internet by using the wireless Internet access module, or can be downloaded from the local area by using the USB module. There are many ways to obtain the audio data to be marked, which is not limited in the embodiment of the present invention.

S102：依据所述音频数据标识向标注者播放相对应的音频数据。S102: Play corresponding audio data to the tagger according to the audio data identifier.

当选择音频数据标识之后，就可以确定出待标注的音频数据，此时，再向标注者播放选择的音频数据。需要说明的是，这里的标注者可以由自然人来实现，也可以采用具有标注功能的实体来实现。After the audio data identification is selected, the audio data to be marked can be determined, and at this time, the selected audio data is played to the marker. It should be noted that the annotator here can be realized by a natural person, or by an entity with an annotation function.

具体的，参考图2所示，所述步骤S102在实际应用中可以包括：Specifically, as shown in FIG. 2, the step S102 may include in practical applications:

S201：依据所述音频数据标识向标注者第一次播放相对应的音频数据。S201: Play the corresponding audio data to the annotator for the first time according to the audio data identifier.

在播放音频数据过程中，可以采取反复播放的方式，这样可以提高标注的准确度。因此首先向标注者第一次播放相对应的音频数据。In the process of playing the audio data, the way of playing repeatedly can be adopted, which can improve the accuracy of labeling. The corresponding audio data is therefore first played to the annotator for the first time.

S202：在停顿第一预定时间段之后，向标注者第二次播放所述音频数据。S202: After pausing for a first predetermined period of time, play the audio data to the annotator for the second time.

所述第一预定时间段可以设置为两秒，这样就可以给标注者一个适应时间，提高标注的注意力。The first predetermined time period can be set to two seconds, so that the labeler can be given a period of time to get used to it, and the attention of labeling can be improved.

S203：在停顿第二预定时间段之后，向标注者第三次播放所述音频数据。S203: After pausing for a second predetermined period of time, play the audio data to the annotator for the third time.

所述第二预定时间段可以设置为三秒，其与第一预定时间段起到相同的作用。The second predetermined time period may be set to three seconds, which has the same effect as the first predetermined time period.

回到图1，在步骤S103：在检测到所述标注者触发了标注过程的情况下，记录与标注者的标注行为相关的信息。Returning to FIG. 1 , in step S103 : if it is detected that the annotator triggers the annotating process, record information related to the annotator's annotating behavior.

在向标注者播放两次待标注的音频数据的情况下，检测标注者是否触发了标注过程，如果标注者触发了标注过程，则记录与标注者的标注行为相关的信息，所述与标注者的标注行为相关的信息具体可以为触发时间点和停顿时长信息，即是标注者触发标注按键的时间点和每一次触发的时长。在标注者没有触发标注过程的情况下，则不做任何其他的相关处理，继续播放所述音频数据，直至标注者触发了标注过程再开始记录标注信息，例如初始触发时间点和初始停顿时长信息等等。In the case of playing the audio data to be labeled twice to the labeler, it is detected whether the labeler has triggered the labeling process, and if the labeler has triggered the labeling process, the information related to the labeling behavior of the labeler is recorded, and the labeler The information related to the labeling behavior can specifically be the trigger time point and pause duration information, that is, the time point when the labeler triggers the label button and the duration of each trigger. If the annotator does not trigger the annotation process, no other related processing is performed, and the audio data continues to be played until the annotator triggers the annotation process before recording annotation information, such as initial trigger time point and initial pause duration information etc.

具体的，参考图3所示，所述步骤S103在实际应用中可以包括：Specifically, as shown in FIG. 3, the step S103 may include in practical applications:

S301：在第三次播放所述音频数据的过程中，检测标注者是否触发了标注过程，如果是，则进入步骤S302，如果否，则进入步骤S304。S301: In the process of playing the audio data for the third time, detect whether the tagger triggers the tagging process, if yes, go to step S302, if not, go to step S304.

需要说明的是，检测标注者是否触发了标注过程，可以通过检测标注者是否触发标注按键来实现，具体的介绍可以参考下一个实施例。It should be noted that detecting whether the tagger triggers the tagging process can be achieved by detecting whether the tagger triggers the tagging button, and the specific introduction can refer to the next embodiment.

S302：记录初始触发时间点和初始停顿时长信息。S302: Record information about an initial trigger time point and an initial pause duration.

记录初始触发时间点和初始停顿时长信息，需要说明的是，在S302中记录的初始触发时间点和初始停顿时长信息可以认为是模拟过程，因为在第四次播放音频数据的过程中记录的触发时间点和停顿时长信息作为最终的音频标注数据。Record the initial trigger time point and initial pause duration information. It should be noted that the initial trigger time point and initial pause duration information recorded in S302 can be considered as an analog process, because the trigger recorded during the fourth playback of audio data The time point and pause duration information are used as the final audio annotation data.

S303：第三次播放音频数据完毕时，停顿第三预定时间段，并在向标注者第四次播放所述音频数据的过程中，记录最终触发时间点和最终停顿时长信息。S303: When the audio data is played for the third time, pause for a third predetermined period of time, and record the information of the final trigger time point and the final pause duration during the process of playing the audio data to the annotator for the fourth time.

所述第三预定时间段具体可以设置为一秒。需要说明的是，在S303中记录的最终触发时间点和最终停顿时长信息也可以与S302中的初始触发时间点和初始停顿时长信息进行比较，或者取其平均值作为最终的音频标注数据都是可行的，具体可以根据实际情况或者用户需求适应性调整。Specifically, the third predetermined time period may be set as one second. It should be noted that the final trigger time point and final pause duration information recorded in S303 can also be compared with the initial trigger time point and initial pause duration information in S302, or the average value thereof can be taken as the final audio annotation data. It is feasible, and can be adaptively adjusted according to the actual situation or user needs.

S304：不做任何其他的相关处理，继续播放所述音频数据。S304: Continue playing the audio data without performing any other related processing.

回到图1，在步骤S104：根据所述与标注者的标注行为相关的信息生成所述音频数据的音频标注信息。Returning to FIG. 1 , in step S104 : generating audio annotation information of the audio data according to the information related to the annotation behavior of the annotator.

在本实施例中，可以采用将在第四次播放所述音频数据过程中记录的所述最终触发时间点和最终停顿时长信息作为音频标注信息的方式。In this embodiment, the final trigger time point and the final pause duration information recorded during the fourth playing of the audio data may be used as audio annotation information.

需要说明的是，在本申请实施例中提及的预定时间段，都可以根据不同的实际需求进行适应性调整，本申请实施例就不再进行限定。It should be noted that the predetermined time periods mentioned in the embodiments of the present application can be adjusted adaptively according to different actual needs, and are not limited in the embodiments of the present application.

通过上述第一种韵律标注的方法，可以通过多次向标注者播放音频数据的方式，能够在标注者熟悉音频数据之后，再触发音频数据的标注过程，并且通过重复标注的方式，可以使得采用本实施例得到的音频标注数据更准确，进而也使得采用音频标注数据进行语音合成的准确度或者流畅程度都能够满足实际需求。Through the above-mentioned first method of prosodic labeling, by playing the audio data to the labeler multiple times, the labeling process of the audio data can be triggered after the labeler is familiar with the audio data, and by repeated labeling, it can be used. The audio annotation data obtained in this embodiment is more accurate, and thus the accuracy or fluency of speech synthesis using the audio annotation data can meet actual requirements.

具体的，参见图4，本发明实施例提供了另一种韵律标注的方法，可以包括：Specifically, referring to FIG. 4, the embodiment of the present invention provides another method for prosodic labeling, which may include:

S401：通过专用标注设备的界面接收待标注的音频数据标识，所述音频数据标识包括音频数据的批次和/或数据条目号。S401: Receive an audio data identifier to be annotated through an interface of a dedicated annotating device, where the audio data identifier includes a batch and/or data entry number of the audio data.

在本实施例中，具体可以通过专用设备的界面来接收待标注的音频数据标识，所述音频数据标识也可以包括音频数据的批次和/或数据条目号。In this embodiment, specifically, the audio data identifier to be marked may be received through an interface of a special device, and the audio data identifier may also include a batch and/or data entry number of the audio data.

S402：依据所述音频数据标识向标注者播放相对应的音频数据。S402: Play corresponding audio data to the tagger according to the audio data identifier.

本步骤与第一种韵律标注的方法实施例类似，在此不再赘述。This step is similar to the first method embodiment of prosodic labeling, and will not be repeated here.

S403：通过检测标注者是否触发所述专用标注设备的界面上的标注按钮来检测是否触发了标注过程。S403: Detect whether the tagging process is triggered by detecting whether the tagger triggers the tagging button on the interface of the dedicated tagging device.

在本实施例中，具体通过检测标注者是否触发了专用标注设备的界面上的标注按钮来检测是否触发了标注过程。如果能够接收到标注者在标注按钮上的触发信息，则认为标注者已经触发了标注过程，这样就可以通过检测标注按钮的触发信息来检测是否开始进行标注。In this embodiment, whether the tagging process is triggered is detected specifically by detecting whether the tagger triggers the tagging button on the interface of the special tagging device. If the trigger information of the labeler on the label button can be received, it is considered that the labeler has triggered the labeling process, so whether to start labeling can be detected by detecting the trigger information of the label button.

S404：在检测到所述标注者触发了标注过程的情况下，记录与标注者的标注行为相关的信息。S404: Record information related to the tagging behavior of the tagger when it is detected that the tagging process is triggered by the tagger.

S405：根据所述与标注者的标注行为相关的信息生成所述音频数据的音频标注信息。S405: Generate audio annotation information of the audio data according to the information related to the annotation behavior of the annotator.

S406：将所述记录的音频标注信息保存为可以用于网络传输的数据格式。S406: Save the recorded audio annotation information in a data format that can be used for network transmission.

在生成所述音频数据的音频标注信息之后，还将所述音频标注信息保存为可以用于网络传输的数据格式，例如可扩展标记语言(XML，Extensible Markup Language)格式的文件，可以将音频标注信息保存在存储器中，每条声音数据可以只保留最新的标注记录。After generating the audio annotation information of the audio data, the audio annotation information is also saved as a data format that can be used for network transmission, such as a file in Extensible Markup Language (XML, Extensible Markup Language) format, and the audio annotation can be The information is stored in the memory, and each piece of sound data can only keep the latest annotation record.

通过本实施例中，如果待标注的音频数据有更新，则可以通过网络下载或者本地下载重复执行标注流程，最后保存的音频标注信息可以通过USB接口导出或者通过网络上传到服务端。本实施例除了能够方便得生成准确的音频标注信息，进而也使得语音合成的准确度或者流畅程度都能够满足实际需求之外，还可以方便得通过检测标注按钮的触发信息来监控是否需要进行标注，以及能够方便的在网络上实现音频标注信息的共享和发布。Through this embodiment, if the audio data to be marked is updated, the marking process can be repeated through network download or local download, and the final saved audio marking information can be exported through the USB interface or uploaded to the server through the network. This embodiment can conveniently generate accurate audio annotation information, and then make the accuracy or fluency of speech synthesis meet actual needs, and can also conveniently monitor whether annotation is required by detecting the trigger information of the annotation button , and can conveniently realize the sharing and publishing of audio annotation information on the network.

具体的，参见图5，本发明实施例提供了第三种韵律标注的方法，可以包括：Specifically, referring to FIG. 5, the embodiment of the present invention provides a third method for prosodic labeling, which may include:

S501：接收待标注的音频数据标识，所述音频数据标识包括音频数据的批次和/或数据条目号。S501: Receive an audio data identifier to be marked, where the audio data identifier includes a batch and/or data entry number of the audio data.

S502：依据所述音频数据标识向标注者播放相对应的音频数据。S502: Play corresponding audio data to the tagger according to the audio data identifier.

S503：在检测到所述标注者触发了标注过程的情况下，记录与标注者的标注行为相关的信息。S503: Record information related to the tagging behavior of the tagger when it is detected that the tagging process is triggered by the tagger.

S504：根据所述与标注者的标注行为相关的信息生成所述音频数据的音频标注信息。S504: Generate audio annotation information of the audio data according to the information related to the annotation behavior of the annotator.

S505：从任一标注者的音频标注信息集合中抽取出至少两个音频标注信息样本。S505: Extract at least two audio annotation information samples from the audio annotation information set of any annotator.

在本实施例中，在生成音频标注信息之后，针对任一个标注者，还可以对其所有的音频标注信息进行可信权重处理，用来检测音频标注信息的可信程度。首先需要从任一标注者的音频标注信息集合中抽取出至少两个音频标注信息样本。In this embodiment, after the audio annotation information is generated, for any annotator, credible weight processing may be performed on all the audio annotation information to detect the credibility of the audio annotation information. Firstly, at least two audio annotation information samples need to be extracted from any annotator's audio annotation information set.

S506：依据所述至少两个音频标注信息样本获取所述任一标注者的标准基准时长。S506: Obtain a standard reference duration of any annotator according to the at least two audio annotation information samples.

在本步骤中需要依据抽取出的至少两个音频标注信息样本计算该任一标注者的标准基准时长。In this step, it is necessary to calculate the standard reference duration of any annotator based on at least two extracted audio annotation information samples.

参考图6所示，具体的，所述步骤S506在实际应用中可以包括：Referring to FIG. 6, specifically, the step S506 may include in practical applications:

S601：获取每一个音频标注信息样本中子标注时长的最小值。S601: Obtain a minimum value of sub-label duration in each audio label information sample.

在实际应用中，假设抽取出N条音频标注信息样本，而每条标注信息则有M个子标注时长，则本步骤分别获取N个子标注时长集合的最小值Min(T₁，T₂...T_N)，获取N个最小时长信息。In practical applications, assuming that N pieces of audio annotation information samples are extracted, and each annotation information has M sub-annotation durations, this step obtains the minimum value Min(T ₁ , T ₂ ... T _N ), to obtain N minimum hour length information.

S602：依据获取到的最小值计算每条音频标注信息中子标注时长的标准差值。S602: Calculate the standard deviation value of the sub-annotation duration in each piece of audio annotation information according to the obtained minimum value.

依据所述N个最小的子标注时长信息，计算子标注时长的标准差值E。在本步骤中计算标准差值可以采用现有的计算公式，在此不再一一列举额。Calculate the standard deviation E of the sub-tagging duration according to the N smallest sub-tagging duration information. In this step, an existing calculation formula can be used to calculate the standard deviation value, and the amounts will not be listed here.

S603：将最小的标准差值所对应的音频标注信息中最小的子标注时长作为所述任一标注者的标准基准时长。S603: Use the smallest sub-annotation duration in the audio annotation information corresponding to the smallest standard deviation value as the standard reference duration of any annotator.

因为计算出的标准差值有N个，所以将N个标准差值中最小的那个标准差值Min(E)所对应的音频标注信息中最小的子标注时长作为所述任一标注者的标准基准时长Pi。Because there are N calculated standard deviation values, the minimum sub-tagging duration in the audio annotation information corresponding to the smallest standard deviation value Min(E) among the N standard deviation values is used as the standard for any tagger Baseline duration Pi.

S507：利用所述标准基准时长对所述任一标注者的音频标注信息集合进行可信权重处理。S507: Perform credible weight processing on the audio annotation information set of any annotator by using the standard reference duration.

在步骤S603得到标注者的标注基准时长之后，利用标准基准时长对该标注者的音频标注信息集合进行可信权重处理。参考图7，所述步骤S507具体可以包括：After the annotation reference duration of the annotator is obtained in step S603, credible weight processing is performed on the audio annotation information set of the annotator by using the standard reference duration. Referring to FIG. 7, the step S507 may specifically include:

S701：依据该标注者的前N条音频标注信息的N个最小子标注时长，计算所述N个最小子标注时长与所述基准标注时长的标准差；其中，N为大于1的自然数。S701: According to the N minimum sub-tagging durations of the first N pieces of audio tagging information of the tagger, calculate the standard deviation between the N smallest sub-tagging durations and the reference tagging duration; wherein, N is a natural number greater than 1.

在本步骤中，每个用户有一个标准基准时长P_i，再根据每个用户标注N条音频数据的N条最小子标注时长T_i计算该N条最小子标注时长的标准差，计算公式如下：In this step, each user has a standard reference duration P _i , and then calculates the standard deviation of the N minimum sub-label durations T _i according to the N minimum sub-label durations T i of each user’s N pieces of audio data. The calculation formula is as follows :

$F f = = \sqrt[22]{{Σ Σ}_{11}^{n no} {(({T T}_{i i} - - {P P}_{i i}))}^{22}}$

S702：将所述标准差作为可信度量参数对该任一标注者的音频标注信息进行可信权重处理，其中，所述标准差的值越大，则所述音频标注信息的可信度越低。S702: Use the standard deviation as a credibility measurement parameter to perform credible weight processing on the audio annotation information of any annotator, wherein the larger the value of the standard deviation, the higher the credibility of the audio annotation information. Low.

将计算得到的标准差F作为可信度量参数对该任一标注者的音频标注信息进行可信权重处理，其中，所述标准差的值越大，说明音频标注信息之间的差异越大，即是有可能标注者出现了疲劳情况或者其他导致音频标注信息比较凌乱的客观情况，则说明所述音频标注信息的可信度越低。The calculated standard deviation F is used as a credibility measurement parameter to perform credible weight processing on the audio annotation information of any annotator, wherein the larger the value of the standard deviation, the greater the difference between the audio annotation information, That is, if the annotator may experience fatigue or other objective conditions that cause the audio annotation information to be relatively messy, it means that the reliability of the audio annotation information is lower.

总之，根据本发明实施例公开的第三种韵律标注的方法，除了能够准确的对音频数据进行标注之外，还能够对某一个标注者所标注的所有音频标注信息进行可信权重处理，这样就能进一步评估音频标注信息的精确度和准确性，从而为后续的语音合成等应用打下基础。In short, according to the third prosodic labeling method disclosed in the embodiment of the present invention, in addition to accurately labeling audio data, it can also perform credible weight processing on all audio labeling information marked by a certain labeler, so that The accuracy and accuracy of the audio annotation information can be further evaluated, thus laying the foundation for subsequent applications such as speech synthesis.

与本发明实施例提供的韵律标注的方法相对应，本发明实施例还提供了一种韵律标注的装置，参见图8，该装置可以包括接收模块801，播放模块802和记录模块803。下面将详细描述各模块的操作。Corresponding to the prosodic tagging method provided by the embodiment of the present invention, the embodiment of the present invention also provides a prosodic tagging device, as shown in FIG. 8 , the device may include a receiving module 801 , a playing module 802 and a recording module 803 . The operation of each module will be described in detail below.

接收模块801，用于接收待标注的音频数据标识，所述音频数据标识包括音频数据的批次和/或数据条目号。The receiving module 801 is configured to receive an audio data identifier to be marked, where the audio data identifier includes a batch and/or data entry number of the audio data.

其中，在实际应用中，所述接收模块801具体可以配置为：通过专用标注设备的界面接收待标注的音频数据标识。参考图9所示，为所述专用标注设备的一个界面示意图。Wherein, in a practical application, the receiving module 801 may specifically be configured to: receive an audio data identifier to be marked through an interface of a special marking device. Referring to FIG. 9 , it is a schematic diagram of an interface of the dedicated labeling device.

在图9中，“开关”按钮用于控制所述专用标注设备的开和关，“音量”按钮用于控制播放音频数据的音量，图9中的圆形数字按钮可以用于方便得选择音频数据的批次号码和数据条目号；“下载”和“上传”按钮可以用于音频数据的接收和提交音频数据；“批次”和“序号”按钮可以用于选择不同的音频数据批次和相应的数据条目；播放控制按钮“上一条”、“下一条”、“重复”和“自动”用于控制播放上一条音频数据、下一条音频数据、重复一边播放音频数据或自动顺序播放音频数据。“标记按钮”可以根据标注者触发的按键记录打下标记，并根据标注者按下时间的长短给予不同的信息，即是触发时间点和触发时长信息。其中，专用标注设备的按钮可以采用电容式按钮，还可以采用标注者触摸时自动发音的方式，从而也能方便盲人识别按钮。In Fig. 9, the "switch" button is used to control the opening and closing of the special marking device, the "volume" button is used to control the volume of the audio data played, and the circular number buttons in Fig. 9 can be used to conveniently select audio Data batch number and data entry number; "Download" and "Upload" buttons can be used to receive and submit audio data; "Batch" and "Serial Number" buttons can be used to select different audio data batches and Corresponding data entry; the playback control buttons "Previous", "Next", "Repeat" and "Auto" are used to control the playback of the previous audio data, the next audio data, repeat while playing audio data or automatically play audio data sequentially . The "mark button" can be marked according to the record of keystrokes triggered by the annotator, and give different information according to the length of time the annotator presses, that is, the trigger time point and trigger duration information. Among them, the buttons of the special labeling equipment can adopt capacitive buttons, and can also adopt the mode of automatic pronunciation when the labeler touches, so that the buttons can also be easily recognized by the blind.

需要说明的是，所述专用标注设备的界面的长宽比例可以设置为2∶1，其中，所述标注按钮在所述专用标注设备的界面上的位置采用黄金分割比设置，至少一个播放控制按钮位于所述标注按钮下方2-3厘米处呈正方形排列，该播放控制按钮用于控制所述音频数据的播放顺序；除所述播放控制按钮之外的其他按键设置在所述专用标注设备的界面的四周。It should be noted that the aspect ratio of the interface of the special labeling device can be set to 2:1, wherein the position of the label button on the interface of the special labeling device adopts the golden ratio setting, and at least one playback control The buttons are arranged in a square at 2-3 centimeters below the labeling button, and the playback control button is used to control the playback sequence of the audio data; other buttons except the playback control button are arranged on the special labeling device. around the interface.

播放模块802，用于依据所述音频数据标识向标注者播放相对应的音频数据。The playing module 802 is configured to play corresponding audio data to the tagger according to the audio data identification.

所述播放模块802具体可以配置为：依据所述音频数据标识向标注者第一次播放相对应的音频数据；在停顿第一预定时间段之后，向标注者第二次播放所述音频数据；以及在停顿第二预定时间段之后，向标注者第三次播放所述音频数据。The playing module 802 can be specifically configured to: play the corresponding audio data to the tagger for the first time according to the audio data identifier; play the audio data to the tagger for the second time after pausing for a first predetermined period of time; and playing the audio data to the annotator for a third time after pausing for a second predetermined period of time.

记录模块803，用于在检测到所述标注者触发了标注过程的情况下，记录与所述标注者的标注行为相关的信息。The recording module 803 is configured to record information related to the tagging behavior of the tagger when it is detected that the tagger triggers the tagging process.

参考图10所示，所述记录模块803在实际应用中具体可以包括：Referring to Figure 10, the recording module 803 may specifically include in practical applications:

检测子模块1001，用于在第三次播放所述音频数据时，检测标注者是否触发了标注过程。The detection sub-module 1001 is configured to detect whether the tagger triggers the tagging process when the audio data is played for the third time.

其中，所述检测子模块1001具体可以配置为：通过检测标注者是否触发所述专用标注设备的界面上的标注按钮来检测是否触发了标注过程。Wherein, the detection sub-module 1001 may be specifically configured to: detect whether the labeling process is triggered by detecting whether the labeler triggers the labeling button on the interface of the special labeling device.

第一记录子模块1002，用于在所述检测子模块的结果为是时，记录初始触发时间点和初始停顿时长信息。The first recording submodule 1002 is configured to record the information of the initial trigger time point and the initial pause duration when the result of the detection submodule is yes.

第二记录子模块1003，用于在第三次播放音频数据完毕时，停顿第三预定时间段，并在向标注者第四次播放所述音频数据的过程中，记录最终初始触发时间点和停顿时长信息。The second recording sub-module 1003 is used to stop the third predetermined period of time when the audio data is played for the third time, and record the final initial trigger time point and Pause duration information.

生成模块804，用于根据所述与标注者的标注行为相关的信息生成所述音频数据的音频标注信息。A generating module 804, configured to generate audio annotation information of the audio data according to the information related to the annotation behavior of the annotator.

其中，所述生成模块804具体可以配置为：将在第四次播放所述音频数据过程中记录的所述最终触发时间点和最终停顿时长信息作为音频标注信息。Wherein, the generating module 804 may be specifically configured to: use the final trigger time point and final pause duration information recorded during the fourth playing of the audio data as audio annotation information.

通过本发明实施例提供的上述装置，可以通过多次向标注者播放音频数据的方式，能够在标注者熟悉音频数据之后，再触发音频数据的标注过程，并且通过重复标注的方式，可以使得采用本实施例得到的音频标注数据更准确，进而也使得采用音频标注数据进行语音合成的准确度或者流畅程度都能够满足实际需求。Through the above-mentioned device provided by the embodiment of the present invention, by playing the audio data to the annotator multiple times, the audio data annotation process can be triggered after the annotator is familiar with the audio data, and by repeating the annotation, it is possible to use The audio annotation data obtained in this embodiment is more accurate, and thus the accuracy or fluency of speech synthesis using the audio annotation data can meet actual requirements.

另外，还应该指出的是，上述系列处理和装置也可以通过软件和/或固件实现。在通过软件和/或固件实现的情况下，从存储介质或网络向具有专用硬件结构的计算机，例如图11所示的通用个人计算机1100安装构成该软件的程序，该计算机在安装有各种程序时，能够执行各种功能等等。In addition, it should also be noted that the series of processes and devices described above may also be implemented by software and/or firmware. In the case of realization by software and/or firmware, a program constituting the software is installed from a storage medium or a network to a computer having a dedicated hardware configuration, such as a general-purpose personal computer 1100 shown in FIG. , can perform various functions and so on.

在图11中，中央处理单元(CPU)1101根据只读存储器(ROM)1102中存储的程序或从存储部分1108加载到随机存取存储器(RAM)1103的程序执行各种处理。在RAM 1103中，也根据需要存储当CPU 1101执行各种处理等等时所需的数据。In FIG. 11 , a central processing unit (CPU) 1101 executes various processes according to programs stored in a read only memory (ROM) 1102 or loaded from a storage section 1108 to a random access memory (RAM) 1103 . In the RAM 1103, data required when the CPU 1101 executes various processing and the like is also stored as necessary.

CPU 1101、ROM1102和RAM 1103经由总线1104彼此连接。输入/输出接口1105也连接到总线1104。The CPU 1101, ROM 1102, and RAM 1103 are connected to each other via a bus 1104. An input/output interface 1105 is also connected to the bus 1104 .

下述部件连接到输入/输出接口1105：输入部分1106，包括键盘、鼠标等等；输出部分1107，包括显示器，比如阴极射线管(CRT)、液晶显示器(LCD)等等，和扬声器等等；存储部分1108，包括硬盘等等；和通信部分1109，包括网络接口卡比如LAN卡、调制解调器等等。通信部分1109经由网络比如因特网执行通信处理。The following components are connected to the input/output interface 1105: an input section 1106 including a keyboard, a mouse, etc.; an output section 1107 including a display such as a cathode ray tube (CRT), a liquid crystal display (LCD), etc., and a speaker; The storage section 1108 includes a hard disk and the like; and the communication section 1109 includes a network interface card such as a LAN card, a modem, and the like. The communication section 1109 performs communication processing via a network such as the Internet.

根据需要，驱动器1110也连接到输入/输出接口1105。可拆卸介质1111比如磁盘、光盘、磁光盘、半导体存储器等等根据需要被安装在驱动器1110上，使得从中读出的计算机程序根据需要被安装到存储部分1108中。A driver 1110 is also connected to the input/output interface 1105 as needed. A removable medium 1111 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1110 as necessary, so that a computer program read therefrom is installed into the storage section 1108 as necessary.

在通过软件实现上述系列处理的情况下，从网络比如因特网或存储介质比如可拆卸介质1111安装构成软件的程序。In the case of realizing the above-described series of processes by software, the programs constituting the software are installed from a network such as the Internet or a storage medium such as the removable medium 1111 .

本领域的技术人员应当理解，这种存储介质不局限于图11所示的其中存储有程序、与设备相分离地分发以向用户提供程序的可拆卸介质1111。可拆卸介质1111的例子包含磁盘(包含软盘(注册商标))、光盘(包含光盘只读存储器(CD-ROM)和数字通用盘(DVD))、磁光盘(包含迷你盘(MD)(注册商标))和半导体存储器。或者，存储介质可以是ROM 1102、存储部分1108中包含的硬盘等等，其中存有程序，并且与包含它们的设备一起被分发给用户。Those skilled in the art should understand that such a storage medium is not limited to the removable medium 1111 shown in FIG. 11 in which the program is stored and distributed separately from the device to provide the program to the user. Examples of the removable media 1111 include magnetic disks (including floppy disks (registered trademark)), optical disks (including compact disk read only memory (CD-ROM) and digital versatile disks (DVD)), magneto-optical disks (including )) and semiconductor memory. Alternatively, the storage medium may be a ROM 1102, a hard disk contained in the storage section 1108, or the like, in which programs are stored and distributed to users together with devices containing them.

还需要指出的是，执行上述系列处理的步骤可以自然地按照说明的顺序按时间顺序执行，但是并不需要一定按照时间顺序执行。某些步骤可以并行或彼此独立地执行。It should also be pointed out that the steps for executing the above series of processes can naturally be executed in chronological order according to the illustrated order, but it does not need to be executed in chronological order. Certain steps may be performed in parallel or independently of each other.

关于包括以上实施例的实施方式，还公开下述附记：Regarding the implementation manner comprising the above embodiments, the following additional notes are also disclosed:

附记1、一种韵律标注的方法，包括：Note 1. A method for prosodic labeling, comprising:

接收待标注的音频数据标识，所述音频数据标识包括音频数据的批次和/或数据条目号；Receive the audio data identification to be marked, the audio data identification includes the batch and/or data entry number of the audio data;

依据所述音频数据标识向标注者播放相对应的音频数据；Play the corresponding audio data to the annotator according to the audio data identification;

在检测到所述标注者触发了标注过程的情况下，记录与标注者的标注行为相关的信息；When it is detected that the tagger has triggered the tagging process, recording information related to the tagging behavior of the tagger;

根据所述与标注者的标注行为相关的信息生成所述音频数据的音频标注信息。The audio annotation information of the audio data is generated according to the information related to the annotation behavior of the annotator.

附记2、根据附记1所述的方法，所述与标注者的标注行为相关的信息具体为触发时间点和停顿时长信息。Additional Note 2. According to the method described in Additional Note 1, the information related to the tagging behavior of the tagger is specifically trigger time point and pause duration information.

附记3、根据附记2所述的方法，所述依据所述音频数据标识向标注者播放相对应的音频数据，包括：Supplement 3. According to the method described in Supplement 2, playing the corresponding audio data to the labeler according to the audio data identification includes:

依据所述音频数据标识向标注者第一次播放相对应的音频数据；Playing the corresponding audio data to the annotator for the first time according to the audio data identification;

在停顿第一预定时间段之后，向标注者第二次播放所述音频数据；以及Playing the audio data a second time to the annotator after pausing for a first predetermined period of time; and

在停顿第二预定时间段之后，向标注者第三次播放所述音频数据。After a pause for a second predetermined period of time, the audio data is played a third time to the annotator.

附记4、根据附记3所述的方法，所述在检测到标注者触发了标注过程的情况下，记录与标注者的标注行为相关的信息的步骤，具体包括：Additional Note 4. According to the method described in Additional Note 3, the step of recording information related to the tagging behavior of the tagger when it is detected that the tagger has triggered the tagging process specifically includes:

在第三次播放所述音频数据的过程中，检测标注者是否触发了标注过程；如果是，则记录初始触发时间点和初始停顿时长信息；In the process of playing the audio data for the third time, it is detected whether the tagger has triggered the tagging process; if so, the initial trigger time point and the initial pause duration information are recorded;

第三次播放音频数据完毕时，停顿第三预定时间段，并在向标注者第四次播放所述音频数据的过程中，记录最终初始触发时间点和停顿时长信息。When the audio data is played for the third time, stop for a third predetermined period of time, and record the final initial trigger time point and pause duration information during the fourth time of playing the audio data to the annotator.

附记5、根据附记4所述的方法，所述第一预定时间段为两秒，所述第二预定时间段为三秒，所述第三预定时间段为一秒。Additional Note 5. According to the method described in Additional Note 4, the first predetermined time period is two seconds, the second predetermined time period is three seconds, and the third predetermined time period is one second.

附记6、根据附记4所述的方法，所述根据所述与标注者的标注行为相关的信息生成所述音频数据的音频标注信息的步骤，具体为：Additional Note 6. According to the method described in Additional Note 4, the step of generating the audio tagging information of the audio data according to the information related to the tagging behavior of the tagger is specifically:

将在第四次播放所述音频数据过程中记录的所述最终触发时间点和最终停顿时长信息作为音频标注信息。The final trigger time point and the final pause duration information recorded during the fourth playing of the audio data are used as audio annotation information.

附记7、根据附记1所述的方法，还包括：Supplement 7. The method according to Supplement 1, further comprising:

将所述记录的音频标注信息保存为可以用于网络传输的数据格式。The recorded audio annotation information is saved in a data format that can be used for network transmission.

附记8、根据附记1所述的方法，通过专用标注设备的界面接收待标注的音频数据标识。Supplement 8. According to the method described in Supplement 1, the audio data identification to be marked is received through the interface of the special marking device.

附记9、根据附记1所述的方法，通过检测标注者是否触发所述专用标注设备的界面上的标注按钮来检测是否触发了标注过程。Supplement 9. According to the method described in Supplement 1, whether the labeling process is triggered is detected by detecting whether the labeler triggers the labeling button on the interface of the dedicated labeling device.

附记10、根据附记1所述的方法，还包括：Supplement 10. The method according to Supplement 1, further comprising:

从任一标注者的音频标注信息集合中抽取出至少两个音频标注信息样本；extracting at least two samples of audio annotation information from any annotator's audio annotation information set;

依据所述至少两个音频标注信息样本获取所述任一标注者的标准基准时长；Acquiring the standard reference duration of any annotator according to the at least two audio annotation information samples;

利用所述标准基准时长对所述任一标注者的音频标注信息集合进行可信权重处理。The credible weighting process is performed on the audio annotation information set of any annotator by using the standard reference duration.

附记11、根据附记10所述的方法，所述依据所述至少两个音频标注信息样本获取所述任一标注者的标准基准时长的步骤，包括：Supplement 11. According to the method described in Supplement 10, the step of obtaining the standard reference duration of any one of the annotators based on the at least two audio annotation information samples includes:

获取每一个音频标注信息样本中子标注时长的最小值；Obtain the minimum value of the sub-label duration in each audio label information sample;

依据获取到的最小值计算每条音频标注信息中子标注时长的标准差值；Calculate the standard deviation of the sub-annotation duration in each piece of audio annotation information based on the obtained minimum value;

将最小的标准差值所对应的音频标注信息中最小的子标注时长作为所述任一标注者的标准基准时长。The smallest sub-annotation duration in the audio annotation information corresponding to the smallest standard deviation value is used as the standard reference duration of any annotator.

附记12、根据附记11所述的方法，所述可信权重处理的步骤，包括：Supplement 12. According to the method described in Supplement 11, the step of processing trusted weights includes:

依据该标注者的前N条音频标注信息的N个最小子标注时长，计算所述N个最小子标注时长与所述基准标注时长的标准差；其中，N为大于1的自然数；According to the N minimum sub-labeling durations of the first N pieces of audio labeling information of the labeler, calculate the standard deviation between the N minimum sub-labeling durations and the reference labeling duration; wherein, N is a natural number greater than 1;

将所述标准差作为可信度量参数对该任一标注者的音频标注信息进行可信权重处理，其中，所述标准差的值越大，则所述音频标注信息的可信度越低。The standard deviation is used as a credibility measurement parameter to carry out credible weight processing on the audio annotation information of any annotator, wherein the larger the value of the standard deviation is, the lower the credibility of the audio annotation information is.

附记13、一种专用标注设备，包括：Note 13. A special labeling device, including:

接收模块，用于接收待标注的音频数据标识，所述音频数据标识包括音频数据的批次和/或数据条目号；The receiving module is used to receive the audio data identification to be marked, and the audio data identification includes the batch and/or data entry number of the audio data;

播放模块，用于依据所述音频数据标识向标注者播放相对应的音频数据；A playback module, configured to play the corresponding audio data to the annotator according to the audio data identifier;

记录模块，用于在检测到所述标注者触发了标注过程的情况下，记录与所述标注者的标注行为相关的信息；A recording module, configured to record information related to the labeling behavior of the labeler when it is detected that the labeler has triggered the labeling process;

生成模块，用于根据所述与标注者的标注行为相关的信息生成所述音频数据的音频标注信息。A generating module, configured to generate audio annotation information of the audio data according to the information related to the annotation behavior of the annotator.

附记14、根据附记13所述的设备，所述与标注者的标注行为相关的信息具体为触发时间点和停顿时长信息。Supplement 14. According to the device described in Supplement 13, the information related to the marking behavior of the annotator is specifically trigger time point and pause duration information.

附记15、根据附记14所述的设备，所述播放模块具体配置为：Supplement 15. According to the device described in Supplement 14, the playback module is specifically configured as:

附记16、根据附记14所述的设备，所述记录模块包括：Supplement 16. The device according to Supplement 14, the recording module includes:

检测子模块，用于在第三次播放所述音频数据时，检测标注者是否触发了标注过程；The detection submodule is used to detect whether the tagger has triggered the tagging process when the audio data is played for the third time;

第一记录子模块，用于在所述检测子模块的结果为是时，记录初始触发时间点和初始停顿时长信息；The first recording submodule is used to record the initial trigger time point and initial pause duration information when the result of the detection submodule is yes;

第二记录子模块，用于在第三次播放音频数据完毕时，停顿第三预定时间段，并在向标注者第四次播放所述音频数据的过程中，记录最终初始触发时间点和停顿时长信息。The second recording submodule is used to pause for a third predetermined period of time when the audio data is played for the third time, and to record the final initial trigger time point and pause during the fourth playback of the audio data to the annotator Duration information.

附记17、根据附记14所述的设备，所述生成模块具体配置为：Supplement 17. According to the device described in Supplement 14, the generating module is specifically configured as:

附记18、根据附记13所述的设备，所述接收模块具体配置为：Supplement 18. According to the device described in Supplement 13, the specific configuration of the receiving module is:

通过专用标注设备的界面接收待标注的音频数据标识。The audio data identification to be marked is received through the interface of the special marking device.

附记19、根据附记16所述的设备，所述检测子模块具体配置为：Supplement 19. According to the device described in Supplement 16, the detection submodule is specifically configured as:

通过检测标注者是否触发所述专用标注设备的界面上的标注按钮来检测是否触发了标注过程。Whether the tagging process is triggered is detected by detecting whether the tagger triggers the tagging button on the interface of the special tagging device.

附记20、根据附记13所述的设备，所述专用标注设备的界面的长宽比例为2∶1，所述标注按钮在所述专用标注设备的界面上的位置采用黄金分割比设置，至少一个播放控制按钮位于所述标注按钮下方2-3厘米处呈正方形排列，该播放控制按钮用于控制所述音频数据的播放顺序；除所述播放控制按钮之外的其他按键设置在所述专用标注设备的界面的四周。Additional Note 20. According to the device described in Additional Note 13, the aspect ratio of the interface of the special labeling device is 2:1, and the position of the labeling button on the interface of the special labeling device is set by the golden ratio, At least one playback control button is arranged in a square at 2-3 centimeters below the label button, and the playback control button is used to control the playback sequence of the audio data; other buttons except the playback control button are arranged on the Around the interface of the dedicated labeling device.

虽然已经详细说明了本发明及其优点，但是应当理解在不脱离由所附的权利要求所限定的本发明的精神和范围的情况下可以进行各种改变、替代和变换。而且，本发明实施例的术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下，由语句“包括一个......”限定的要素，并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made hereto without departing from the spirit and scope of the invention as defined by the appended claims. Moreover, the terms "comprising", "comprising" or any other variation thereof in the embodiments of the present invention are intended to cover a non-exclusive inclusion such that a process, method, article or device comprising a series of elements includes not only those elements, but also Including other elements not expressly listed, or also including elements inherent in such process, method, article or apparatus. Without further limitations, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or apparatus comprising said element.

Claims

1. A prosodic labeling device, comprising:

The receiving module is used to receive the audio data identification to be marked, and the audio data identification includes the batch and/or data entry number of the audio data;

A playback module, configured to play the corresponding audio data to the annotator according to the audio data identifier;

A recording module, configured to record information related to the labeling behavior of the labeler when it is detected that the labeler has triggered the labeling process;

A generation module, configured to generate audio annotation information of the audio data according to the information related to the annotation behavior of the annotator,

The information related to the tagging behavior of the tagger is specifically trigger time point and pause duration information.

2. The prosody labeling device according to claim 1, the specific configuration of the playback module is:

Playing the corresponding audio data to the annotator for the first time according to the audio data identification;

Playing the audio data a second time to the annotator after pausing for a first predetermined period of time; and

After a pause for a second predetermined period of time, the audio data is played a third time to the annotator.

3. The prosody labeling device according to claim 1, the recording module comprising:

The detection submodule is used to detect whether the tagger has triggered the tagging process when the audio data is played for the third time;

The first recording submodule is used to record the initial trigger time point and initial pause duration information when the result of the detection submodule is yes;

The second recording submodule is used to pause for a third predetermined period of time when the audio data is played for the third time, and to record the final initial trigger time point and pause during the fourth playback of the audio data to the annotator Duration information.

4. The prosody labeling device according to claim 3, the specific configuration of the generating module is:

The final trigger time point and the final pause duration information recorded during the fourth playing of the audio data are used as audio annotation information.

5. The prosody labeling device according to claim 1, the receiving module is specifically configured as:

The audio data identification to be marked is received through the interface of the prosody marking device.

6. The prosody labeling device according to claim 3, the specific configuration of the detection submodule is:

Whether the tagging process is triggered is detected by detecting whether the tagger triggers the tagging button on the interface of the prosodic tagging device.

7. The prosody labeling device according to claim 1, the aspect ratio of the interface of the prosody labeling device is 2: 1, and the position of the label button on the interface of the prosody labeling device adopts the golden ratio setting, At least one playback control button is arranged in a square at 2-3 centimeters below the label button, and the playback control button is used to control the playback sequence of the audio data; other buttons except the playback control button are arranged on the Rhythm labels the perimeter of the device's interface.

8. A method for prosodic labeling, comprising:

Receive the audio data identification to be marked, the audio data identification includes the batch and/or data entry number of the audio data;

Play the corresponding audio data to the annotator according to the audio data identification;

When it is detected that the tagger has triggered the tagging process, recording information related to the tagging behavior of the tagger;

generating audio annotation information of the audio data according to the information related to the annotation behavior of the annotator,