TWI643184B

TWI643184B - Method and apparatus for speaker diarization

Info

Publication number: TWI643184B
Application number: TW106135243A
Authority: TW
Inventors: 王健宗; 郭卉; 肖京
Original assignee: 大陸商平安科技（深圳）有限公司
Priority date: 2016-12-19
Filing date: 2017-10-13
Publication date: 2018-12-01
Also published as: WO2018113243A1; CN106782507A; CN106782507B; TW201824250A

Abstract

本發明涉及一種語音分割的方法及裝置，所述語音分割的方法包括：自動應答系統在接收到終端發送的混合語音時，將所述混合語音分割成多個短語音段，並對各短語音段標注對應的說話人標識；利用時間遞歸神經網路對各說話人標識對應的短語音段建立聲紋模型，基於所述聲紋模型調整所述混合語音中對應的分割邊界，以分割出各說話人標識對應的有效語音段。本發明能夠有效提高語音分割的精度，特別是對於對話交替頻繁、以及有交疊的語音，語音分割的效果較好。 The invention relates to a method and a device for voice segmentation. The method for voice segmentation comprises: when receiving a mixed voice sent by a terminal, the automatic answering system divides the mixed voice into a plurality of phrase segments, and each phrase sound a segment identifier corresponding to the speaker identifier; using a time recurrent neural network to establish a voiceprint model for the phrase segment corresponding to each speaker identifier, and adjusting a corresponding segmentation boundary in the mixed voice based on the voiceprint model to segment each The valid voice segment corresponding to the speaker identification. The invention can effectively improve the precision of the speech segmentation, especially for the speech alternating frequently, and the overlapping speech, the speech segmentation effect is better.

Description

Method and device for voice segmentation

本發明涉及語音處理技術領域，尤其涉及一種語音分割的方法及裝置。 The present invention relates to the field of voice processing technologies, and in particular, to a method and apparatus for voice segmentation.

目前，呼叫中心接收到的語音很多都混雜有多人的語音，這時需要先對語音進行語音分割(speaker diarization)，才能進一步對目標語音進行語音分析。語音分割是指：在語音處理領域，當多個說話人的語音被合併錄在一個聲道中時，把信號中每個說話人的語音分別進行提取。傳統的語音分割技術是基於全域背景模型和高斯混合模型進行分割，由於技術的限制，這種語音分割的方法分割的精度並不高，特別是對於對話交替頻繁、以及有交疊的對話分割效果差。 At present, many voices received by the call center are mixed with voices of many people. In this case, speaker diarization is required to further perform voice analysis on the target voice. Speech segmentation means that in the field of speech processing, when the speeches of multiple speakers are combined and recorded in one channel, the speech of each speaker in the signal is separately extracted. The traditional speech segmentation technology is based on the global background model and the Gaussian mixture model. Due to the limitation of technology, the segmentation accuracy of this segmentation method is not high, especially for dialogues with frequent alternating and overlapping dialogue segmentation effects. difference.

本發明的目的在於提供一種語音分割的方法及裝置，旨在有效提高語音分割的精度。 It is an object of the present invention to provide a method and apparatus for speech segmentation, which aims to effectively improve the accuracy of speech segmentation.

為實現上述目的，本發明提供一種語音分割的方法，其特徵在於，所述語音分割的方法包括：S1，自動應答系統在接收到終端發送的混合語音時，將所述混合語音分割成多個短語音段，並對各短語音段標注對應的說話人標識；S2，利用時間遞歸神經網路對各說話人標識對應的短語音段建立聲紋模型，基於所述聲紋模型調整所述混合語音中對應的分割邊界，以分割出各說話人標識對應的有效語音段。 To achieve the above object, the present invention provides a method for voice segmentation, characterized in that: the method for voice segmentation comprises: S1, the automatic response system divides the mixed voice into multiples when receiving the mixed voice sent by the terminal a phrase segment, and marking a corresponding speaker identifier for each phrase segment; S2, establishing a voiceprint model for the phrase segment corresponding to each speaker identifier by using a time recurrent neural network, and adjusting the mixture based on the voiceprint model Corresponding segmentation boundaries in the speech to segment the valid speech segments corresponding to each speaker identifier.

在一實施例中，所述步驟S1包括：S11，獲取所述混合語音中的靜音段，去除所述混合語音中的靜音段，以根據所述靜音段對所述混合語音進行分割，得到分割後的長語音段； S12，對所述長語音段進行分幀，以提取每一長語音段的聲學特徵；S13，對每一長語音段的聲學特徵進行KL距離分析，根據KL距離分析結果對所述語音段進行切分，得到切分後的短語音段；S14，利用高斯混合模型對各短語音段進行語音聚類，並對同一語音類的短語音段標注對應的說話人標識。 In an embodiment, the step S1 includes: S11, acquiring a silent segment in the mixed voice, and removing a silent segment in the mixed voice, to segment the mixed voice according to the silent segment, and obtain a segmentation. a long speech segment; S12, framing the long speech segment to extract acoustic features of each long speech segment; S13, performing KL distance analysis on acoustic features of each long speech segment, according to KL distance analysis results Segmenting the speech segment to obtain a segmented phrase segment; S14, performing a speech clustering on each phrase segment by using a Gaussian mixture model, and marking a corresponding speaker identifier for the phrase segment of the same speech class.

在一實施例中，所述步驟S13包括：對每一長語音段的聲學特徵進行KL距離分析，對時長大於預設時間閾值的長語音段在KL距離的最大值處進行切分，得到切分後的短語音段。 In an embodiment, the step S13 includes: performing KL distance analysis on the acoustic features of each long speech segment, and segmenting the long speech segment whose duration is greater than the preset time threshold at a maximum value of the KL distance. The segmented phrase segment.

在一實施例中，所述步驟S2包括：S21，利用所述時間遞歸神經網路對各說話人標識對應的短語音段建立聲紋模型，基於所述聲紋模型提取表徵說話人身份特徵的預設類型向量；S22，基於所述預設類型向量計算每一語音幀屬對應的說話人的最大後驗機率；S23，基於所述最大後驗機率並利用預定算法調整該說話人的混合高斯模型；S24，基於調整後的混合高斯模型獲取每一語音幀對應的機率最大的說話人，並根據機率最大的說話人與語音幀的機率關係調整所述混合語音中對應的分割邊界；S25，疊代更新所述聲紋模型n次，每次更新所述聲紋模型時疊代m次所述混合高斯模型，以得到各說話人對應的有效語音段，n及m均為大於1的正整數。 In an embodiment, the step S2 includes: S21, using the time recurrent neural network to establish a voiceprint model for a phrase segment corresponding to each speaker identifier, and extracting a characterizing feature of the speaker based on the voiceprint model. a preset type vector; S22, calculating a maximum posterior probability of the speaker corresponding to each voice frame based on the preset type vector; S23, adjusting the mixed Gaussian of the speaker based on the maximum posterior probability and using a predetermined algorithm a model; S24, acquiring a speaker with the highest probability corresponding to each voice frame based on the adjusted mixed Gaussian model, and adjusting a corresponding segmentation boundary in the mixed voice according to a probability relationship between the speaker with the highest probability and the voice frame; S25, The voiceprint model is updated n times, and the mixed Gaussian model is iterated m times each time the voiceprint model is updated to obtain an effective voice segment corresponding to each speaker, and both n and m are greater than 1 Integer.

在一實施例中，所述步驟S2之後還包括：基於所述有效語音段獲取對應的應答內容，並將所述應答內容反饋給所述終端。 In an embodiment, after the step S2, the method further includes: acquiring a corresponding response content based on the valid voice segment, and feeding back the response content to the terminal.

為實現上述目的，本發明還提供一種語音分割的裝置，所述語音分割的裝置包括：分割模組，用於在接收到終端發送的混合語音時，將所述混合語音分割成多個短語音段，並對各短語音段標注對應的說話人標識；調整模組，用於利用時間遞歸神經網路對各說話人標識對應的短語音段建立聲紋模型，基於所述聲紋模型調整所述混合語音中對應的分割邊界，以分割出各說話人標識對應的有效語音段。 In order to achieve the above object, the present invention further provides a device for voice segmentation, the device for voice segmentation comprising: a segmentation module, configured to divide the mixed speech into a plurality of phrase sounds when receiving the mixed speech sent by the terminal And segmenting the corresponding speaker identifier for each phrase segment; the adjustment module is configured to establish a voiceprint model for the phrase segment corresponding to each speaker identifier by using a time recurrent neural network, and adjust the voiceprint model based on the voiceprint model The corresponding segmentation boundary in the mixed speech is segmented to segment the valid segment of speech corresponding to each speaker identifier.

在一實施例中，所述分割模組包括：去除單元，用於獲取所述混合語音中的靜音段，去除所述混合語音中的靜音段，以根據所述靜音段對所述混合語音進行分割，得到分割後的長語音段；分幀單元，用於對所述長語音段進行分幀，以提取每一長語音段的聲學特徵；切分單元，用於對每一長語音段的聲學特徵進行KL距離分析，根據KL距離分析結果對所述語音段進行切分，得到切分後的短語音段；聚類單元，用於利用高斯混合模型對各短語音段進行語音聚類，並對同一語音類的短語音段標注對應的說話人標識。 In an embodiment, the splitting module includes: a removing unit, configured to acquire a silent segment in the mixed voice, and remove a silent segment in the mixed voice, to perform the mixed voice according to the silent segment. Dividing to obtain a divided long speech segment; a framing unit for framing the long speech segment to extract an acoustic feature of each long speech segment; and a segmentation unit for each long speech segment The acoustic feature is analyzed by KL distance, and the segment is segmented according to the KL distance analysis result to obtain a segmented phrase segment; the clustering unit is used for performing speech clustering on each phrase segment by using a Gaussian mixture model. The corresponding speaker identifier is marked with the phrase segment of the same voice class.

在一實施例中，所述切分單元具體用於對每一長語音段的聲學特徵進行KL距離分析，對時長大於預設時間閾值的長語音段在KL距離的最大值處進行切分，得到切分後的短語音段。 In an embodiment, the segmentation unit is specifically configured to perform KL distance analysis on the acoustic features of each long speech segment, and segment the long speech segment whose duration is greater than the preset time threshold at a maximum value of the KL distance. , get the segmented phrase segment.

在一實施例中，所述調整模組包括：建模單元，用於利用所述時間遞歸神經網路對各說話人標識對應的短語音段建立聲紋模型，基於所述聲紋模型提取表徵說話人身份特徵的預設類型向量；計算單元，用於基於所述預設類型向量計算每一語音幀屬對應的說話人的最大後驗機率；第一調整單元，用於基於所述最大後驗機率並利用預定算法調整該說話人的混合高斯模型；第二調整單元，用於基於調整後的混合高斯模型獲取每一語音幀對應的機率最大的說話人，並根據機率最大的說話人與語音幀的機率關係調整所述混合語音中對應的分割邊界；疊代單元，用於疊代更新所述聲紋模型n次，每次更新所述聲紋模型時疊代m次所述混合高斯模型，以得到各說話人對應的有效語音段，n及m均為大於1的正整數。 In an embodiment, the adjustment module includes: a modeling unit, configured to establish a voiceprint model for a phrase segment corresponding to each speaker identifier by using the time recurrent neural network, and extract a representation based on the voiceprint model a preset type vector of the speaker identity feature; the calculating unit is configured to calculate a maximum posterior probability of the speaker corresponding to each voice frame genus based on the preset type vector; the first adjusting unit is configured to be based on the maximum Checking the rate and adjusting the mixed Gaussian model of the speaker by using a predetermined algorithm; the second adjusting unit is configured to acquire the speaker with the highest probability corresponding to each voice frame based on the adjusted mixed Gaussian model, and according to the speaker with the highest probability The probability relationship of the speech frame adjusts a corresponding segmentation boundary in the mixed speech; an iteration unit is used to update the voiceprint model n times in an iterative manner, and the hybrid Gauss is iterated m times each time the voiceprint model is updated The model is to obtain a valid speech segment corresponding to each speaker, and both n and m are positive integers greater than one.

在一實施例中，所述語音分割的裝置還包括：反饋模組，用於基於所述有效語音段獲取對應的應答內容，並將所述應答內容反饋給所述終端。 In an embodiment, the apparatus for voice segmentation further includes: a feedback module, configured to acquire a corresponding response content based on the valid voice segment, and feed back the response content to the terminal.

本發明的有益效果是：本發明首先將混合語音進行分割，分割成多個短語音段，每一短語音段對應標識一個說話人，利用時間遞歸神經網路對各短語音段建立聲紋模型，由於利用時間遞歸神經網路建立的聲紋模型能夠關聯說話人跨時間點的聲音信息，因此基於該聲紋模型實現對短語音段的分割邊界的調整，能夠有效提高語音分割的精度，特別是對於對話交替頻繁、以及有交疊的語音，語音分割的效果較好。 The invention has the beneficial effects that the present invention first divides the mixed speech into a plurality of phrase segments, each phrase segment correspondingly identifies a speaker, and uses a time recurrent neural network to establish a voiceprint model for each phrase segment. Since the voiceprint model established by using the time recurrent neural network can correlate the sound information of the speaker across time points, the adjustment of the segmentation boundary of the phrase segment can be realized based on the voiceprint model, which can effectively improve the accuracy of the voice segmentation, especially It is a good effect for speech segmentation for speech that alternates frequently and has overlapping speech.

101‧‧‧分割模組 101‧‧‧Segment Module

1011‧‧‧去除單元 1011‧‧‧Removal unit

1012‧‧‧分幀單元 1012‧‧‧Frame unit

1013‧‧‧切分單元 1013‧‧‧Segment unit

1014‧‧‧聚類單元 1014‧‧‧ clustering unit

102‧‧‧調整模組 102‧‧‧Adjustment module

1021‧‧‧建模單元 1021‧‧‧Modeling unit

1022‧‧‧計算單元 1022‧‧‧Computation unit

1023‧‧‧第一調整單元 1023‧‧‧First adjustment unit

1024‧‧‧第二調整單元 1024‧‧‧Second adjustment unit

1025‧‧‧疊代單元 1025‧‧‧ iteration unit

S1‧‧‧步驟 S1‧‧‧ steps

S11‧‧‧步驟 S11‧‧ steps

S12‧‧‧步驟 Step S12‧‧‧

S13‧‧‧步驟 S13‧‧‧ steps

S14‧‧‧步驟 S14‧‧‧ steps

S2‧‧‧步驟 S2‧‧‧ steps

S21‧‧‧步驟 S21‧‧‧ steps

S22‧‧‧步驟 S22‧‧‧ steps

S23‧‧‧步驟 S23‧‧‧Steps

S24‧‧‧步驟 S24‧‧‧Steps

S25‧‧‧步驟 S25‧‧‧ steps

圖1為本發明語音分割的方法一實施例的流程示意圖。 FIG. 1 is a schematic flow chart of a method for voice segmentation according to an embodiment of the present invention.

圖2為圖1所示步驟S1的細化流程示意圖。 FIG. 2 is a schematic diagram showing the refinement process of step S1 shown in FIG. 1.

圖3為圖1所示步驟S2的細化流程示意圖。 FIG. 3 is a schematic diagram of the refinement process of step S2 shown in FIG. 1.

圖4為本發明語音分割的裝置一實施例的結構示意圖。 FIG. 4 is a schematic structural diagram of an apparatus for voice segmentation according to an embodiment of the present invention.

圖5為圖4所示分割模組的結構示意圖。 FIG. 5 is a schematic structural view of the split module shown in FIG. 4.

圖6為圖4所示調整模組的結構示意圖。 FIG. 6 is a schematic structural view of the adjustment module shown in FIG. 4.

以下結合附圖對本發明的原理和特徵進行描述，所舉實例只用于解釋本發明，並非用於限定本發明的範圍。 The principles and features of the present invention are described in the following with reference to the accompanying drawings.

如圖1所示，圖1為本發明語音分割的方法一實施例的流程示意圖，該語音分割的方法包括以下步驟：步驟S1，自動應答系統在接收到終端發送的混合語音時，將所述混合語音分割成多個短語音段，並對各短語音段標注對應的說話人標識；本實施例中可應用於呼叫中心的自動應答系統中，例如保險呼叫中心的自動應答系統、各種客服呼叫中心的自動應答系統等等。自動應答系統接收到終端發送的原始的混合語音，該混合語音中混合有多種不同的聲源產生的聲音，例如有多人說話混合的聲音，多人說話的聲音與其他噪聲混合的聲音等等。 As shown in FIG. 1 , FIG. 1 is a schematic flowchart of a method for voice segmentation according to an embodiment of the present invention. The method for voice segmentation includes the following steps: Step S1: When an automatic answering system receives a mixed voice sent by a terminal, The mixed voice is divided into a plurality of phrase segments, and each speaker segment is marked with a corresponding speaker identifier; in this embodiment, it can be applied to an automatic answering system of a call center, such as an automatic answering system of an insurance call center, and various customer service calls. The center's automatic answering system and more. The automatic answering system receives the original mixed voice sent by the terminal, and the mixed voice is mixed with sounds generated by a plurality of different sound sources, such as a sound mixed by a plurality of people, a sound mixed by a plurality of people and a sound mixed with other noises, and the like. .

本實施例可以利用預定的方法將混合語音分割成多個短語音段，例如可以利用高斯混合模型(Gaussian Mixture Model，GMM)將混合語音分割成多個短語音段，當然，也可以利用其他傳統的方法將混合語音分割成多個短語音段。 In this embodiment, the mixed voice can be segmented into a plurality of phrase segments by using a predetermined method. For example, a Gaussian Mixture Model (GMM) can be used to divide the mixed voice into a plurality of phrase segments. Of course, other traditions can also be utilized. The method divides the mixed speech into multiple phrase segments.

其中，經本實施例的語音分割後，每一短語音段應只對應一說話人，不同的短語音段中可能有多個短語音段屬同一個說話人，將同一個說話人的不同短語音段進行相同的標識。 Wherein, after the speech segmentation of the embodiment, each phrase segment should correspond to only one speaker, and different phrase segments may have the same speaker, and different speaker sounds of the same speaker. The segments are labeled the same.

步驟S2，利用時間遞歸神經網路對各說話人標識對應的短語音段建立聲紋模型，基於所述聲紋模型調整所述混合語音中對應的分割邊界，以分割出各說話人標識對應的有效語音段。 Step S2, using a time recurrent neural network to establish a voiceprint model for the phrase segments corresponding to each speaker identifier, and adjusting a corresponding segmentation boundary in the mixed voice based on the voiceprint model to segment the corresponding speaker identifiers. Effective voice segment.

本實施例中，時間遞歸神經網路模型(Long-Short Term Memory，LSTM)擁有遞歸神經網路在傳統前向反饋神經網路中引入的定向循環，用以處理層間輸入前後、層內輸出前後的關聯。用時間遞歸神經網路在語音序列上建模，可以得到跨越時間點的語音信號特徵，可以用於對關聯信息處於任何長度、任何位置的語音序列進行處理。時間遞歸神經網路模型通過神經網路層內設計多個交互層，可以記憶到更遠時間節點上的信息，在時間遞歸神經網路模型中用“忘記門層”丟棄與識別任務不相關的信息，接著用“輸入門層”決定需要更新的狀態，最後確定需要輸出的狀態並處理輸出。 In this embodiment, the Long-Short Term Memory (LSTM) model has a directional loop introduced by a recurrent neural network in a traditional forward feedback neural network to process before and after inter-layer input and before and after layer output. The association. By modeling the speech sequence with the time recurrent neural network, the speech signal characteristics across time points can be obtained, which can be used to process the speech sequence with associated information at any length and any position. The time recurrent neural network model can design the multiple interaction layers in the neural network layer to memorize the information on the nodes at the farther time. In the time recurrent neural network model, the “forget gate layer” discards the identification task. Information, then use the "input gate layer" to determine the state that needs to be updated, and finally determine the state that needs to be output and process the output.

本實施例對於各說話人標識對應的短語音段，利用時間遞歸神經網路建立聲紋模型，通過該聲紋模型可以得到說話人跨越時間點的聲音信息，基於這些聲音信息可以調整混合語音中對應的分割邊界，以對每一說話人對應的所有短語音段調整其分割邊界，最終分割出各說話人標識對應的有效語音段，該有效語音段可以看作對應的說話人的完整語音。 In this embodiment, a voice pattern is established by using a time recurrent neural network for a phrase segment corresponding to each speaker identifier, and the voice information of the speaker crossing the time point can be obtained by the voiceprint model, and the mixed voice can be adjusted based on the voice information. Corresponding segmentation boundaries are used to adjust the segmentation boundaries of all the phrase segments corresponding to each speaker, and finally the effective segment of speech corresponding to each speaker identifier is segmented, and the effective segment can be regarded as the complete speech of the corresponding speaker.

與現有技術相比，本實施例首先將混合語音進行分割，分割成多個短語音段，每一短語音段對應標識一個說話人，利用時間遞歸神經網路對各短語音段建立聲紋模型，由於利用時間遞歸神經網路建立的聲紋模型能夠關聯說話人跨時間點的聲音信息，因此基於該聲紋模型實現對短語音段的分割邊界的調整，能夠有效提高語音分割的精度，特別是對於對話交替頻繁、以及有交疊的語音，語音分割的效果較好。 Compared with the prior art, the embodiment first divides the mixed speech into a plurality of phrase segments, and each phrase segment corresponds to identify a speaker, and uses a time recurrent neural network to establish a voiceprint model for each phrase segment. Since the voiceprint model established by using the time recurrent neural network can correlate the sound information of the speaker across time points, the adjustment of the segmentation boundary of the phrase segment can be realized based on the voiceprint model, which can effectively improve the accuracy of the voice segmentation, especially It is a good effect for speech segmentation for speech that alternates frequently and has overlapping speech.

在一較佳的實施例中，如圖2所示，在上述圖1的實施例的基礎上，上述步驟S1包括：步驟S11，獲取所述混合語音中的靜音段，去除所述混合語音中的靜音段，以根據所述靜音段對所述混合語音進行分割，得到分割後的長語音段；步驟S12，對所述長語音段進行分幀，以提取每一長語音段的聲學特徵；步驟S13，對每一長語音段的聲學特徵進行KL距離分析，根據KL距離分析結果對所述語音段進行切分，得到切分後的短語音段，步驟S14，利用高斯混合模型對各短語音段進行語音聚類，並對同一語音類的短語音段標注對應的說話人標識。 In a preferred embodiment, as shown in FIG. 2, on the basis of the foregoing embodiment of FIG. 1, the foregoing step S1 includes: Step S11, acquiring a silent segment in the mixed voice, and removing the mixed voice. a silent segment to segment the mixed speech according to the silent segment to obtain a segmented long speech segment; and in step S12, segment the long speech segment to extract an acoustic feature of each long speech segment; Step S13, performing KL distance analysis on the acoustic features of each long speech segment, segmenting the speech segment according to the KL distance analysis result, and obtaining the segmented phrase segment, step S14, using the Gaussian mixture model for each short The speech segment performs speech clustering, and the corresponding speaker segment is marked with the phrase segment of the same speech class.

本實施例中，首先根據靜音進行初步分割：確定混合語音中的靜音段，將確定的靜音段從混合語音中去除，以實現將混合語音根據靜音段進行分割，靜音段是通過對混合語音的短時語音能量和短時過零率的分析來確定的。 In this embodiment, the initial segmentation is first performed according to the mute: determining the mute segment in the mixed speech, and removing the determined mute segment from the mixed speech, so as to realize the segmentation of the mixed speech according to the mute segment, and the mute segment is through the mixed speech. The analysis of short-term speech energy and short-term zero-crossing rate is determined.

去除靜音段後，首先假設在整個混合語音中，每人每次講話時長為固定閾值Tu，若某段語音大於該時長，則可能多人說話，若小於該時長，則更可能只有一個人說話，基於這種假設，可以對靜音分割後的每個長語音段的時長大於固定閾值Tu的語音段的聲學特徵進行幀間KL距離分析。當然，也可以對所有的長語音段的聲學特徵進行幀間KL距離分析。具體地，對得到的長語音段進行分幀，以得到每一長語音段的語音幀，提取語音幀的聲學特徵，對所有長語音段的聲學特徵進行KL距離(也即相對熵)分析，其中，聲學特徵包括但不限定於線性預測係數、倒頻譜係數MFCC、平均過零率、短時頻譜、共振峰頻率及頻寬。 After removing the silent segment, first assume that in the entire mixed voice, each person speaks for a fixed threshold Tu. If a certain voice is greater than the duration, more people may speak. If it is less than the duration, it is more likely to be only One person speaks, based on this assumption, an inter-frame KL distance analysis can be performed on the acoustic features of the speech segment of each of the long speech segments after the silence segmentation is greater than the fixed threshold Tu. Of course, inter-frame KL distance analysis can also be performed on the acoustic characteristics of all long speech segments. Specifically, the obtained long speech segment is framed to obtain a speech frame of each long speech segment, the acoustic features of the speech frame are extracted, and KL distance (ie, relative entropy) is analyzed for the acoustic features of all the long speech segments. Among them, the acoustic characteristics include, but are not limited to, a linear prediction coefficient, a cepstral coefficient MFCC, an average zero-crossing rate, a short-time spectrum, a formant frequency, and a bandwidth.

其中，KL距離分析的含義是對於兩個離散型的聲學特徵機率分佈集合P={p1，p2，...，pn}和Q={q1，q2，...，qn}，P和Q間的KL 距離：，當KL距離越大時，PQ兩者差異越大，即PQ 這兩個集合來自兩個不同人的語音。較佳地，對時長大於預設時間閾值的長語音段在KL的最大值處進行切分，以提高語音分割的精度。 Among them, the meaning of KL distance analysis is that for two discrete types of acoustic feature probability distribution sets P = {p1, p2, ..., pn} and Q = {q1, q2, ..., qn}, P and Q KL distance between: When the KL distance is larger, the difference between the two PQs is greater, that is, the two sets of PQ are from the voices of two different people. Preferably, the long speech segment whose duration is greater than the preset time threshold is segmented at the maximum value of KL to improve the accuracy of speech segmentation.

長語音段經過切分後得到短語音段，短語音段的數量大於長語音段的數量。然後進行短語音段聚類：對切分後的短語音段進行聚類，以將所有短語音段聚為多個語音類，並為各個短語音段標注對應的說話人標識，其中，屬同一語音類的短語音段標注相同的說話人標識，不屬同一語音類的短語音段標注不同的說話人標識。聚類方法是：採用K個成分的高斯混合模型擬合每段短語音段，以均值作為特徵向量，使用k-means聚類方法把所有短語音段聚為多類。 The long speech segment is segmented to obtain a phrase segment, and the number of phrase segments is greater than the number of long segments. Then, the phrase segment clustering is performed: clustering the segmented phrase segments to aggregate all the phrase segments into a plurality of phonetic classes, and labeling the corresponding speaker segments for each phrase segment, wherein, the same The phonetic phrase segment is labeled with the same speaker identifier, and the phrase segments that are not of the same voice class are labeled with different speaker identifiers. The clustering method is: using the Gaussian mixture model of K components to fit each segment of the phrase, using the mean as the feature vector, and using k-means clustering method to cluster all the phrase segments into multiple categories.

在一較佳的實施例中，如圖3所示，在上述的實施例的基礎上，上述步驟S2包括：步驟S21，利用所述時間遞歸神經網路對各說話人標識對應的短語音段建立聲紋模型，基於所述聲紋模型提取表徵說話人身份特徵的預設類型向量；步驟S22，基於所述預設類型向量計算每一語音幀屬對應的說話人的最大後驗機率；步驟S23，基於所述最大後驗機率並利用預定算法調整該說話人的混合高斯模型；步驟S24，基於調整後的混合高斯模型獲取每一語音幀對應的機率最大的說話人，並根據機率最大的說話人與語音幀的機率關係調整所述混合語音中對應的分割邊界；步驟S25，疊代更新所述聲紋模型n次，每次更新所述聲紋模型時疊代m次所述混合高斯模型，以得到各說話人對應的有效語音段，n及m均為大於1的正整數。 In a preferred embodiment, as shown in FIG. 3, based on the foregoing embodiment, the step S2 includes: Step S21, using the time recurrent neural network to identify a phrase segment corresponding to each speaker. Establishing a voiceprint model, and extracting a preset type vector representing the speaker identity feature based on the voiceprint model; and step S22, calculating a maximum posterior probability of the speaker corresponding to each voice frame based on the preset type vector; S23, adjusting the mixed Gaussian model of the speaker by using a predetermined algorithm based on the maximum posterior probability; and step S24, acquiring the speaker with the highest probability corresponding to each voice frame based on the adjusted mixed Gaussian model, and according to the maximum probability The probability relationship between the speaker and the voice frame adjusts the corresponding segmentation boundary in the mixed voice; in step S25, the voiceprint model is updated n times, and the hybrid Gauss is iterated m times each time the voiceprint model is updated. The model is to obtain a valid speech segment corresponding to each speaker, and both n and m are positive integers greater than one.

本實施例中，利用時間遞歸神經網路對各說話人標識對應的短語音段建立聲紋模型，基於所述聲紋模型提取表徵說話人身份特徵的預設類型向量，較佳地，該預設類型向量為i-vector向量，i-vector向量是反映說話人聲學差異的一個重要特徵。 In this embodiment, a temporal recursive neural network is used to establish a voiceprint model for a phrase segment corresponding to each speaker identifier, and a preset type vector that characterizes the speaker identity feature is extracted based on the voiceprint model. Preferably, the pre-preparation Let the type vector be an i-vector vector, which is an important feature that reflects the speaker's acoustic difference.

在整個混合語音中，根據預設類型向量計算每一語音幀屬某一說話人的最大後驗機率，利用計算最大後驗機率，在混合語音中通過預設算法重新調整說話人的混合高斯模型，例如，通過Baum-Welch算法重新調整說話人的混合高斯模型，該混合高斯模型為k(一般為3-5個)個高斯模型的集合。利用重新調整後的混合高斯模型尋找每一語音幀機率最大的說話人。根據語音幀與尋找到的該說話人的機率關係調整混合語音的分割邊界，例如將分割邊界向前微調或者向後微調。最後，疊代更新上述聲紋模型n次，每次更新聲紋模型時疊代m次混合高斯模型，以得到各個說話人對應的有效語音段，n及m均為大於1的正整數。 In the whole mixed speech, the maximum posterior probability of each speaker is calculated according to the preset type vector, and the mixed Gaussian model of the speaker is re-adjusted by the preset algorithm in the mixed speech by calculating the maximum posterior probability. For example, the speaker's mixed Gaussian model is re-adjusted by the Baum-Welch algorithm, which is a set of k (typically 3-5) Gaussian models. Use the re-adjusted mixed Gaussian model to find the speaker with the highest probability of each speech frame. The segmentation boundary of the mixed speech is adjusted according to the probability relationship between the speech frame and the found speaker, for example, the segmentation boundary is fine-tuned forward or fine-tuned backward. Finally, the above-mentioned voiceprint model is updated n times, and each time the voiceprint model is updated, the m-mixed Gaussian model is superimposed to obtain an effective speech segment corresponding to each speaker, and n and m are positive integers greater than 1.

本實施例借助深度學習的時間遞歸神經網路建立聲紋模型，用各說話人聲紋對應的身份特徵對應各語音幀以計算語音幀屬某一說話人的機率，基於該機率修正模型，最終調整語音分割的邊界，可以有效提高說話人語音分割的精度，降低錯誤率，且可擴展性好。 In this embodiment, a voiceprint model is established by means of a deep learning time recurrent neural network, and each voice frame corresponding to each speaker voice pattern is used to calculate a probability that the voice frame belongs to a certain speaker, and the probability correction model is finally adjusted. The boundary of speech segmentation can effectively improve the accuracy of speaker speech segmentation, reduce the error rate, and have good scalability.

在一較佳的實施例中，在上述的實施例的基礎上，該方法在上述步驟S2之後還包括：基於所述有效語音段獲取對應的應答內容，並將所述應答內容反饋給所述終端。 In a preferred embodiment, based on the foregoing embodiment, after the step S2, the method further includes: acquiring a corresponding response content based on the valid voice segment, and feeding back the response content to the terminal.

本實施例中，自動應答系統關聯對應的應答庫，該應答庫中存儲有不同的問題對應的應答內容，自動應答系統在接收到終端發送的混合語音後，將其分割為說話人標識對應的有效語音段，從這些有效語音段中獲取與該自動應答系統有關問題的一個有效語音段，針對該有效語音段在應答庫中進行匹配，並將匹配得到的應答內容反饋給終端。 In this embodiment, the automatic response system is associated with a corresponding response library, and the response library stores response content corresponding to different questions. After receiving the mixed voice sent by the terminal, the automatic response system divides the response into a speaker identifier. An effective voice segment, from which an effective voice segment related to the automatic answering system is obtained, and the valid voice segment is matched in the response library, and the matched response content is fed back to the terminal.

如圖4所示，圖4為本發明語音分割的裝置一實施例的結構示意圖，該語音分割的裝置包括：分割模組101，用於在接收到終端發送的混合語音時，將所述混合語音分割成多個短語音段，並對各短語音段標注對應的說話人標識；本實施例的語音分割的裝置中包括自動應答系統，例如保險呼叫中心的自動應答系統、各種客服呼叫中心的自動應答系統等等。自動應答系統接收到終端發送的原始的混合語音，該混合語音中混合有多種不同的聲源產生的聲音，例如有多人說話混合的聲音，多人說話的聲音與其他噪聲混合的聲音等等。 As shown in FIG. 4, FIG. 4 is a schematic structural diagram of an apparatus for voice segmentation according to an embodiment of the present invention. The apparatus for voice segmentation includes: a segmentation module 101, configured to: when the mixed voice sent by the terminal is received, the hybrid The voice is divided into a plurality of phrase segments, and each speaker segment is marked with a corresponding speaker identifier; the voice segmentation device of the embodiment includes an automatic response system, such as an automatic answering system of an insurance call center, and various customer service call centers. Auto answering system and more. The automatic answering system receives the original mixed voice sent by the terminal, and the mixed voice is mixed with sounds generated by a plurality of different sound sources, such as a sound mixed by a plurality of people, a sound mixed by a plurality of people and a sound mixed with other noises, and the like. .

調整模組102，用於利用時間遞歸神經網路對各說話人標識對應的短語音段建立聲紋模型，基於所述聲紋模型調整所述混合語音中對應的分割邊界，以分割出各說話人標識對應的有效語音段。 The adjustment module 102 is configured to establish a voiceprint model for a phrase segment corresponding to each speaker identifier by using a time recurrent neural network, and adjust a corresponding segmentation boundary in the mixed voice based on the voiceprint model to segment each speech The person identifies the corresponding valid voice segment.

在一較佳的實施例中，如圖5所示，在上述圖4的實施例的基礎上，上述分割模組101包括：去除單元1011，用於獲取所述混合語音中的靜音段，去除所述混合語音中的靜音段，以根據所述靜音段對所述混合語音進行分割，得到分割後的長語音段；分幀單元1012，用於對所述長語音段進行分幀，以提取每一長語音段的聲學特徵；切分單元1013，用於對每一長語音段的聲學特徵進行KL距離分析，根據KL距離分析結果對所述語音段進行切分，得到切分後的短語音段；聚類單元1014，用於利用高斯混合模型對各短語音段進行語音聚類，並對同一語音類的短語音段標注對應的說話人標識。 In a preferred embodiment, as shown in FIG. 5, on the basis of the foregoing embodiment of FIG. 4, the splitting module 101 includes: a removing unit 1011, configured to acquire a silent segment in the mixed voice, and remove a mute segment in the mixed voice to segment the mixed speech according to the mute segment to obtain a segmented long speech segment; a framing unit 1012, configured to frame the long speech segment to extract An acoustic feature of each long speech segment; a segmentation unit 1013 configured to perform KL distance analysis on the acoustic features of each long speech segment, and segment the speech segment according to the KL distance analysis result to obtain a short after segmentation a speech segment; a clustering unit 1014, configured to perform speech clustering on each phrase segment by using a Gaussian mixture model, and label a corresponding speaker segment with a phrase segment of the same phonetic class.

在一較佳的實施例中，如圖6所示，在上述實施例的基礎上，上述調整模組102包括：建模單元1021，用於利用所述時間遞歸神經網路對各說話人標識對應的短語音段建立聲紋模型，基於所述聲紋模型提取表徵說話人身份特徵的預設類型向量；計算單元1022，用於基於所述預設類型向量計算每一語音幀屬對應的說話人的最大後驗機率；第一調整單元1023，用於基於所述最大後驗機率並利用預定算法調整該說話人的混合高斯模型；第二調整單元1024，用於基於調整後的混合高斯模型獲取每一語音幀對應的機率最大的說話人，並根據機率最大的說話人與語音幀的機率關係調整所述混合語音中對應的分割邊界；疊代單元1025，用於疊代更新所述聲紋模型n次，每次更新所述聲紋模型時疊代m次所述混合高斯模型，以得到各說話人對應的有效語音段，n及m均為大於1的正整數。 In a preferred embodiment, as shown in FIG. 6, on the basis of the foregoing embodiment, the adjustment module 102 includes: a modeling unit 1021, configured to identify each speaker by using the time recurrent neural network. a corresponding voice segment to establish a voiceprint model, and a preset type vector for characterizing the speaker identity feature is extracted based on the voiceprint model; the calculating unit 1022 is configured to calculate a voice corresponding to each voice frame based on the preset type vector a maximum posterior probability of the person; a first adjusting unit 1023, configured to adjust the mixed Gaussian model of the speaker based on the maximum posterior probability and using a predetermined algorithm; and a second adjusting unit 1024, configured to adjust the mixed Gaussian model Obtaining a speaker with the highest probability corresponding to each voice frame, and adjusting a corresponding segmentation boundary in the mixed voice according to a probability relationship between the speaker with the highest probability and the voice frame; the iteration unit 1025 is configured to update the sound in an iterative manner The pattern is n times, and the mixed Gaussian model is superimposed m times each time the voiceprint model is updated to obtain an effective speech segment corresponding to each speaker, and both n and m are greater than 1 Number.

在一較佳的實施例中，在上述的實施例的基礎上，所述語音分割的裝置還包括：反饋模組，用於基於所述有效語音段獲取對應的應答內容，並將所述應答內容反饋給所述終端。 In a preferred embodiment, the apparatus for voice segmentation further includes: a feedback module, configured to acquire a corresponding response content based on the valid voice segment, and the response is Content is fed back to the terminal.

以上所述僅為本發明的較佳實施例，並不用以限制本發明，凡在本發明的精神和原則之內，所作的任何修改、等同替換、改進等，均應包含在本發明的保護範圍之內。 The above are only the preferred embodiments of the present invention, and are not intended to limit the present invention. Any modifications, equivalents, improvements, etc., which are within the spirit and scope of the present invention, should be included in the protection of the present invention. Within the scope.

Claims

A method for voice segmentation, comprising: S1, when receiving the mixed voice sent by the terminal, the automatic answering system divides the mixed voice into a plurality of phrase segments, and marks a corresponding speaker identifier for each phrase segment; The step S1 includes: S11, acquiring a silent segment in the mixed voice, and removing a silent segment in the mixed voice, to segment the mixed voice according to the silent segment, to obtain a long segment after segmentation. S12, the long speech segment is segmented to extract acoustic features of each long speech segment; S13, KL distance analysis is performed on the acoustic features of each long speech segment, and the long speech is performed according to the KL distance analysis result. The segment is segmented to obtain a segmented phrase segment, wherein the KL distance is a relative entropy; and S14, a Gaussian mixture model is used to perform speech clustering on each phrase segment, and a phrase segment of the same phonetic class Marking a corresponding speaker identifier; S2, using a time recurrent neural network to establish a voiceprint model for a phrase segment corresponding to each speaker identifier, and adjusting the mixed voice based on the voiceprint model Corresponding segmentation boundary to segment the valid speech segment corresponding to each speaker identifier; wherein the step S2 includes: S21, using the time recurrent neural network to establish a voiceprint model for the phrase segment corresponding to each speaker identifier, Extracting a preset type vector representing the speaker identity feature based on the voiceprint model; S22, calculating a maximum posterior probability of the speaker corresponding to each voice frame genre based on the preset type vector; S23, based on the maximum Check the rate and adjust the speaker's mix using a predetermined algorithm a Gaussian model; S24, obtaining a speaker with the highest probability corresponding to each voice frame based on the adjusted mixed Gaussian model, and adjusting a corresponding segmentation boundary in the mixed voice according to a probability relationship between the speaker with the highest probability and the voice frame; And S25, updating the voiceprint model n times in an iterative manner, and superimposing the mixed Gaussian model m times each time the voiceprint model is updated, to obtain valid voice segments corresponding to each speaker, where n and m are greater than A positive integer of 1.

The method for voice segmentation according to claim 1, wherein the step S13 comprises: performing KL distance analysis on the acoustic features of each long segment of speech, and the KL distance for the long segment of the speech whose duration is greater than the preset time threshold. The maximum value is scored to obtain a segmented phrase segment.

The method for voice segmentation according to claim 1, wherein the step S2 further comprises: acquiring a corresponding response content based on the valid voice segment, and feeding back the response content to the terminal.

A device for voice segmentation, comprising: a segmentation module, configured to divide the mixed speech into a plurality of phrase segments and receive a corresponding speaker identifier for each phrase segment when receiving the mixed speech sent by the terminal; The segmentation module includes: a removal unit, configured to acquire a silence segment in the mixed voice, and remove a silence segment in the mixed voice, to segment the mixed voice according to the silence segment, and obtain a segmentation a long speech segment; a framing unit for framing the long speech segment to extract an acoustic feature of each long speech segment; a segmentation unit configured to perform KL distance analysis on the acoustic features of each long segment of speech, and segment the long segment of speech according to the KL distance analysis result to obtain a segmented phrase segment, wherein the KL distance is a relative entropy; and a clustering unit for performing voice clustering on each phrase segment by using a Gaussian mixture model, and marking a corresponding speaker identifier for a phrase segment of the same phonetic class; and adjusting a module for utilizing time recursion The neural network establishes a voiceprint model for the phrase segments corresponding to the speaker identifiers, and adjusts corresponding segmentation boundaries in the mixed voice based on the voiceprint model to segment the valid voice segments corresponding to the speaker identifiers; The adjustment module includes: a modeling unit, configured to establish a voiceprint model for the phrase segments corresponding to each speaker identifier by using the time recurrent neural network, and extract a preset representing the speaker identity feature based on the voiceprint model a calculation unit, configured to calculate a maximum posterior probability of a speaker corresponding to each voice frame genre based on the preset type vector; a first adjustment unit, configured to Describe the maximum posterior probability and adjust the mixed Gaussian model of the speaker by using a predetermined algorithm; the second adjusting unit is configured to acquire the speaker with the highest probability corresponding to each voice frame based on the adjusted mixed Gaussian model, and according to the maximum probability a probability relationship between the speaker and the speech frame adjusts a corresponding segmentation boundary in the mixed speech; and an iteration unit, configured to update the voiceprint model n times in an iterative manner, and iteratively m times each time the voiceprint model is updated The Gaussian model is mixed to obtain valid speech segments corresponding to each speaker, and n and m are positive integers greater than one.

The apparatus for voice segmentation according to Item 4, wherein the segmentation unit is specifically configured to perform KL distance analysis on acoustic characteristics of each long segment of speech, and the duration is greater than a preset. The long speech segment of the time threshold is segmented at the maximum of the KL distance to obtain a segmented phrase segment.

The device for voice segmentation according to claim 4, wherein the device for voice segmentation further comprises: a feedback module, configured to acquire a corresponding response content based on the valid voice segment, and feed back the response content to The terminal.