
TW201833810A - Method and system of authentication based on voiceprint recognition - Google Patents

Method and system of authentication based on voiceprint recognition

Info

Publication number
TW201833810A
TW201833810A (application TW106135250A)
Authority
TW
Taiwan
Prior art keywords
voiceprint
vector
voice data
verification
user
Prior art date
Application number
TW106135250A
Other languages
Chinese (zh)
Other versions
TWI641965B (en)
Inventor
王健宗
丁涵宇
郭卉
肖京
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of TW201833810A

Application granted

Publication of TWI641965B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/04 Training, enrolment or model building
    • G10L 17/06 Decision making techniques; Pattern matching strategies
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters
    • G10L 25/18 Speech or voice analysis techniques characterised by the extracted parameters being spectral information of each sub-band
    • G10L 25/24 Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00 Network architectures or network communication protocols for network security
    • H04L 63/08 Network architectures or network communication protocols for network security for authentication of entities
    • H04L 63/0861 Network security authentication of entities using biometrical features, e.g. fingerprint, retina-scan

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Biomedical Technology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Game Theory and Decision Science (AREA)
  • Collating Specific Patterns (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A method and a system of authentication based on voiceprint recognition are disclosed. The method comprises the steps of: after receiving voice data from a user to be authenticated, acquiring the voiceprint features of the voice data, and generating a corresponding voiceprint feature vector based on those features; inputting the voiceprint feature vector into a background channel model generated by training in advance, to construct a current voiceprint identification vector corresponding to the voice data; and calculating a spatial distance between the current voiceprint identification vector and a pre-stored standard voiceprint identification vector of the user, and generating a verification result by authenticating the user based on that spatial distance. The invention thereby improves the accuracy and efficiency of user authentication.

Description

Method and system for identity verification based on voiceprint recognition

The present invention relates to the field of communications technologies, and in particular to a method and system for identity verification based on voiceprint recognition.

At present, the business scope of large financial companies covers insurance, banking, investment and other areas, and each business area usually requires communication with customers in a variety of ways (for example, by telephone or face to face). Verifying the customer's identity before such communication is an important part of ensuring business security. To meet the real-time needs of the business, financial companies usually analyse and verify customer identity manually. Because the customer base is large, relying on manual discriminant analysis to verify customer identity is neither accurate nor efficient.

An object of the present invention is to provide a method and system for identity verification based on voiceprint recognition, aiming to improve the accuracy and efficiency of user identity verification.

To achieve the above object, the present invention provides a method for identity verification based on voiceprint recognition, comprising the steps of: S1, after receiving voice data from a user to be authenticated, acquiring the voiceprint features of the voice data and constructing a corresponding voiceprint feature vector based on those features; S2, inputting the voiceprint feature vector into a background channel model generated by training in advance, to construct the current voiceprint identification vector corresponding to the voice data; and S3, calculating the spatial distance between the current voiceprint identification vector and the user's pre-stored standard voiceprint identification vector, authenticating the user based on that distance, and generating a verification result.

Preferably, step S1 comprises the sub-steps of: S11, performing pre-emphasis, framing and windowing on the voice data; S12, applying a Fourier transform to each windowed frame to obtain its spectrum; S13, passing the spectrum through a Mel filter bank to obtain the Mel spectrum; and S14, performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), and composing the corresponding voiceprint feature vector from the MFCC.
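
By way of illustration, the following is a minimal Python sketch of sub-steps S11 to S14, assuming the librosa library is available; the sample rate, frame length, hop length and number of coefficients are illustrative choices, not values fixed by this disclosure:

    # Illustrative sketch only: librosa and all parameter values are assumptions.
    import librosa
    import numpy as np

    def extract_voiceprint_features(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
        signal, sr = librosa.load(wav_path, sr=16000)  # load and resample the voice data
        # pre-emphasis H(z) = 1 - 0.97*z^-1 (sub-step S11)
        emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
        # librosa performs framing, windowing, FFT, Mel filtering and DCT internally (S11-S14)
        mfcc = librosa.feature.mfcc(y=emphasized, sr=sr, n_mfcc=n_mfcc,
                                    n_fft=400, hop_length=200)  # 25 ms frames, 50% overlap
        return mfcc.T  # feature data matrix: one row of MFCCs per frame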

Preferably, step S3 comprises the sub-steps of: S31, calculating the cosine distance between the current voiceprint identification vector and the user's pre-stored standard voiceprint identification vector, cos θ = (A · B) / (||A|| ||B||), where A is the standard voiceprint identification vector and B is the current voiceprint identification vector; S32, if the cosine distance is less than or equal to a preset distance threshold, generating verification-passed information; and S33, if the cosine distance is greater than the preset distance threshold, generating verification-failed information.

Preferably, the background channel model is a Gaussian mixture model, and before step S1 the method comprises: acquiring a preset number of voice data samples, acquiring the voiceprint features of each sample, and constructing the voiceprint feature vector of each sample based on those features; dividing the voiceprint feature vectors of the samples into a training set of a first proportion and a validation set of a second proportion, the sum of the two proportions being less than or equal to 1; training the Gaussian mixture model with the voiceprint feature vectors in the training set and, after training, verifying the accuracy of the trained model with the validation set; and, if the accuracy is greater than a preset threshold, ending training and using the trained Gaussian mixture model as the background channel model of step S2, or, if the accuracy is less than or equal to the preset threshold, increasing the number of voice data samples and retraining on the enlarged sample set.
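
A minimal sketch of this training procedure, assuming scikit-learn; the component count and split ratios are illustrative, and the held-out mean log-likelihood is used here as a stand-in for the accuracy check described above:

    # Illustrative sketch only: scikit-learn, the ratios and the scoring are assumptions.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_background_model(features: np.ndarray, n_components: int = 64,
                               train_ratio: float = 0.7, val_ratio: float = 0.2):
        n = len(features)
        idx = np.random.default_rng(0).permutation(n)   # shuffle the sample vectors
        n_train = int(n * train_ratio)
        n_val = int(n * (train_ratio + val_ratio))      # train + val <= 1, as required
        train, val = features[idx[:n_train]], features[idx[n_train:n_val]]
        gmm = GaussianMixture(n_components=n_components, covariance_type='diag')
        gmm.fit(train)                 # unsupervised EM training
        val_score = gmm.score(val)     # mean log-likelihood on the validation set
        return gmm, val_score          # retrain with more samples if val_score is too low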

Preferably, step S3 is replaced by: calculating the spatial distances between the current voiceprint identification vector and each of the pre-stored standard voiceprint identification vectors, taking the smallest of these distances, authenticating the user based on that smallest distance, and generating a verification result.

To achieve the above object, the present invention further provides a system for identity verification based on voiceprint recognition, comprising: a first acquisition module, configured to acquire the voiceprint features of the voice data after receiving voice data from a user to be authenticated, and to construct a corresponding voiceprint feature vector based on those features; a construction module, configured to input the voiceprint feature vector into a background channel model generated by training in advance, to construct the current voiceprint identification vector corresponding to the voice data; and a first verification module, configured to calculate the spatial distance between the current voiceprint identification vector and the user's pre-stored standard voiceprint identification vector, to authenticate the user based on that distance, and to generate a verification result.

Preferably, the first acquisition module is specifically configured to perform pre-emphasis, framing and windowing on the voice data; to apply a Fourier transform to each windowed frame to obtain its spectrum; to pass the spectrum through a Mel filter bank to obtain the Mel spectrum; and to perform cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), composing the corresponding voiceprint feature vector from the MFCC.

Preferably, the first verification module is specifically configured to calculate the cosine distance between the current voiceprint identification vector and the user's pre-stored standard voiceprint identification vector, cos θ = (A · B) / (||A|| ||B||), where A is the standard voiceprint identification vector and B is the current voiceprint identification vector; to generate verification-passed information if the cosine distance is less than or equal to a preset distance threshold; and to generate verification-failed information if the cosine distance is greater than the preset distance threshold.

Preferably, the system for identity verification based on voiceprint recognition further comprises: a second acquisition module, configured to acquire a preset number of voice data samples, to acquire the voiceprint features of each sample, and to construct the voiceprint feature vector of each sample based on those features; a division module, configured to divide the voiceprint feature vectors of the samples into a training set of a first proportion and a validation set of a second proportion, the sum of the first and second proportions being less than or equal to 1; a training module, configured to train a Gaussian mixture model with the voiceprint feature vectors in the training set and, after training is completed, to verify the accuracy of the trained Gaussian mixture model with the validation set; and a processing module, configured to end training and use the trained Gaussian mixture model as the background channel model if the accuracy is greater than a preset threshold, or, if the accuracy is less than or equal to the preset threshold, to increase the number of voice data samples and retrain on the enlarged sample set.

Preferably, the first verification module is replaced by a second verification module, configured to calculate the spatial distances between the current voiceprint identification vector and each of the pre-stored standard voiceprint identification vectors, to take the smallest of these distances, to authenticate the user based on that smallest distance, and to generate a verification result.

The beneficial effects of the present invention are as follows: the pre-trained background channel model is obtained by mining and comparison training on a large amount of voice data. This model preserves the user's voiceprint features to the greatest extent while precisely characterising the background voiceprint features present when the user speaks, and it can remove these background features at recognition time so that the inherent features of the user's voice are extracted. This greatly improves the accuracy of user identity verification and makes verification more efficient.

1‧‧‧electronic device
10‧‧‧system for identity verification based on voiceprint recognition
101‧‧‧first acquisition module
102‧‧‧construction module
103‧‧‧first verification module
11‧‧‧memory
12‧‧‧processor
13‧‧‧display
S1‧‧‧step
S11‧‧‧sub-step
S12‧‧‧sub-step
S13‧‧‧sub-step
S14‧‧‧sub-step
S2‧‧‧step
S3‧‧‧step
S31‧‧‧sub-step
S32‧‧‧sub-step
S33‧‧‧sub-step

FIG. 1 is a schematic flowchart of a preferred embodiment of the method for identity verification based on voiceprint recognition of the present invention.

FIG. 2 is a detailed flowchart of step S1 shown in FIG. 1.

FIG. 3 is a detailed flowchart of step S3 shown in FIG. 1.

FIG. 4 is a schematic diagram of the operating environment of a preferred embodiment of the system for identity verification based on voiceprint recognition of the present invention.

FIG. 5 is a schematic structural diagram of a preferred embodiment of the system for identity verification based on voiceprint recognition of the present invention.

The principles and features of the present invention are described below with reference to the accompanying drawings; the examples given are intended only to explain the invention and not to limit its scope.

As shown in FIG. 1, FIG. 1 is a schematic flowchart of an embodiment of the method for identity verification based on voiceprint recognition of the present invention. The method may be executed by a system for identity verification based on voiceprint recognition; the system may be implemented in software and/or hardware and may be integrated in a server. The method comprises the following steps. Step S1: after receiving voice data from a user to be authenticated, acquire the voiceprint features of the voice data and construct a corresponding voiceprint feature vector based on those features. In this embodiment, the voice data is collected by a voice collection device (for example, a microphone), which sends the collected voice data to the system for identity verification based on voiceprint recognition.

When collecting voice data, interference from environmental noise and from the voice collection device itself should be avoided as far as possible. The voice collection device should be kept at an appropriate distance from the user, and devices with high distortion should be avoided; mains power is preferred, with the current kept stable; a sensor should be used when recording telephone calls. Before the voiceprint features are extracted, the voice data may be de-noised to further reduce interference. To ensure that voiceprint features can be extracted, the collected voice data has a preset data length, or a length greater than the preset data length.

Voiceprint features come in several types, such as broadband voiceprints, narrowband voiceprints and amplitude voiceprints. The voiceprint features of this embodiment are preferably the Mel-frequency cepstral coefficients (MFCC) of the voice data. When constructing the corresponding voiceprint feature vector, the voiceprint features of the voice data are assembled into a feature data matrix, and this matrix is the voiceprint feature vector of the voice data.

Step S2: input the voiceprint feature vector into a background channel model generated by training in advance, to construct the current voiceprint identification vector corresponding to the voice data. Preferably, the background channel model is a Gaussian mixture model; the model is used to process the voiceprint feature vector and obtain the corresponding current voiceprint identification vector (the i-vector).

Specifically, the calculation proceeds as follows. 1) Selecting Gaussian components: first, the parameters of the universal background channel model are used to compute the log-likelihood of each frame of data under the different Gaussian components; the columns of the log-likelihood matrix are sorted in parallel and the top N Gaussian components are selected, finally giving a matrix of per-frame values under the Gaussian mixture model: Loglike = E(X) * D(X)^-1 * X^T - 0.5 * D(X)^-1 * (X.^2)^T, where Loglike is the log-likelihood matrix, E(X) is the mean matrix trained by the universal background channel model, D(X) is the covariance matrix, X is the data matrix, and X.^2 is X with each entry squared.

2) Computing posterior probabilities: for each frame of data X, X * X^T is computed, giving a symmetric matrix that can be reduced to its lower triangle, whose elements are laid out in order as a single row; the data thus becomes a vector of N frames times the number of lower-triangular entries. The vectors of all frames are combined into a new data matrix, and the covariance matrices used for probability computation in the universal background model are likewise each reduced to a lower-triangular matrix, yielding a matrix similar in form to the new data matrix. The log-likelihood of each frame under the selected Gaussian components is then computed from the mean matrix and covariance matrices of the universal background channel model, softmax regression is applied, and finally a normalisation step yields the posterior probability distribution of each frame under the Gaussian mixture model; the per-frame probability distribution vectors are assembled into a probability matrix.
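
A minimal sketch of the final normalisation step, assuming the per-frame log-likelihoods under the selected Gaussian components are already available as a NumPy matrix; the softmax turns them into the per-frame posterior distribution described above:

    # Illustrative sketch only: the input layout and names are assumptions.
    import numpy as np

    def frame_posteriors(loglike: np.ndarray) -> np.ndarray:
        # loglike: (n_frames x n_selected_gaussians) log-likelihood matrix
        shifted = loglike - loglike.max(axis=1, keepdims=True)  # for numerical stability
        p = np.exp(shifted)
        return p / p.sum(axis=1, keepdims=True)  # each row is one frame's posterior distribution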

3) Extracting the current voiceprint identification vector: first the first-order and second-order coefficients are computed. The first-order coefficients can be obtained by summing over the columns of the probability matrix: Gamma_i = Σ_j loglikes_ji, where Gamma_i is the i-th element of the first-order coefficient vector and loglikes_ji is the element in row j, column i of the probability matrix.

The second-order coefficients can be obtained by multiplying the transpose of the probability matrix by the data matrix: X = Loglike^T * feats, where X is the second-order coefficient matrix, Loglike is the probability matrix, and feats is the feature data matrix.

After the first-order and second-order coefficients have been computed, the linear and quadratic terms are computed in parallel, and the current voiceprint identification vector is then computed from the linear and quadratic terms.
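
A minimal sketch of the coefficient accumulation just described, assuming `post` is the per-frame probability (posterior) matrix and `feats` is the feature data matrix; the names are illustrative:

    # Illustrative sketch only, following the two formulas above.
    import numpy as np

    def accumulate_statistics(post: np.ndarray, feats: np.ndarray):
        gamma = post.sum(axis=0)   # first-order coefficients: column sums of the probability matrix
        second = post.T @ feats    # second-order coefficients: Loglike^T * feats
        return gamma, second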

Preferably, the background channel model is a Gaussian mixture model, and before step S1 the method comprises: acquiring a preset number of voice data samples, acquiring the voiceprint features of each sample, and constructing the voiceprint feature vector of each sample based on those features; dividing the voiceprint feature vectors of the samples into a training set of a first proportion and a validation set of a second proportion, the sum of the two proportions being less than or equal to 1; training the Gaussian mixture model with the voiceprint feature vectors in the training set and, after training, verifying the accuracy of the trained model with the validation set; and, if the accuracy is greater than a preset threshold, ending training and using the trained Gaussian mixture model as the background channel model of step S2, or, if the accuracy is less than or equal to the preset threshold, increasing the number of voice data samples and retraining on the enlarged sample set.

When the Gaussian mixture model is trained with the voiceprint feature vectors in the training set, the likelihood of an extracted D-dimensional voiceprint feature can be expressed with K Gaussian components as: P(x) = Σ_{k=1..K} w_k * p(x|k), where P(x) is the probability of a voice data sample under the Gaussian mixture model, w_k is the weight of the k-th Gaussian component, p(x|k) is the probability of the sample under the k-th Gaussian component, and K is the number of Gaussian components.

The parameters of the whole Gaussian mixture model can be written as {w_i, μ_i, Σ_i}, where w_i is the weight of the i-th Gaussian component, μ_i its mean, and Σ_i its covariance. The Gaussian mixture model can be trained with the unsupervised EM algorithm. After training, the weight vector, the constant vector, the N covariance matrices, and the matrix of means multiplied by covariances of the Gaussian mixture model are obtained; together these constitute a trained Gaussian mixture model.

Step S3: calculate the spatial distance between the current voiceprint identification vector and the user's pre-stored standard voiceprint identification vector, authenticate the user based on that distance, and generate a verification result.

There are several distances between vectors, including the cosine distance and the Euclidean distance. Preferably, the spatial distance of this embodiment is the cosine distance, which uses the cosine of the angle between two vectors in a vector space as a measure of the difference between two individuals.

The standard voiceprint identification vector is a voiceprint identification vector obtained and stored in advance; when stored, it carries the identification information of its corresponding user, so it can accurately represent that user's identity. Before the spatial distance is calculated, the stored voiceprint identification vector is retrieved according to the identification information provided by the user.

If the calculated spatial distance is less than or equal to a preset distance threshold, verification passes; otherwise, verification fails.
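
A minimal sketch of this decision rule, assuming both identification vectors are NumPy arrays and adopting the common convention that the cosine distance is one minus the cosine similarity:

    # Illustrative sketch only: the distance convention is an assumption.
    import numpy as np

    def verify(current: np.ndarray, standard: np.ndarray, threshold: float) -> bool:
        cos_sim = np.dot(current, standard) / (np.linalg.norm(current) * np.linalg.norm(standard))
        cos_dist = 1.0 - cos_sim
        return cos_dist <= threshold  # pass if within the preset distance threshold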

Compared with the prior art, the background channel model pre-trained in this embodiment is obtained by mining and comparison training on a large amount of voice data. This model preserves the user's voiceprint features to the greatest extent while precisely characterising the background voiceprint features present when the user speaks, and it can remove these background features at recognition time so that the inherent features of the user's voice are extracted, which greatly improves the accuracy and efficiency of user identity verification. In addition, this embodiment makes full use of the voiceprint features of the human voice that are related to the vocal tract; such voiceprint features impose no restriction on the text spoken, giving greater flexibility in recognition and verification.

In a preferred embodiment, as shown in FIG. 2 and building on the embodiment of FIG. 1, step S1 comprises the following sub-steps. Sub-step S11: perform pre-emphasis, framing and windowing on the voice data. In this embodiment, after the voice data of the user to be authenticated is received, the voice data is processed. The pre-emphasis is in fact high-pass filtering, which removes low-frequency content so that the high-frequency characteristics of the voice data stand out. Specifically, the transfer function of the high-pass filter is H(z) = 1 - α·z^(-1), where z is the voice data and α is a constant coefficient, preferably 0.97. Because a sound signal is stationary only over short intervals, the signal is divided into N short-time segments (N frames), and, to avoid losing the continuity of the sound, adjacent frames share an overlapping region, generally half the frame length. After framing, each frame is treated as a stationary signal; however, because of the Gibbs effect, the starting and ending frames of the voice data are discontinuous and deviate further from the original speech after framing, so the voice data must also be windowed.
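
A minimal sketch of sub-step S11, assuming NumPy and a Hamming window; the 0.97 pre-emphasis coefficient and the half-frame overlap follow the description above, while the frame length is illustrative:

    # Illustrative sketch only: frame length and window choice are assumptions.
    import numpy as np

    def preprocess(signal: np.ndarray, frame_len: int = 400, alpha: float = 0.97) -> np.ndarray:
        emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])  # H(z) = 1 - alpha*z^-1
        hop = frame_len // 2  # adjacent frames overlap by half a frame length
        n_frames = 1 + (len(emphasized) - frame_len) // hop
        frames = np.stack([emphasized[i * hop: i * hop + frame_len] for i in range(n_frames)])
        return frames * np.hamming(frame_len)  # windowing smooths the frame edges (Gibbs effect)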

Sub-step S12: apply a Fourier transform to each windowed frame to obtain its spectrum. Sub-step S13: pass the spectrum through a Mel filter bank to obtain the Mel spectrum. Sub-step S14: perform cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), and compose the corresponding voiceprint feature vector from the MFCC. The cepstral analysis consists, for example, of taking the logarithm and applying an inverse transform; the inverse transform is generally implemented by the DCT (discrete cosine transform), and the 2nd to 13th coefficients after the DCT are taken as the MFCC coefficients. The MFCC of a frame are the voiceprint features of that frame of voice data; the MFCC of all frames are assembled into a feature data matrix, which is the voiceprint feature vector of the voice data.
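
A minimal sketch of sub-steps S12 to S14, assuming NumPy and SciPy and a precomputed Mel filter bank matrix `mel_fb` (an assumption, not part of this disclosure); it keeps the 2nd to 13th DCT coefficients as described:

    # Illustrative sketch only: mel_fb is assumed given, shaped (n_mels, n_fft//2 + 1).
    import numpy as np
    from scipy.fftpack import dct

    def frames_to_mfcc(frames: np.ndarray, mel_fb: np.ndarray) -> np.ndarray:
        spectrum = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # S12: spectrum of each windowed frame
        mel_spec = spectrum @ mel_fb.T                        # S13: Mel filter bank output
        log_mel = np.log(mel_spec + 1e-10)                    # S14: cepstral analysis, log step
        cepstra = dct(log_mel, type=2, axis=1, norm='ortho')  # S14: inverse transform via DCT
        return cepstra[:, 1:13]  # keep the 2nd to 13th coefficients as the MFCC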

In a preferred embodiment, as shown in FIG. 3 and building on the embodiment of FIG. 1, step S3 comprises the following sub-steps. Sub-step S31: calculate the cosine distance between the current voiceprint identification vector and the user's pre-stored standard voiceprint identification vector, cos θ = (A · B) / (||A|| ||B||), where A is the standard voiceprint identification vector and B is the current voiceprint identification vector. Sub-step S32: if the cosine distance is less than or equal to a preset distance threshold, generate verification-passed information. Sub-step S33: if the cosine distance is greater than the preset distance threshold, generate verification-failed information.

In a preferred embodiment, building on the embodiment of FIG. 1 above, step S3 is replaced by: calculating the spatial distances between the current voiceprint identification vector and each of the pre-stored standard voiceprint identification vectors, taking the smallest of these distances, authenticating the user based on that smallest distance, and generating a verification result.

Unlike the embodiment of FIG. 1, this embodiment does not attach the user's identification information when storing the standard voiceprint identification vectors. When the user's identity is verified, the spatial distances between the current voiceprint identification vector and each of the pre-stored standard voiceprint identification vectors are calculated and the smallest distance is taken; if this smallest distance is less than a preset distance threshold (which may be the same as or different from the distance threshold of the previous embodiment), verification passes, otherwise verification fails.
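
A minimal sketch of this 1:N variant, assuming NumPy, row-stacked stored vectors, and the same cosine-distance convention as above:

    # Illustrative sketch only: storage layout and distance convention are assumptions.
    import numpy as np

    def identify(current: np.ndarray, stored: np.ndarray, threshold: float) -> bool:
        sims = stored @ current / (np.linalg.norm(stored, axis=1) * np.linalg.norm(current))
        min_dist = (1.0 - sims).min()  # smallest cosine distance over all stored vectors
        return min_dist < threshold    # pass only if the closest enrolled vector is near enough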

Referring to FIG. 4, FIG. 4 is a schematic diagram of the operating environment of a preferred embodiment of the system 10 for identity verification based on voiceprint recognition of the present invention.

In this embodiment, the system 10 for identity verification based on voiceprint recognition is installed and runs in an electronic device 1. The electronic device 1 may be a computing device such as a desktop computer, a notebook, a palmtop computer or a server. The electronic device 1 may include, but is not limited to, a memory 11, a processor 12 and a display 13. FIG. 4 shows only the electronic device 1 with components 11-13, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.

In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, for example a hard disk or internal memory of the electronic device 1. In other embodiments, the memory 11 may be an external storage device of the electronic device 1, for example a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card provided on the electronic device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 is used to store the application software installed on the electronic device 1 and various kinds of data, for example the program code of the system 10 for identity verification based on voiceprint recognition. The memory 11 may also be used to temporarily store data that has been output or is about to be output.

In some embodiments, the processor 12 may be a central processing unit (CPU), microprocessor or other data-processing chip, used to run the program code stored in the memory 11 or to process data, for example to execute the system 10 for identity verification based on voiceprint recognition.

In some embodiments, the display 13 may be an LED display, a liquid crystal display, a touch liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display 13 is used to display the information processed in the electronic device 1 and a visual user interface, for example a voiceprint recognition interface. The components 11-13 of the electronic device 1 communicate with one another over a system bus.

Referring to FIG. 5, FIG. 5 is a functional module diagram of a preferred embodiment of the system 10 for identity verification based on voiceprint recognition of the present invention. In this embodiment, the system 10 may be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to carry out the present invention. For example, in FIG. 5, the system 10 is divided into a first acquisition module 101, a construction module 102 and a first verification module 103. A module as referred to in the present invention is a series of computer program instruction segments capable of performing a specific function, and is more suitable than a program for describing how the system 10 executes in the electronic device 1. The first acquisition module 101 is configured to acquire the voiceprint features of the voice data after receiving voice data from a user to be authenticated, and to construct a corresponding voiceprint feature vector based on those features. In this embodiment, the voice data is collected by a voice collection device (for example, a microphone), which sends the collected voice data to the system for identity verification based on voiceprint recognition.

When collecting voice data, interference from environmental noise and from the voice collection device itself should be avoided as far as possible. The voice collection device should be kept at an appropriate distance from the user, and devices with high distortion should be avoided; mains power is preferred, with the current kept stable; a sensor should be used when recording telephone calls. Before the voiceprint features are extracted, the voice data may be de-noised to further reduce interference. To ensure that voiceprint features can be extracted, the collected voice data has a preset data length, or a length greater than the preset data length.

Voiceprint features come in several types, such as broadband voiceprints, narrowband voiceprints and amplitude voiceprints. The voiceprint features of this embodiment are preferably the Mel-frequency cepstral coefficients (MFCC) of the voice data. When constructing the corresponding voiceprint feature vector, the voiceprint features of the voice data are assembled into a feature data matrix, and this matrix is the voiceprint feature vector of the voice data.

The construction module 102 is configured to input the voiceprint feature vector into a background channel model generated by training in advance, to construct the current voiceprint identification vector corresponding to the voice data. Preferably, the background channel model is a Gaussian mixture model; the model is used to process the voiceprint feature vector and obtain the corresponding current voiceprint identification vector (the i-vector).

Specifically, the calculation proceeds as follows.

1) Selecting Gaussian components: first, the parameters of the universal background channel model are used to compute the log-likelihood of each frame of data under the different Gaussian components; the columns of the log-likelihood matrix are sorted in parallel and the top N Gaussian components are selected, finally giving a matrix of per-frame values under the Gaussian mixture model: Loglike = E(X) * D(X)^-1 * X^T - 0.5 * D(X)^-1 * (X.^2)^T, where Loglike is the log-likelihood matrix, E(X) is the mean matrix trained by the universal background channel model, D(X) is the covariance matrix, X is the data matrix, and X.^2 is X with each entry squared.

2) Computing posterior probabilities: for each frame of data X, X * X^T is computed, giving a symmetric matrix that can be reduced to its lower triangle, whose elements are laid out in order as a single row; the data thus becomes a vector of N frames times the number of lower-triangular entries. The vectors of all frames are combined into a new data matrix, and the covariance matrices used for probability computation in the universal background model are likewise each reduced to a lower-triangular matrix, yielding a matrix similar in form to the new data matrix. The log-likelihood of each frame under the selected Gaussian components is then computed from the mean matrix and covariance matrices of the universal background channel model, softmax regression is applied, and finally a normalisation step yields the posterior probability distribution of each frame under the Gaussian mixture model; the per-frame probability distribution vectors are assembled into a probability matrix.

3) Extracting the current voiceprint identification vector: first the first-order and second-order coefficients are computed. The first-order coefficients can be obtained by summing over the columns of the probability matrix: Gamma_i = Σ_j loglikes_ji, where Gamma_i is the i-th element of the first-order coefficient vector and loglikes_ji is the element in row j, column i of the probability matrix.

The second-order coefficients can be obtained by multiplying the transpose of the probability matrix by the data matrix: X = Loglike^T * feats, where X is the second-order coefficient matrix, Loglike is the probability matrix, and feats is the feature data matrix.

After the first-order and second-order coefficients have been computed, the linear and quadratic terms are computed in parallel, and the current voiceprint identification vector is then computed from the linear and quadratic terms.

Preferably, the background channel model is a Gaussian mixture model, and the system for identity verification based on voiceprint recognition further comprises: a second acquisition module, configured to acquire a preset number of voice data samples, to acquire the voiceprint features of each sample, and to construct the voiceprint feature vector of each sample based on those features; a division module, configured to divide the voiceprint feature vectors of the samples into a training set of a first proportion and a validation set of a second proportion, the sum of the first and second proportions being less than or equal to 1; a training module, configured to train the Gaussian mixture model with the voiceprint feature vectors in the training set and, after training is completed, to verify the accuracy of the trained Gaussian mixture model with the validation set; and a processing module, configured to end training and use the trained Gaussian mixture model as the background channel model if the accuracy is greater than a preset threshold, or, if the accuracy is less than or equal to the preset threshold, to increase the number of voice data samples and retrain on the enlarged sample set.

When the Gaussian mixture model is trained with the voiceprint feature vectors in the training set, the likelihood of an extracted D-dimensional voiceprint feature can be expressed with K Gaussian components as: P(x) = Σ_{k=1..K} w_k * p(x|k), where P(x) is the probability of a voice data sample under the Gaussian mixture model, w_k is the weight of the k-th Gaussian component, p(x|k) is the probability of the sample under the k-th Gaussian component, and K is the number of Gaussian components.

The parameters of the whole Gaussian mixture model can be written as {w_i, μ_i, Σ_i}, where w_i is the weight of the i-th Gaussian component, μ_i its mean, and Σ_i its covariance. The Gaussian mixture model can be trained with the unsupervised EM algorithm. After training, the weight vector, the constant vector, the N covariance matrices, and the matrix of means multiplied by covariances of the Gaussian mixture model are obtained; together these constitute a trained Gaussian mixture model.

The first verification module 103 is configured to calculate the spatial distance between the current voiceprint identification vector and the user's pre-stored standard voiceprint identification vector, to authenticate the user based on that distance, and to generate a verification result.

There are several distances between vectors, including the cosine distance and the Euclidean distance. Preferably, the spatial distance of this embodiment is the cosine distance, which uses the cosine of the angle between two vectors in a vector space as a measure of the difference between two individuals.

The standard voiceprint identification vector is a voiceprint identification vector obtained and stored in advance; when stored, it carries the identification information of its corresponding user, so it can accurately represent that user's identity. Before the spatial distance is calculated, the stored voiceprint identification vector is retrieved according to the identification information provided by the user.

If the calculated spatial distance is less than or equal to a preset distance threshold, verification passes; otherwise, verification fails.

In a preferred embodiment, building on the embodiment of FIG. 5 above, the first acquisition module 101 is specifically configured to perform pre-emphasis, framing and windowing on the voice data; to apply a Fourier transform to each windowed frame to obtain its spectrum; to pass the spectrum through a Mel filter bank to obtain the Mel spectrum; and to perform cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), composing the corresponding voiceprint feature vector from the MFCC.

The pre-emphasis is in fact high-pass filtering, which removes low-frequency content so that the high-frequency characteristics of the voice data stand out. Specifically, the transfer function of the high-pass filter is H(z) = 1 - α·z^(-1), where z is the voice data and α is a constant coefficient, preferably 0.97. Because a sound signal is stationary only over short intervals, the signal is divided into N short-time segments (N frames), and, to avoid losing the continuity of the sound, adjacent frames share an overlapping region, generally half the frame length. After framing, each frame is treated as a stationary signal; however, because of the Gibbs effect, the starting and ending frames of the voice data are discontinuous and deviate further from the original speech after framing, so the voice data must also be windowed.

The cepstral analysis consists, for example, of taking the logarithm and applying an inverse transform; the inverse transform is generally implemented by the DCT (discrete cosine transform), and the 2nd to 13th coefficients after the DCT are taken as the MFCC coefficients. The MFCC of a frame are the voiceprint features of that frame of voice data; the MFCC of all frames are assembled into a feature data matrix, which is the voiceprint feature vector of the voice data.

In a preferred embodiment, building on the embodiment of FIG. 5 above, the first verification module 103 is specifically configured to calculate the cosine distance between the current voiceprint identification vector and the user's pre-stored standard voiceprint identification vector, cos θ = (A · B) / (||A|| ||B||), where A is the standard voiceprint identification vector and B is the current voiceprint identification vector; to generate verification-passed information if the cosine distance is less than or equal to a preset distance threshold; and to generate verification-failed information if the cosine distance is greater than the preset distance threshold.

In a preferred embodiment, building on the embodiment of FIG. 4 above, the first verification module is replaced by a second verification module, configured to calculate the spatial distances between the current voiceprint identification vector and each of the pre-stored standard voiceprint identification vectors, to take the smallest of these distances, to authenticate the user based on that smallest distance, and to generate a verification result.

Unlike the embodiment of FIG. 5, this embodiment does not attach the user's identification information when storing the standard voiceprint identification vectors. When the user's identity is verified, the spatial distances between the current voiceprint identification vector and each of the pre-stored standard voiceprint identification vectors are calculated and the smallest distance is taken; if this smallest distance is less than a preset distance threshold (which may be the same as or different from the distance threshold of the previous embodiment), verification passes, otherwise verification fails.

The above are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A method of authentication based on voiceprint recognition, comprising the steps of: after receiving voice data of a user undergoing authentication, acquiring a voiceprint feature of the voice data and constructing a corresponding voiceprint feature vector based on the voiceprint feature; inputting the voiceprint feature vector into a background channel model generated by pre-training, to construct a current voiceprint discrimination vector corresponding to the voice data; and computing a spatial distance between the current voiceprint discrimination vector and a pre-stored standard voiceprint discrimination vector of the user, authenticating the user based on the spatial distance, and generating a verification result.

2. The method of authentication based on voiceprint recognition according to claim 1, wherein the step of acquiring the voiceprint feature of the voice data and constructing the corresponding voiceprint feature vector comprises the sub-steps of: performing pre-weighting, framing, and windowing on the voice data; performing a Fourier transform on each windowed frame to obtain a corresponding spectrum; inputting the spectrum into a Mel filter to output a Mel spectrum; and performing cepstral analysis on the Mel spectrum to obtain Mel-frequency cepstral coefficients, and composing the corresponding voiceprint feature vector from the Mel-frequency cepstral coefficients.

3. The method of authentication based on voiceprint recognition according to claim 1, wherein the step of computing the spatial distance, authenticating the user, and generating the verification result comprises the sub-steps of: computing a cosine distance between the current voiceprint discrimination vector and the pre-stored standard voiceprint discrimination vector of the user: d(A, B) = 1 - (Σᵢ AᵢBᵢ) / (√(Σᵢ Aᵢ²) · √(Σᵢ Bᵢ²)), where A is the standard voiceprint discrimination vector and B is the current voiceprint discrimination vector; if the cosine distance is less than or equal to a preset distance threshold, generating information that the verification has passed; and if the cosine distance is greater than the preset distance threshold, generating information that the verification has failed.

4. The method of authentication based on voiceprint recognition according to any one of claims 1 to 3, further comprising, before the step of acquiring the voiceprint feature of the voice data, the steps of: acquiring a preset number of voice data samples, acquiring a voiceprint feature corresponding to each voice data sample, and constructing a voiceprint feature vector corresponding to each voice data sample based on that voiceprint feature; dividing the voiceprint feature vectors corresponding to the voice data samples into a training set of a first proportion and a validation set of a second proportion, the sum of the first proportion and the second proportion being less than or equal to 1; training a Gaussian mixture model with the voiceprint feature vectors in the training set and, after training is complete, verifying the accuracy of the trained Gaussian mixture model with the validation set; and if the accuracy is greater than a preset threshold, ending the training and using the trained Gaussian mixture model as the background channel model in the step of inputting the voiceprint feature vector into the background channel model generated by pre-training, or, if the accuracy is less than or equal to the preset threshold, increasing the number of voice data samples and retraining based on the increased voice data samples.

5. The method of authentication based on voiceprint recognition according to claim 1 or 2, wherein the step of computing the spatial distance, authenticating the user, and generating the verification result is replaced with: computing spatial distances between the current voiceprint discrimination vector and a plurality of pre-stored standard voiceprint discrimination vectors, obtaining a minimum spatial distance, authenticating the user based on the minimum spatial distance, and generating a verification result.

6. A system of authentication based on voiceprint recognition, comprising: a first acquisition module, configured to, after receiving voice data of a user undergoing authentication, acquire a voiceprint feature of the voice data and construct a corresponding voiceprint feature vector based on the voiceprint feature; a construction module, configured to input the voiceprint feature vector into a background channel model generated by pre-training, to construct a current voiceprint discrimination vector corresponding to the voice data; and a first verification module, configured to compute a spatial distance between the current voiceprint discrimination vector and a pre-stored standard voiceprint discrimination vector of the user, authenticate the user based on the spatial distance, and generate a verification result.

7. The system of authentication based on voiceprint recognition according to claim 6, wherein the first acquisition module is specifically configured to: perform pre-weighting, framing, and windowing on the voice data; perform a Fourier transform on each windowed frame to obtain a corresponding spectrum; input the spectrum into a Mel filter to output a Mel spectrum; and perform cepstral analysis on the Mel spectrum to obtain Mel-frequency cepstral coefficients, and compose the corresponding voiceprint feature vector from the Mel-frequency cepstral coefficients.

8. The system of authentication based on voiceprint recognition according to claim 6, wherein the first verification module is specifically configured to: compute a cosine distance between the current voiceprint discrimination vector and the pre-stored standard voiceprint discrimination vector of the user: d(A, B) = 1 - (Σᵢ AᵢBᵢ) / (√(Σᵢ Aᵢ²) · √(Σᵢ Bᵢ²)), where A is the standard voiceprint discrimination vector and B is the current voiceprint discrimination vector; if the cosine distance is less than or equal to a preset distance threshold, generate information that the verification has passed; and if the cosine distance is greater than the preset distance threshold, generate information that the verification has failed.

9. The system of authentication based on voiceprint recognition according to any one of claims 6 to 8, further comprising: a second acquisition module, configured to acquire a preset number of voice data samples, acquire a voiceprint feature corresponding to each voice data sample, and construct a voiceprint feature vector corresponding to each voice data sample based on that voiceprint feature; a division module, configured to divide the voiceprint feature vectors corresponding to the voice data samples into a training set of a first proportion and a validation set of a second proportion, the sum of the first proportion and the second proportion being less than or equal to 1; a training module, configured to train a Gaussian mixture model with the voiceprint feature vectors in the training set and, after training is complete, verify the accuracy of the trained Gaussian mixture model with the validation set; and a processing module, configured to, if the accuracy is greater than a preset threshold, end the training and use the trained Gaussian mixture model as the background channel model, or, if the accuracy is less than or equal to the preset threshold, increase the number of voice data samples and retrain based on the increased voice data samples.

10. The system of authentication based on voiceprint recognition according to claim 6 or 7, wherein the first verification module is replaced with a second verification module, configured to compute spatial distances between the current voiceprint discrimination vector and a plurality of pre-stored standard voiceprint discrimination vectors, obtain a minimum spatial distance, authenticate the user based on the minimum spatial distance, and generate a verification result.
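For illustration, the background-channel training loop recited in claims 4 and 9 can be sketched as follows; samples is assumed to be a list of per-utterance MFCC feature matrices, and evaluate_accuracy and get_more_samples are caller-supplied placeholders, since the claims do not fix how the accuracy is scored or how additional samples are obtained. The component count, split ratios, and accuracy threshold are likewise illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_background_model(samples, evaluate_accuracy, get_more_samples,
                           train_ratio=0.7, val_ratio=0.3, acc_threshold=0.9):
    while True:
        np.random.shuffle(samples)                       # random train/validation split
        n_train = int(len(samples) * train_ratio)
        n_val = int(len(samples) * val_ratio)            # train_ratio + val_ratio <= 1
        train = samples[:n_train]
        val = samples[n_train:n_train + n_val]
        gmm = GaussianMixture(n_components=64, covariance_type='diag')
        gmm.fit(np.vstack(train))                        # pool the per-utterance MFCC matrices
        if evaluate_accuracy(gmm, val) > acc_threshold:  # accuracy above the preset threshold
            return gmm                                   # keep as the background channel model
        samples = samples + get_more_samples()           # otherwise enlarge the set and retrain
```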
Applications Claiming Priority (2)

CN201710147695.XA, priority date 2017-03-13: The method and system of authentication based on Application on Voiceprint Recognition (CN107068154A)
TW106135250A, priority date 2017-03-13, filing date 2017-10-13: Method and system of authentication based on voiceprint recognition (TWI641965B)

Publications (2)

TW201833810A, published 2018-09-16
TWI641965B, published 2018-11-21

Family ID: 59622093

Country status: CN (2), TW (1), WO (2)


Also Published As

CN107068154A, published 2017-08-18
WO2018166112A1, published 2018-09-20
WO2018166187A1, published 2018-09-20
TWI641965B, published 2018-11-21
CN107517207A, published 2017-12-26
