CN109326294B - Text-related voiceprint key generation method - Google Patents
Text-related voiceprint key generation method
- Publication number
- CN109326294B CN109326294B CN201811139547.4A CN201811139547A CN109326294B CN 109326294 B CN109326294 B CN 109326294B CN 201811139547 A CN201811139547 A CN 201811139547A CN 109326294 B CN109326294 B CN 109326294B
- Authority
- CN
- China
- Prior art keywords
- voiceprint
- key
- spectrogram
- matrix
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/08—Key distribution or management, e.g. generation, sharing or updating, of cryptographic keys or passwords
- H04L9/0861—Generation of secret information including derivation or calculation of cryptographic keys or passwords
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/08—Key distribution or management, e.g. generation, sharing or updating, of cryptographic keys or passwords
- H04L9/0861—Generation of secret information including derivation or calculation of cryptographic keys or passwords
- H04L9/0866—Generation of secret information including derivation or calculation of cryptographic keys or passwords involving user or device identifiers, e.g. serial number, physical or biometrical information, DNA, hand-signature or measurable physical characteristics
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Complex Calculations (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The invention relates to a text-related voiceprint key generation method comprising voiceprint key training and voiceprint key extraction. Voiceprint key training produces a voiceprint key extraction matrix from voiceprint samples collected in advance. In voiceprint key extraction, a voiceprint sample is preprocessed and then multiplied by the key extraction matrix obtained in training to yield the voiceprint key. The invention uses the spectrogram of the speaker's text-related speech to express the speaker's voice characteristics more fully while keeping successive samples highly similar. On this basis, a machine learning method trains a stable voiceprint feature extraction matrix from multiple spectrograms; processing subsequent samples with this matrix extracts a more stable voiceprint key. The method is stable, simple, and convenient to use.
Description
Technical Field
The invention belongs to the technical field of cyberspace security and relates to a text-related voiceprint key generation method.
Background
Voiceprint recognition is a mature biometric recognition technology. With the rapid development of artificial intelligence in recent years, voiceprint recognition accuracy has improved considerably and can exceed 96% in low-noise environments, and the technology is widely applied in identity authentication scenarios.
As voiceprint technology has been applied more deeply, the field has attempted to extract a stable digital sequence directly from a human voiceprint for use as a biometric key, i.e., to generate keys directly from the voiceprint. Such keys integrate seamlessly with existing password and public/private-key cryptography, avoid the inconvenience and potential security problems of collecting and storing voiceprints, and further enrich the means and methods of network authentication.
Voiceprint biometric key technology has already received some study. For example, Chinese patent ZL201110003202.8, a voiceprint-based document encryption and decryption method, gives a scheme for extracting a stable key sequence from voiceprint information, but it stabilizes the voiceprint feature values only with a checkerboard method, so the stabilization effect is limited and the key length is insufficient. Chinese patent ZL201410074511.8, a method for generating a human voiceprint biometric key, gives a technical route that extracts a voiceprint Gaussian model and projects the model's feature parameters into a high-dimensional space to obtain a stable voiceprint key. The stability of the key obtained by this scheme is clearly better than that of the earlier patent, but for key authentication environments with high stability requirements, it still needs further improvement.
Disclosure of Invention
The invention aims to provide a text-related voiceprint key generation method.
The method comprises voiceprint key training and voiceprint key extraction. Voiceprint key training produces a voiceprint key extraction matrix from voiceprint samples collected in advance. In voiceprint key extraction, a voiceprint sample is preprocessed and then multiplied by the key extraction matrix obtained in training to yield the voiceprint key. The specific steps are as follows:
step one, voiceprint key training, which comprises the following specific steps:
Firstly, the user records his or her own voice reading the same text, generally 1-3 consecutive words, more than 20 times; the number of repetitions can be adjusted by the user according to the training situation.
Secondly, record more than 10 different users reading the same text, each repeating it more than 20 times; and record more than 10 different users reading different texts of similar duration, each repeating more than 20 times.
Thirdly, preprocess the voices recorded in the first and second steps and extract voiceprint spectrograms; the specific steps are as follows:
1) Pre-emphasis (Pre-Emphasis):
Let S1(n) denote the speech time-domain signal, n = 0, 1, 2, ..., N-1. The pre-emphasis formula is S(n) = S1(n) - a·S1(n-1), where 0.9 < a < 1.0; a is the pre-emphasis coefficient that controls how strongly the amplitude is boosted.
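For illustration only, a minimal NumPy sketch of this pre-emphasis step is given below; the concrete coefficient a = 0.97 is an assumed value inside the stated range 0.9 < a < 1.0, not a value the patent fixes.

```python
import numpy as np

def pre_emphasis(s1: np.ndarray, a: float = 0.97) -> np.ndarray:
    """Apply S(n) = S1(n) - a * S1(n-1); the first sample is passed through."""
    return np.append(s1[0], s1[1:] - a * s1[:-1])
```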
2) Framing, i.e., dividing the speech signal into frames.
3) Hamming window (Hamming Window) processing:
Let S(n), n = 0, 1, 2, ..., N-1, denote the framed speech time-domain signal, i.e., the speech signal divided into N segments. Multiplying by the Hamming window gives the time-domain signal S'(n); see formula (1):
S'(n) = S(n) · W(n)   (1)
where W(n) is the Hamming window function W(n) = (1 - a) - a·cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1, with a = 0.46; a may range over 0.3-0.7, the specific value being determined from experimental and empirical data. The Hamming window has a smoother low-pass characteristic and better reflects the frequency characteristics of the short-time speech signal.
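The framing and windowing of steps 2) and 3) can be sketched as follows, assuming illustrative framing parameters (25 ms frames with a 10 ms hop at a 16 kHz sampling rate); the patent does not fix the frame length or overlap.

```python
import numpy as np

def frame_and_window(signal: np.ndarray, frame_len: int = 400,
                     hop: int = 160, a: float = 0.46) -> np.ndarray:
    """Split the signal into frames, then apply the Hamming window
    W(n) = (1 - a) - a * cos(2*pi*n / (N - 1)) to each frame."""
    n = np.arange(frame_len)
    window = (1 - a) - a * np.cos(2 * np.pi * n / (frame_len - 1))
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    return frames * window  # S'(n) = S(n) * W(n), per frame
```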
4) Fast Fourier Transform (FFT):
Perform a radix-2 FFT, a standard algorithm in the field, on the windowed speech time-domain signal S'(n) to obtain the linear spectrum X(n, k). X(n, k) is the spectral energy density function of the n-th speech frame; k indexes the spectrum bins, and each speech frame corresponds to a time slice on the time axis.
5) Generating the text-related voiceprint spectrogram:
With time n as the time-axis coordinate and k as the frequency-axis coordinate, the value of |X(n, k)|² is rendered as a gray level at the corresponding coordinate point, forming the voiceprint spectrogram. Applying the transform 10·log10(|X(n, k)|²) yields the dB representation of the spectrogram.
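Continuing the sketch, the per-frame radix-2 FFT and the dB spectrogram of steps 4) and 5) may be computed as below; the FFT size of 512 is an assumed value.

```python
import numpy as np

def spectrogram_db(frames: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Linear spectrum X(n, k) per windowed frame, then 10*log10(|X(n,k)|^2)."""
    X = np.fft.rfft(frames, n=n_fft, axis=1)  # rows: frames n, columns: bins k
    power = np.abs(X) ** 2                    # |X(n, k)|^2
    return 10.0 * np.log10(power + 1e-12)     # small offset avoids log(0)
```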
Fourthly, preprocess the voiceprint spectrogram by filtering, normalization, and the like. Suitable filters include the general signal-processing options of Gaussian filtering, wavelet filtering, and binarization; which one, or which combination, is used is chosen by the user according to actual test conditions. Normalization means unifying the spectrograms to a fixed length and width and mapping each pixel value into the range 0-255; any general method in the field may be used, for example the imresize function in the MATLAB function library for image resizing.
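A possible form of the normalization step, using Pillow's resize in place of MATLAB's imresize; the 256 x 256 target size is an assumption, since the patent requires only some fixed length and width.

```python
import numpy as np
from PIL import Image

def normalize_spectrogram(spec: np.ndarray,
                          size: tuple = (256, 256)) -> np.ndarray:
    """Map spectrogram values into 0-255 and resize to a fixed width/height."""
    lo, hi = spec.min(), spec.max()
    gray = np.uint8(255 * (spec - lo) / (hi - lo + 1e-12))
    return np.asarray(Image.fromarray(gray).resize(size))
```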
Fifthly, perform machine learning on the voiceprint spectrograms to obtain the voiceprint stable-feature learning matrix, i.e., the voiceprint key extraction matrix.
The voiceprint spectrograms obtained in the fourth step are divided into two classes: the spectrograms of the user reading the related text, and the contrast spectrograms formed by mixing non-user readings of the related text with readings of unrelated texts. Together these are called the positive and negative sample sets.
Let M = [M1, M2] denote the positive and negative sample sets participating in training, where M_i = [x_i1, x_i2, ..., x_iL], i ∈ {1, 2}, denotes the i-th sample set (i = 1 the positive samples, i = 2 the negative samples), and x_ir ∈ R^d, 1 ≤ i ≤ 2, 1 ≤ r ≤ L. Each x_ir is formed by taking the two-dimensional matrix of pixel values of one voiceprint spectrogram, concatenating its rows in order into a one-dimensional row vector, and transposing it into a one-dimensional column vector of length d. R^d denotes the d-dimensional real vector space, and L indicates that each sample set contains L voiceprint spectrograms, i.e., L column vectors.
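Building the sample matrices M1 and M2 then amounts to flattening each normalized spectrogram row by row into a column vector, for example:

```python
import numpy as np

def build_sample_matrix(spectrograms: list) -> np.ndarray:
    """Stack row-major-flattened spectrograms as columns: a d x L matrix M_i."""
    return np.stack([s.flatten() for s in spectrograms], axis=1)

# M1 = build_sample_matrix(positive_specs)
# M2 = build_sample_matrix(negative_specs)
```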
Now, from the characteristics of the two sample classes, the voiceprint key extraction matrix W1 ∈ R^(d×dz) is trained so as to optimize a cost function J, where m1 is the positive-sample mean of the training samples and m2 the negative-sample mean. J reflects the distance difference, computed as Euclidean distance, between the training samples projected through the voiceprint key extraction matrix W1 and the projected means of the positive and negative sample sets.
Let H1 and H2 be the matrices defined by the cost function above. Solving for the eigenvalues and eigenvectors of the matrix (H1 - H2) yields the voiceprint key extraction matrix W1, namely (H1 - H2)·w = λ·w, where w is an eigenvector of the matrix (H1 - H2) and λ the corresponding eigenvalue.
Since {w1, w2, ..., w_dz} are the eigenvectors corresponding to the eigenvalues {λ1, λ2, ..., λ_dz}, where λ1 ≥ λ2 ≥ ... ≥ λ_dz ≥ 0, eigenvectors whose eigenvalues are less than 0 are not included in the construction of W1.
This completes the training of the voiceprint key extraction matrix W1.
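The formulas defining J, H1, and H2 appear as equation images in the source and are not reproduced here. Purely as a sketch, one construction consistent with the stated criterion takes H1 as the scatter of the positive samples about the negative-sample mean and H2 as their scatter about the positive-sample mean; for real spectrogram sizes d is large, so this dense eigendecomposition is illustrative only.

```python
import numpy as np

def train_extraction_matrix(M1: np.ndarray, M2: np.ndarray) -> np.ndarray:
    """Hypothetical reconstruction of the training step: keep eigenvectors of
    (H1 - H2) with non-negative eigenvalues, in decreasing eigenvalue order."""
    m1 = M1.mean(axis=1, keepdims=True)  # positive-sample mean
    m2 = M2.mean(axis=1, keepdims=True)  # negative-sample mean
    A1 = M1 - m2                         # positive samples about negative mean
    A2 = M1 - m1                         # positive samples about positive mean
    H1, H2 = A1 @ A1.T, A2 @ A2.T
    lam, w = np.linalg.eigh(H1 - H2)     # (H1 - H2) w = lambda w
    order = np.argsort(lam)[::-1]        # sort eigenvalues descending
    lam, w = lam[order], w[:, order]
    return w[:, lam >= 0]                # W1 = [w_1, ..., w_dz]
```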
Step two, voiceprint key extraction, which comprises the following specific steps:
Step 1. The user records about 3 seconds of his or her text-related speech.
Step 2. Extract the voiceprint spectrogram; see the third step of step one.
Step 3. Preprocess the voiceprint spectrogram by filtering, normalization, and the like, convert it to matrix form, and concatenate the rows in order to obtain the voiceprint vector x_t.
Step 4. Left-multiply the voiceprint vector x_t from step 3 by the transpose of the voiceprint stable-feature learning matrix W1 trained in step one, i.e., compute W1^T · x_t, obtaining the dz-dimensional voiceprint feature vector x_tz, the stabilized voiceprint feature vector.
Step 5. Apply the checkerboard operation to each component of x_tz to obtain a further-stabilized voiceprint feature vector.
The checkerboard method is as follows:
Denote each component of x_tz as x_tzi.
The quantization formula is Λ(x_tzi) = round(x_tzi / D), i.e., each component is mapped to the index of the checkerboard grid point nearest to it, measured from the coordinate origin. Here D is the checkerboard grid size, a positive number whose specific value the user selects from experience; Λ(x) is an integer, generally between 0 and 63; x_tzi is a component of x_tz, and Λ(x_tzi) is its quantized value.
Step 6. Take the first 32 or 64 components of the vector computed in step 5 and concatenate them in order; each component takes an integer value in 0-63 and contributes 4 bits of key material, forming a 128-bit or 256-bit voiceprint key. This completes voiceprint key extraction.
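A sketch of the complete extraction chain follows; the grid size D, the round-to-nearest quantizer (the reconstruction given above), and the way 4 bits are taken from each component are assumptions rather than values the patent fixes.

```python
import numpy as np

def extract_key_bits(W1: np.ndarray, x_t: np.ndarray, D: float = 4.0,
                     n_components: int = 32) -> str:
    """Project the voiceprint vector, checkerboard-quantize each component,
    and pack 4 bits per component: 32 components give a 128-bit key."""
    x_tz = W1.T @ x_t                                 # stabilized feature vector
    q = np.rint(x_tz[:n_components] / D).astype(int)  # Lambda(x) = round(x / D)
    return "".join(format(int(v) & 0xF, "04b") for v in q)
```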
The invention uses the spectrogram of the speaker's text-related speech to express the speaker's voice characteristics more fully while keeping successive samples highly similar. On this basis, a machine learning method trains a stable voiceprint feature extraction matrix from multiple spectrograms, and processing subsequent samples with this matrix extracts a more stable voiceprint key. The method is stable, simple, and convenient to use.
Drawings
FIG. 1 is a flowchart of voiceprint key training in accordance with the present invention;
FIG. 2 is a flowchart of voiceprint spectrogram generation in accordance with the present invention;
FIG. 3 is a spectrogram of the voiceprint of the present invention;
FIG. 4 is a flowchart of voiceprint key extraction in accordance with the present invention.
FIG. 5 is a schematic diagram of machine learning of voiceprint features in accordance with the present invention.
Detailed Description
A text-related voiceprint key generation method comprises voiceprint key training and voiceprint key extraction. Voiceprint key training produces a voiceprint key extraction matrix from voiceprint samples collected in advance. In voiceprint key extraction, the voiceprint sample is preprocessed and then multiplied by the key extraction matrix obtained in training to yield the voiceprint key. The specific steps are as follows:
step one, voiceprint key training, as shown in fig. 1, the specific steps are:
In the first step, the user records his or her own voice reading the same text, generally 1-3 consecutive words, more than 20 times (the number of repetitions can be adjusted by the user according to the training situation).
In the second step, record more than 10 different users reading the same text, each repeating it more than 20 times; and record more than 10 different users reading different texts of similar duration, each repeating more than 20 times.
In the third step, preprocess the voices recorded in the first and second steps, as shown in figs. 2 and 3, and extract voiceprint spectrograms; the specific steps are as follows:
1) Pre-emphasis (Pre-Emphasis):
Let S1(n) denote the speech time-domain signal, n = 0, 1, 2, ..., N-1. The pre-emphasis formula is S(n) = S1(n) - a·S1(n-1), where 0.9 < a < 1.0; a is the pre-emphasis coefficient that controls how strongly the amplitude is boosted.
2) Framing, i.e., dividing the speech signal into frames.
3) Hamming window (Hamming Window) processing:
Let S(n), n = 0, 1, 2, ..., N-1, denote the framed speech time-domain signal, i.e., the speech signal divided into N segments. Multiplying by the Hamming window gives the time-domain signal S'(n); see formula (1):
S'(n) = S(n) · W(n)   (1)
where W(n) is the Hamming window function W(n) = (1 - a) - a·cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1, with a = 0.46; a may range over 0.3-0.7, the specific value being determined from experimental and empirical data. The Hamming window has a smoother low-pass characteristic and better reflects the frequency characteristics of the short-time speech signal.
4) Fast Fourier Transform (FFT):
Perform a radix-2 FFT, a standard algorithm in the field, on the windowed speech time-domain signal S'(n) to obtain the linear spectrum X(n, k). X(n, k) is the spectral energy density function of the n-th speech frame; k indexes the spectrum bins, and each speech frame corresponds to a time slice on the time axis.
5) Generating the text-related voiceprint spectrogram:
With time n as the time-axis coordinate and k as the frequency-axis coordinate, the value of |X(n, k)|² is rendered as a gray level at the corresponding coordinate point, forming the voiceprint spectrogram. Applying the transform 10·log10(|X(n, k)|²) yields the dB representation of the spectrogram.
Fourthly, preprocess the voiceprint spectrogram by filtering, normalization, and the like. Suitable filters include the general signal-processing options of Gaussian filtering, wavelet filtering, and binarization; which one, or which combination, is used is chosen by the user according to actual test conditions. Normalization means unifying the spectrograms to a fixed length and width and mapping each pixel value into the range 0-255; any general method in the field may be used, for example the imresize function in the MATLAB function library for image resizing.
Fifthly, perform machine learning on the voiceprint spectrograms to obtain the voiceprint stable-feature learning matrix, i.e., the voiceprint key extraction matrix.
The voiceprint spectrograms obtained in the fourth step are divided into two classes: the spectrograms of the user reading the related text, and the contrast spectrograms formed by mixing non-user readings of the related text with readings of unrelated texts. Together these are called the positive and negative sample sets.
Let M = [M1, M2] denote the positive and negative sample sets participating in training, where M_i = [x_i1, x_i2, ..., x_iL], i ∈ {1, 2}, denotes the i-th sample set (i = 1 the positive samples, i = 2 the negative samples), and x_ir ∈ R^d, 1 ≤ i ≤ 2, 1 ≤ r ≤ L. Each x_ir is formed by taking the two-dimensional matrix of pixel values of one voiceprint spectrogram, concatenating its rows in order into a one-dimensional row vector, and transposing it into a one-dimensional column vector of length d. R^d denotes the d-dimensional real vector space, and L indicates that each sample set contains L voiceprint spectrograms, i.e., L column vectors.
Now, from the characteristics of the two sample classes, the voiceprint key extraction matrix W1 ∈ R^(d×dz) is trained so as to optimize a cost function J, where m1 is the positive-sample mean of the training samples and m2 the negative-sample mean. J reflects the distance difference, computed as Euclidean distance, between the training samples projected through the voiceprint key extraction matrix W1 and the projected means of the positive and negative sample sets.
Let H1 and H2 be the matrices defined by the cost function above. Solving for the eigenvalues and eigenvectors of the matrix (H1 - H2) yields the voiceprint key extraction matrix W1, namely (H1 - H2)·w = λ·w, where w is an eigenvector of the matrix (H1 - H2) and λ the corresponding eigenvalue.
Since {w1, w2, ..., w_dz} are the eigenvectors corresponding to the eigenvalues {λ1, λ2, ..., λ_dz}, where λ1 ≥ λ2 ≥ ... ≥ λ_dz ≥ 0, eigenvectors whose eigenvalues are less than 0 are not included in the construction of W1.
This completes the training of the voiceprint key extraction matrix W1.
Step two, voiceprint key extraction, as shown in fig. 4, the specific steps are as follows:
Step 1. The user records about 3 seconds of his or her text-related speech.
Step 2. Extract the voiceprint spectrogram; see the third step of step one.
Step 3. Preprocess the voiceprint spectrogram by filtering, normalization, and the like, convert it to matrix form, and concatenate the rows in order to obtain the voiceprint vector x_t.
Step 4. Left-multiply the voiceprint vector x_t from step 3 by the transpose of the voiceprint stable-feature learning matrix W1 trained in step one, i.e., compute W1^T · x_t, obtaining the dz-dimensional voiceprint feature vector x_tz, the stabilized voiceprint feature vector.
Step 5. Apply the checkerboard operation to each component of x_tz to obtain a further-stabilized voiceprint feature vector.
The checkerboard method is as follows:
Denote each component of x_tz as x_tzi.
The quantization formula is Λ(x_tzi) = round(x_tzi / D), i.e., each component is mapped to the index of the checkerboard grid point nearest to it, measured from the coordinate origin. Here D is the checkerboard grid size, a positive number whose specific value the user selects from experience; Λ(x) is an integer, generally between 0 and 63; x_tzi is a component of x_tz, and Λ(x_tzi) is its quantized value.
Step 6. Take the first 32 or 64 components of the vector computed in step 5 and concatenate them in order; each component takes an integer value in 0-63 and contributes 4 bits of key material, forming a 128-bit or 256-bit voiceprint key. This completes voiceprint key extraction.
The invention exploits the property that text-related speech from the same speaker yields highly similar voiceprint spectra: multiple spectrograms sampled from the same speaker reading the same text are highly similar, while spectrograms extracted from different speakers reading the same text differ markedly. After the voiceprint spectrograms are extracted, common feature information is learned from multiple spectrograms by the machine learning method shown in fig. 5, and the text-related voiceprint key is obtained after segmented quantization. The voiceprint key requires no biometric template to be stored on a server, offers higher security, can be combined with common network encryption and decryption algorithms such as AES (Advanced Encryption Standard) and RSA (Rivest-Shamir-Adleman), and is convenient to use. The method yields a stable voiceprint key: extraction accuracy exceeds 95%, and the key length can reach 256 bits.
Claims (2)
1. A text-related voiceprint key generation method, characterized in that: the method comprises voiceprint key training and voiceprint key extraction; voiceprint key training produces a voiceprint key extraction matrix from voiceprint samples collected in advance; in voiceprint key extraction, a voiceprint sample is preprocessed and then multiplied by the key extraction matrix obtained in training to yield the voiceprint key; the specific steps are as follows:
step one, voiceprint key training, which comprises the following specific steps:
firstly, the user records his or her own voice reading the same text, generally 1-3 consecutive words, more than 20 times, the number of repetitions being adjusted by the user according to the training situation;
secondly, recording more than 10 different users reading the same text, each repeating it more than 20 times; and recording more than 10 different users reading different texts of similar duration, each repeating more than 20 times;
thirdly, preprocessing the voices recorded in the first and second steps and extracting voiceprint spectrograms, the specific steps being:
1) pre-emphasis:
let S1(n) denote the speech time-domain signal, n = 0, 1, 2, ..., N-1; the pre-emphasis formula is S(n) = S1(n) - a·S1(n-1), where 0.9 < a < 1.0; a is the pre-emphasis coefficient controlling how strongly the amplitude is boosted;
2) framing, i.e., dividing the speech signal into frames;
3) Hamming window processing:
let S(n), n = 0, 1, 2, ..., N-1, denote the framed speech time-domain signal, i.e., the speech signal divided into N segments; multiplying by the Hamming window gives the time-domain signal S'(n), see formula (1):
S'(n) = S(n) · W(n)   (1)
where W(n) is the Hamming window function W(n) = (1 - a) - a·cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1, with a = 0.46; a may range over 0.3-0.7, the specific value being determined from experimental and empirical data; the Hamming window has a smoother low-pass characteristic and better reflects the frequency characteristics of the short-time speech signal;
4) fast Fourier transform (FFT):
performing a radix-2 FFT, a standard algorithm in the field, on the windowed speech time-domain signal S'(n) to obtain the linear spectrum X(n, k); X(n, k) is the spectral energy density function of the n-th speech frame, k indexes the spectrum bins, and each speech frame corresponds to a time slice on the time axis;
5) generating the text-related voiceprint spectrogram:
with time n as the time-axis coordinate and k as the frequency-axis coordinate, the value of |X(n, k)|² is rendered as a gray level at the corresponding coordinate point, forming the voiceprint spectrogram; applying the transform 10·log10(|X(n, k)|²) yields the dB representation of the spectrogram;
fourthly, filtering and normalizing the voiceprint spectrogram, the filtering being Gaussian filtering, wavelet filtering, or binarization, with the user selecting one or a combination of several according to actual test conditions;
fifthly, machine learning is carried out on the voiceprint spectrogram to obtain a voiceprint stable characteristic learning matrix, namely a voiceprint key extraction matrix;
dividing the voiceprint spectrograms obtained in the fourth step into two classes, namely the spectrograms of the user reading the related text, and the contrast spectrograms formed by mixing non-user readings of the related text with readings of unrelated texts, together called the positive and negative sample sets;
with M = [M1, M2] denoting the positive and negative sample sets participating in training, where M_i = [x_i1, x_i2, ..., x_iL], i ∈ {1, 2}, denotes the i-th sample set (i = 1 the positive samples, i = 2 the negative samples), and x_ir ∈ R^d, 1 ≤ i ≤ 2, 1 ≤ r ≤ L; each x_ir is formed by taking the two-dimensional matrix of pixel values of one voiceprint spectrogram, concatenating its rows in order into a one-dimensional row vector, and transposing it into a one-dimensional column vector of length d; R^d denotes the d-dimensional real vector space, and L indicates that each sample set contains L voiceprint spectrograms, i.e., L column vectors;
now, from the characteristics of the two sample classes, the voiceprint key extraction matrix W1 ∈ R^(d×dz) is trained so as to optimize a cost function J, where m1 is the positive-sample mean of the training samples and m2 the negative-sample mean; J reflects the distance difference, computed as Euclidean distance, between the training samples projected through the voiceprint key extraction matrix W1 and the projected means of the positive and negative sample sets;
letting H1 and H2 be the matrices defined by the cost function above, solving for the eigenvalues and eigenvectors of the matrix (H1 - H2) yields the voiceprint key extraction matrix W1, namely (H1 - H2)·w = λ·w, where w is an eigenvector of the matrix (H1 - H2) and λ the corresponding eigenvalue;
since {w1, w2, ..., w_dz} are the eigenvectors corresponding to the eigenvalues {λ1, λ2, ..., λ_dz}, where λ1 ≥ λ2 ≥ ... ≥ λ_dz ≥ 0, eigenvectors whose eigenvalues are less than 0 are not included in the construction of W1;
this completes the training of the voiceprint key extraction matrix W1;
Step two, voiceprint key extraction, which comprises the following specific steps:
step 1, the user recording about 3 seconds of his or her text-related speech;
step 2, extracting the voiceprint spectrogram, as in the third step of step one;
step 3, filtering and normalizing the voiceprint spectrogram, converting it to matrix form, and concatenating the rows in order to obtain the voiceprint vector x_t;
step 4, left-multiplying the voiceprint vector x_t from step 3 by the transpose of the voiceprint key extraction matrix W1 trained in step one, i.e., computing W1^T · x_t, to obtain the dz-dimensional voiceprint feature vector x_tz, the stabilized voiceprint feature vector;
step 5, applying the checkerboard operation to each component of x_tz to further stabilize the voiceprint feature vector;
the checkerboard method being as follows:
denote each component of x_tz as x_tzi;
the quantization formula is Λ(x_tzi) = round(x_tzi / D), i.e., each component is mapped to the index of the checkerboard grid point nearest to it, measured from the coordinate origin; D is the checkerboard grid size, a positive number whose specific value the user selects from experience; Λ(x) is an integer, generally between 0 and 63; x_tzi is a component of x_tz, and Λ(x_tzi) is its quantized value;
step 6, taking the first 32 or 64 components of the vector computed in step 5 and concatenating them in order, each component taking an integer value in 0-63 and contributing 4 bits of key material, so as to form a 128-bit or 256-bit voiceprint key; this completes voiceprint key extraction.
2. The text-related voiceprint key generation method according to claim 1, characterized in that: in the fourth step, the normalization means unifying the spectrograms to a fixed length and width and mapping each pixel value into the range 0-255, which can be realized with the imresize function in the MATLAB function library.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811139547.4A CN109326294B (en) | 2018-09-28 | 2018-09-28 | Text-related voiceprint key generation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811139547.4A CN109326294B (en) | 2018-09-28 | 2018-09-28 | Text-related voiceprint key generation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109326294A CN109326294A (en) | 2019-02-12 |
CN109326294B true CN109326294B (en) | 2022-09-20 |
Family
ID=65266096
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811139547.4A Active CN109326294B (en) | 2018-09-28 | 2018-09-28 | Text-related voiceprint key generation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109326294B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110322887B (en) * | 2019-04-28 | 2021-10-15 | 武汉大晟极科技有限公司 | Multi-type audio signal energy feature extraction method |
CN110223699B (en) * | 2019-05-15 | 2021-04-13 | 桂林电子科技大学 | Speaker identity confirmation method, device and storage medium |
CN111161705B (en) * | 2019-12-19 | 2022-11-18 | 寒武纪(西安)集成电路有限公司 | Voice conversion method and device |
CN112908303A (en) * | 2021-01-28 | 2021-06-04 | 广东优碧胜科技有限公司 | Audio signal processing method and device and electronic equipment |
CN113179157B (en) * | 2021-03-31 | 2022-05-17 | 杭州电子科技大学 | Text-related voiceprint biological key generation method based on deep learning |
CN113129897B (en) * | 2021-04-08 | 2024-02-20 | 杭州电子科技大学 | Voiceprint recognition method based on attention mechanism cyclic neural network |
- 2018-09-28: CN application CN201811139547.4A filed, patent CN109326294B (status: Active)
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2001092974A (en) * | 1999-08-06 | 2001-04-06 | Internatl Business Mach Corp <Ibm> | Speaker recognizing method, device for executing the same, method and device for confirming audio generation |
CN103971690A (en) * | 2013-01-28 | 2014-08-06 | 腾讯科技(深圳)有限公司 | Voiceprint recognition method and device |
CN103873254A (en) * | 2014-03-03 | 2014-06-18 | 杭州电子科技大学 | Method for generating human vocal print biometric key |
CN106128465A (en) * | 2016-06-23 | 2016-11-16 | 成都启英泰伦科技有限公司 | A kind of Voiceprint Recognition System and method |
CN107274890A (en) * | 2017-07-04 | 2017-10-20 | 清华大学 | Vocal print composes extracting method and device |
CN108198561A (en) * | 2017-12-13 | 2018-06-22 | 宁波大学 | A kind of pirate recordings speech detection method based on convolutional neural networks |
CN112786059A (en) * | 2021-03-11 | 2021-05-11 | 合肥市清大创新研究院有限公司 | Voiceprint feature extraction method and device based on artificial intelligence |
Non-Patent Citations (3)
Title |
---|
Research on a small-sample voiceprint recognition method under the TL-CNN-GAP model; Ding Dongbing; Computer Knowledge and Technology; 2018-08-25 (No. 24); full text *
Application of PCNN-based spectrogram feature extraction in speaker recognition; Ma Yide et al.; Computer Engineering and Applications; 2006-08-01 (No. 20); full text *
An identity authentication vector recognition method using spectrogram features; Feng Huizong et al.; Journal of Chongqing University; 2017-05-15 (No. 05); full text *
Also Published As
Publication number | Publication date |
---|---|
CN109326294A (en) | 2019-02-12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |