CN109326294B - Text-related voiceprint key generation method - Google Patents
Text-related voiceprint key generation method
- Publication number
- CN109326294B CN109326294B CN201811139547.4A CN201811139547A CN109326294B CN 109326294 B CN109326294 B CN 109326294B CN 201811139547 A CN201811139547 A CN 201811139547A CN 109326294 B CN109326294 B CN 109326294B
- Authority
- CN
- China
- Prior art keywords
- voiceprint
- key
- spectrogram
- matrix
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/08—Key distribution or management, e.g. generation, sharing or updating, of cryptographic keys or passwords
- H04L9/0861—Generation of secret information including derivation or calculation of cryptographic keys or passwords
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/08—Key distribution or management, e.g. generation, sharing or updating, of cryptographic keys or passwords
- H04L9/0861—Generation of secret information including derivation or calculation of cryptographic keys or passwords
- H04L9/0866—Generation of secret information including derivation or calculation of cryptographic keys or passwords involving user or device identifiers, e.g. serial number, physical or biometrical information, DNA, hand-signature or measurable physical characteristics
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Complex Calculations (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The invention relates to a text-related voiceprint key generation method comprising voiceprint key training and voiceprint key extraction. Voiceprint key training produces a voiceprint key extraction matrix from voiceprint samples collected in advance. In voiceprint key extraction, a voiceprint sample is preprocessed and then multiplied by the key extraction matrix obtained in training to yield the voiceprint key. The invention uses the spectrogram of the speaker's text-related speech to express the speaker's voice characteristics more fully while keeping successive samples highly similar. On this basis, a machine learning method trains a stable voiceprint feature extraction matrix from multiple spectrograms; processing subsequent samples with this matrix extracts a more stable voiceprint key. The method is stable, simple, and convenient to use.
Description
Technical Field
The invention belongs to the technical field of cyberspace security and relates to a text-related voiceprint key generation method.
Background
Voiceprint recognition is a mature biometric recognition technology. With the rapid development of artificial intelligence in recent years, voiceprint recognition accuracy has improved considerably and can exceed 96% in low-noise environments, and the technology is widely applied in identity authentication scenarios.
As voiceprint technology has been applied more deeply, the field has attempted to extract a stable digital sequence directly from a human voiceprint for use as a biometric key, i.e., to generate keys directly from the voiceprint. Such keys integrate seamlessly with existing password and public/private-key cryptography, avoid the inconvenience and potential security problems of collecting and storing voiceprints, and further enrich the means and methods of network authentication.
Voiceprint biometric key technology has already received some study. For example, Chinese patent ZL201110003202.8, a voiceprint-based document encryption and decryption method, gives a scheme for extracting a stable key sequence from voiceprint information, but it stabilizes the voiceprint feature values only with a checkerboard method, so the stabilization effect is limited and the key length is insufficient. Chinese patent ZL201410074511.8, a method for generating a human voiceprint biometric key, gives a technical route that extracts a voiceprint Gaussian model and projects the model's feature parameters into a high-dimensional space to obtain a stable voiceprint key. The stability of the key obtained by this scheme is clearly better than that of the earlier patent, but for key authentication environments with high stability requirements, it still needs further improvement.
Disclosure of Invention
The invention aims to provide a text-related voiceprint key generation method.
The method comprises voiceprint key training and voiceprint key extraction. Voiceprint key training produces a voiceprint key extraction matrix from voiceprint samples collected in advance. In voiceprint key extraction, a voiceprint sample is preprocessed and then multiplied by the key extraction matrix obtained in training to yield the voiceprint key. The specific steps are as follows:
step one, voiceprint key training, which comprises the following specific steps:
Firstly, the user records his or her own voice reading the same text, generally 1-3 consecutive words, more than 20 times; the number of repetitions can be adjusted by the user according to the training situation.
Secondly, record more than 10 different users reading the same text, each repeating it more than 20 times; and record more than 10 different users reading different texts of similar duration, each repeating more than 20 times.
Thirdly, preprocess the voices recorded in the first and second steps and extract voiceprint spectrograms; the specific steps are as follows:
1) Pre-emphasis (Pre-Emphasis):
Let S1(n) denote the speech time-domain signal, n = 0, 1, 2, ..., N-1. The pre-emphasis formula is S(n) = S1(n) - a·S1(n-1), where 0.9 < a < 1.0; a is the pre-emphasis coefficient that controls how strongly the amplitude is boosted.
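For illustration only, a minimal NumPy sketch of this pre-emphasis step is given below; the concrete coefficient a = 0.97 is an assumed value inside the stated range 0.9 < a < 1.0, not a value the patent fixes.

```python
import numpy as np

def pre_emphasis(s1: np.ndarray, a: float = 0.97) -> np.ndarray:
    """Apply S(n) = S1(n) - a * S1(n-1); the first sample is passed through."""
    return np.append(s1[0], s1[1:] - a * s1[:-1])
```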
2) Framing, i.e., dividing the speech signal into frames.
3) Hamming window (Hamming Window) processing:
Let S(n), n = 0, 1, 2, ..., N-1, denote the framed speech time-domain signal, i.e., the speech signal divided into N segments. Multiplying by the Hamming window gives the time-domain signal S'(n); see formula (1):
S'(n) = S(n) · W(n)   (1)
where W(n) is the Hamming window function W(n) = (1 - a) - a·cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1, with a = 0.46; a may range over 0.3-0.7, the specific value being determined from experimental and empirical data. The Hamming window has a smoother low-pass characteristic and better reflects the frequency characteristics of the short-time speech signal.
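The framing and windowing of steps 2) and 3) can be sketched as follows, assuming illustrative framing parameters (25 ms frames with a 10 ms hop at a 16 kHz sampling rate); the patent does not fix the frame length or overlap.

```python
import numpy as np

def frame_and_window(signal: np.ndarray, frame_len: int = 400,
                     hop: int = 160, a: float = 0.46) -> np.ndarray:
    """Split the signal into frames, then apply the Hamming window
    W(n) = (1 - a) - a * cos(2*pi*n / (N - 1)) to each frame."""
    n = np.arange(frame_len)
    window = (1 - a) - a * np.cos(2 * np.pi * n / (frame_len - 1))
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    return frames * window  # S'(n) = S(n) * W(n), per frame
```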
4) Fast Fourier Transform (FFT):
Perform a radix-2 FFT, a standard algorithm in the field, on the windowed speech time-domain signal S'(n) to obtain the linear spectrum X(n, k). X(n, k) is the spectral energy density function of the n-th speech frame; k indexes the spectrum bins, and each speech frame corresponds to a time slice on the time axis.
5) Generating the text-related voiceprint spectrogram:
With time n as the time-axis coordinate and k as the frequency-axis coordinate, the value of |X(n, k)|² is rendered as a gray level at the corresponding coordinate point, forming the voiceprint spectrogram. Applying the transform 10·log10(|X(n, k)|²) yields the dB representation of the spectrogram.
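Continuing the sketch, the per-frame radix-2 FFT and the dB spectrogram of steps 4) and 5) may be computed as below; the FFT size of 512 is an assumed value.

```python
import numpy as np

def spectrogram_db(frames: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Linear spectrum X(n, k) per windowed frame, then 10*log10(|X(n,k)|^2)."""
    X = np.fft.rfft(frames, n=n_fft, axis=1)  # rows: frames n, columns: bins k
    power = np.abs(X) ** 2                    # |X(n, k)|^2
    return 10.0 * np.log10(power + 1e-12)     # small offset avoids log(0)
```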
Fourthly, preprocess the voiceprint spectrogram by filtering, normalization, and the like. Suitable filters include the general signal-processing options of Gaussian filtering, wavelet filtering, and binarization; which one, or which combination, is used is chosen by the user according to actual test conditions. Normalization means unifying the spectrograms to a fixed length and width and mapping each pixel value into the range 0-255; any general method in the field may be used, for example the imresize function in the MATLAB function library for image resizing.
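A possible form of the normalization step, using Pillow's resize in place of MATLAB's imresize; the 256 x 256 target size is an assumption, since the patent requires only some fixed length and width.

```python
import numpy as np
from PIL import Image

def normalize_spectrogram(spec: np.ndarray,
                          size: tuple = (256, 256)) -> np.ndarray:
    """Map spectrogram values into 0-255 and resize to a fixed width/height."""
    lo, hi = spec.min(), spec.max()
    gray = np.uint8(255 * (spec - lo) / (hi - lo + 1e-12))
    return np.asarray(Image.fromarray(gray).resize(size))
```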
Fifthly, perform machine learning on the voiceprint spectrograms to obtain the voiceprint stable-feature learning matrix, i.e., the voiceprint key extraction matrix.
The voiceprint spectrograms obtained in the fourth step are divided into two classes: the spectrograms of the user reading the related text, and the contrast spectrograms formed by mixing non-user readings of the related text with readings of unrelated texts. Together these are called the positive and negative sample sets.
Let M = [M1, M2] denote the positive and negative sample sets participating in training, where M_i = [x_i1, x_i2, ..., x_iL], i ∈ {1, 2}, denotes the i-th sample set (i = 1 the positive samples, i = 2 the negative samples), and x_ir ∈ R^d, 1 ≤ i ≤ 2, 1 ≤ r ≤ L. Each x_ir is formed by taking the two-dimensional matrix of pixel values of one voiceprint spectrogram, concatenating its rows in order into a one-dimensional row vector, and transposing it into a one-dimensional column vector of length d. R^d denotes the d-dimensional real vector space, and L indicates that each sample set contains L voiceprint spectrograms, i.e., L column vectors.
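Building the sample matrices M1 and M2 then amounts to flattening each normalized spectrogram row by row into a column vector, for example:

```python
import numpy as np

def build_sample_matrix(spectrograms: list) -> np.ndarray:
    """Stack row-major-flattened spectrograms as columns: a d x L matrix M_i."""
    return np.stack([s.flatten() for s in spectrograms], axis=1)

# M1 = build_sample_matrix(positive_specs)
# M2 = build_sample_matrix(negative_specs)
```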
Now, from the characteristics of the two sample classes, the voiceprint key extraction matrix W1 ∈ R^(d×dz) is trained so as to optimize a cost function J, where m1 is the positive-sample mean of the training samples and m2 the negative-sample mean. J reflects the distance difference, computed as Euclidean distance, between the training samples projected through the voiceprint key extraction matrix W1 and the projected means of the positive and negative sample sets.
Let H1 and H2 be the matrices defined by the cost function above. Solving for the eigenvalues and eigenvectors of the matrix (H1 - H2) yields the voiceprint key extraction matrix W1, namely (H1 - H2)·w = λ·w, where w is an eigenvector of the matrix (H1 - H2) and λ the corresponding eigenvalue.
Since {w1, w2, ..., w_dz} are the eigenvectors corresponding to the eigenvalues {λ1, λ2, ..., λ_dz}, where λ1 ≥ λ2 ≥ ... ≥ λ_dz ≥ 0, eigenvectors whose eigenvalues are less than 0 are not included in the construction of W1.
This completes the training of the voiceprint key extraction matrix W1.
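The formulas defining J, H1, and H2 appear as equation images in the source and are not reproduced here. Purely as a sketch, one construction consistent with the stated criterion takes H1 as the scatter of the positive samples about the negative-sample mean and H2 as their scatter about the positive-sample mean; for real spectrogram sizes d is large, so this dense eigendecomposition is illustrative only.

```python
import numpy as np

def train_extraction_matrix(M1: np.ndarray, M2: np.ndarray) -> np.ndarray:
    """Hypothetical reconstruction of the training step: keep eigenvectors of
    (H1 - H2) with non-negative eigenvalues, in decreasing eigenvalue order."""
    m1 = M1.mean(axis=1, keepdims=True)  # positive-sample mean
    m2 = M2.mean(axis=1, keepdims=True)  # negative-sample mean
    A1 = M1 - m2                         # positive samples about negative mean
    A2 = M1 - m1                         # positive samples about positive mean
    H1, H2 = A1 @ A1.T, A2 @ A2.T
    lam, w = np.linalg.eigh(H1 - H2)     # (H1 - H2) w = lambda w
    order = np.argsort(lam)[::-1]        # sort eigenvalues descending
    lam, w = lam[order], w[:, order]
    return w[:, lam >= 0]                # W1 = [w_1, ..., w_dz]
```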
Step two, voiceprint key extraction, which comprises the following specific steps:
Step 1. The user records about 3 seconds of his or her text-related speech.
Step 2. Extract the voiceprint spectrogram; see the third step of step one.
Step 3. Preprocess the voiceprint spectrogram by filtering, normalization, and the like, convert it to matrix form, and concatenate the rows in order to obtain the voiceprint vector x_t.
Step 4. Left-multiply the voiceprint vector x_t from step 3 by the transpose of the voiceprint stable-feature learning matrix W1 trained in step one, i.e., compute W1^T · x_t, obtaining the dz-dimensional voiceprint feature vector x_tz, the stabilized voiceprint feature vector.
Step 5. Apply the checkerboard operation to each component of x_tz to obtain a further-stabilized voiceprint feature vector.
The checkerboard method is as follows:
Denote each component of x_tz as x_tzi.
The quantization formula is Λ(x_tzi) = round(x_tzi / D), i.e., each component is mapped to the index of the checkerboard grid point nearest to it, measured from the coordinate origin. Here D is the checkerboard grid size, a positive number whose specific value the user selects from experience; Λ(x) is an integer, generally between 0 and 63; x_tzi is a component of x_tz, and Λ(x_tzi) is its quantized value.
Step 6. Take the first 32 or 64 components of the vector computed in step 5 and concatenate them in order; each component takes an integer value in 0-63 and contributes 4 bits of key material, forming a 128-bit or 256-bit voiceprint key. This completes voiceprint key extraction.
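A sketch of the complete extraction chain follows; the grid size D, the round-to-nearest quantizer (the reconstruction given above), and the way 4 bits are taken from each component are assumptions rather than values the patent fixes.

```python
import numpy as np

def extract_key_bits(W1: np.ndarray, x_t: np.ndarray, D: float = 4.0,
                     n_components: int = 32) -> str:
    """Project the voiceprint vector, checkerboard-quantize each component,
    and pack 4 bits per component: 32 components give a 128-bit key."""
    x_tz = W1.T @ x_t                                 # stabilized feature vector
    q = np.rint(x_tz[:n_components] / D).astype(int)  # Lambda(x) = round(x / D)
    return "".join(format(int(v) & 0xF, "04b") for v in q)
```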
The invention uses the spectrogram of the speaker's text-related speech to express the speaker's voice characteristics more fully while keeping successive samples highly similar. On this basis, a machine learning method trains a stable voiceprint feature extraction matrix from multiple spectrograms, and processing subsequent samples with this matrix extracts a more stable voiceprint key. The method is stable, simple, and convenient to use.
Drawings
FIG. 1 is a flowchart of voiceprint key training in accordance with the present invention;
FIG. 2 is a flowchart of voiceprint spectrogram generation in accordance with the present invention;
FIG. 3 is a spectrogram of the voiceprint of the present invention;
FIG. 4 is a flowchart of voiceprint key extraction in accordance with the present invention.
FIG. 5 is a schematic diagram of machine learning of voiceprint features in accordance with the present invention.
Detailed Description
A text-related voiceprint key generation method comprises voiceprint key training and voiceprint key extraction. Voiceprint key training produces a voiceprint key extraction matrix from voiceprint samples collected in advance. In voiceprint key extraction, the voiceprint sample is preprocessed and then multiplied by the key extraction matrix obtained in training to yield the voiceprint key. The specific steps are as follows:
step one, voiceprint key training, as shown in fig. 1, the specific steps are:
In the first step, the user records his or her own voice reading the same text, generally 1-3 consecutive words, more than 20 times (the number of repetitions can be adjusted by the user according to the training situation).
In the second step, record more than 10 different users reading the same text, each repeating it more than 20 times; and record more than 10 different users reading different texts of similar duration, each repeating more than 20 times.
In the third step, preprocess the voices recorded in the first and second steps, as shown in figs. 2 and 3, and extract voiceprint spectrograms; the specific steps are as follows:
1) Pre-emphasis (Pre-Emphasis):
Let S1(n) denote the speech time-domain signal, n = 0, 1, 2, ..., N-1. The pre-emphasis formula is S(n) = S1(n) - a·S1(n-1), where 0.9 < a < 1.0; a is the pre-emphasis coefficient that controls how strongly the amplitude is boosted.
2) Framing, i.e., dividing the speech signal into frames.
3) Hamming window (Hamming Window) processing:
Let S(n), n = 0, 1, 2, ..., N-1, denote the framed speech time-domain signal, i.e., the speech signal divided into N segments. Multiplying by the Hamming window gives the time-domain signal S'(n); see formula (1):
S'(n) = S(n) · W(n)   (1)
where W(n) is the Hamming window function W(n) = (1 - a) - a·cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1, with a = 0.46; a may range over 0.3-0.7, the specific value being determined from experimental and empirical data. The Hamming window has a smoother low-pass characteristic and better reflects the frequency characteristics of the short-time speech signal.
4) Fast Fourier Transform (FFT):
Perform a radix-2 FFT, a standard algorithm in the field, on the windowed speech time-domain signal S'(n) to obtain the linear spectrum X(n, k). X(n, k) is the spectral energy density function of the n-th speech frame; k indexes the spectrum bins, and each speech frame corresponds to a time slice on the time axis.
5) Generating the text-related voiceprint spectrogram:
With time n as the time-axis coordinate and k as the frequency-axis coordinate, the value of |X(n, k)|² is rendered as a gray level at the corresponding coordinate point, forming the voiceprint spectrogram. Applying the transform 10·log10(|X(n, k)|²) yields the dB representation of the spectrogram.
Fourthly, preprocess the voiceprint spectrogram by filtering, normalization, and the like. Suitable filters include the general signal-processing options of Gaussian filtering, wavelet filtering, and binarization; which one, or which combination, is used is chosen by the user according to actual test conditions. Normalization means unifying the spectrograms to a fixed length and width and mapping each pixel value into the range 0-255; any general method in the field may be used, for example the imresize function in the MATLAB function library for image resizing.
Fifthly, perform machine learning on the voiceprint spectrograms to obtain the voiceprint stable-feature learning matrix, i.e., the voiceprint key extraction matrix.
The voiceprint spectrograms obtained in the fourth step are divided into two classes: the spectrograms of the user reading the related text, and the contrast spectrograms formed by mixing non-user readings of the related text with readings of unrelated texts. Together these are called the positive and negative sample sets.
Let M = [M1, M2] denote the positive and negative sample sets participating in training, where M_i = [x_i1, x_i2, ..., x_iL], i ∈ {1, 2}, denotes the i-th sample set (i = 1 the positive samples, i = 2 the negative samples), and x_ir ∈ R^d, 1 ≤ i ≤ 2, 1 ≤ r ≤ L. Each x_ir is formed by taking the two-dimensional matrix of pixel values of one voiceprint spectrogram, concatenating its rows in order into a one-dimensional row vector, and transposing it into a one-dimensional column vector of length d. R^d denotes the d-dimensional real vector space, and L indicates that each sample set contains L voiceprint spectrograms, i.e., L column vectors.
Now, from the characteristics of the two sample classes, the voiceprint key extraction matrix W1 ∈ R^(d×dz) is trained so as to optimize a cost function J, where m1 is the positive-sample mean of the training samples and m2 the negative-sample mean. J reflects the distance difference, computed as Euclidean distance, between the training samples projected through the voiceprint key extraction matrix W1 and the projected means of the positive and negative sample sets.
Let H1 and H2 be the matrices defined by the cost function above. Solving for the eigenvalues and eigenvectors of the matrix (H1 - H2) yields the voiceprint key extraction matrix W1, namely (H1 - H2)·w = λ·w, where w is an eigenvector of the matrix (H1 - H2) and λ the corresponding eigenvalue.
Since {w1, w2, ..., w_dz} are the eigenvectors corresponding to the eigenvalues {λ1, λ2, ..., λ_dz}, where λ1 ≥ λ2 ≥ ... ≥ λ_dz ≥ 0, eigenvectors whose eigenvalues are less than 0 are not included in the construction of W1.
This completes the training of the voiceprint key extraction matrix W1.
Step two, voiceprint key extraction, as shown in fig. 4, the specific steps are as follows:
Step 1. The user records about 3 seconds of his or her text-related speech.
Step 2. Extract the voiceprint spectrogram; see the third step of step one.
Step 3. Preprocess the voiceprint spectrogram by filtering, normalization, and the like, convert it to matrix form, and concatenate the rows in order to obtain the voiceprint vector x_t.
Step 4. Left-multiply the voiceprint vector x_t from step 3 by the transpose of the voiceprint stable-feature learning matrix W1 trained in step one, i.e., compute W1^T · x_t, obtaining the dz-dimensional voiceprint feature vector x_tz, the stabilized voiceprint feature vector.
Step 5. Apply the checkerboard operation to each component of x_tz to obtain a further-stabilized voiceprint feature vector.
The checkerboard method is as follows:
Denote each component of x_tz as x_tzi.
The quantization formula is Λ(x_tzi) = round(x_tzi / D), i.e., each component is mapped to the index of the checkerboard grid point nearest to it, measured from the coordinate origin. Here D is the checkerboard grid size, a positive number whose specific value the user selects from experience; Λ(x) is an integer, generally between 0 and 63; x_tzi is a component of x_tz, and Λ(x_tzi) is its quantized value.
Step 6. Take the first 32 or 64 components of the vector computed in step 5 and concatenate them in order; each component takes an integer value in 0-63 and contributes 4 bits of key material, forming a 128-bit or 256-bit voiceprint key. This completes voiceprint key extraction.
The invention exploits the property that text-related speech from the same speaker yields highly similar voiceprint spectra: multiple spectrograms sampled from the same speaker reading the same text are highly similar, while spectrograms extracted from different speakers reading the same text differ markedly. After the voiceprint spectrograms are extracted, common feature information is learned from multiple spectrograms by the machine learning method shown in fig. 5, and the text-related voiceprint key is obtained after segmented quantization. The voiceprint key requires no biometric template to be stored on a server, offers higher security, can be combined with common network encryption and decryption algorithms such as AES (Advanced Encryption Standard) and RSA (Rivest-Shamir-Adleman), and is convenient to use. The method yields a stable voiceprint key: extraction accuracy exceeds 95%, and the key length can reach 256 bits.
Claims (2)
1. A text-related voiceprint key generation method, characterized in that: the method comprises voiceprint key training and voiceprint key extraction; voiceprint key training produces a voiceprint key extraction matrix from voiceprint samples collected in advance; in voiceprint key extraction, a voiceprint sample is preprocessed and then multiplied by the key extraction matrix obtained in training to yield the voiceprint key; the specific steps are as follows:
step one, voiceprint key training, which comprises the following specific steps:
firstly, the user records his or her own voice reading the same text, generally 1-3 consecutive words, more than 20 times, the number of repetitions being adjusted by the user according to the training situation;
secondly, recording more than 10 different users reading the same text, each repeating it more than 20 times; and recording more than 10 different users reading different texts of similar duration, each repeating more than 20 times;
thirdly, preprocessing the voices recorded in the first and second steps and extracting voiceprint spectrograms, the specific steps being:
1) pre-emphasis:
let S1(n) denote the speech time-domain signal, n = 0, 1, 2, ..., N-1; the pre-emphasis formula is S(n) = S1(n) - a·S1(n-1), where 0.9 < a < 1.0; a is the pre-emphasis coefficient controlling how strongly the amplitude is boosted;
2) framing, i.e., dividing the speech signal into frames;
3) Hamming window processing:
let S(n), n = 0, 1, 2, ..., N-1, denote the framed speech time-domain signal, i.e., the speech signal divided into N segments; multiplying by the Hamming window gives the time-domain signal S'(n), see formula (1):
S'(n) = S(n) · W(n)   (1)
where W(n) is the Hamming window function W(n) = (1 - a) - a·cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1, with a = 0.46; a may range over 0.3-0.7, the specific value being determined from experimental and empirical data; the Hamming window has a smoother low-pass characteristic and better reflects the frequency characteristics of the short-time speech signal;
4) fast Fourier transform (FFT):
performing a radix-2 FFT, a standard algorithm in the field, on the windowed speech time-domain signal S'(n) to obtain the linear spectrum X(n, k); X(n, k) is the spectral energy density function of the n-th speech frame, k indexes the spectrum bins, and each speech frame corresponds to a time slice on the time axis;
5) generating the text-related voiceprint spectrogram:
with time n as the time-axis coordinate and k as the frequency-axis coordinate, the value of |X(n, k)|² is rendered as a gray level at the corresponding coordinate point, forming the voiceprint spectrogram; applying the transform 10·log10(|X(n, k)|²) yields the dB representation of the spectrogram;
fourthly, filtering and normalizing the voiceprint spectrogram, the filtering being Gaussian filtering, wavelet filtering, or binarization, with the user selecting one or a combination of several according to actual test conditions;
fifthly, machine learning is carried out on the voiceprint spectrogram to obtain a voiceprint stable characteristic learning matrix, namely a voiceprint key extraction matrix;
dividing the voiceprint spectrograms obtained in the fourth step into two classes, namely the spectrograms of the user reading the related text, and the contrast spectrograms formed by mixing non-user readings of the related text with readings of unrelated texts, together called the positive and negative sample sets;
with M = [M1, M2] denoting the positive and negative sample sets participating in training, where M_i = [x_i1, x_i2, ..., x_iL], i ∈ {1, 2}, denotes the i-th sample set (i = 1 the positive samples, i = 2 the negative samples), and x_ir ∈ R^d, 1 ≤ i ≤ 2, 1 ≤ r ≤ L; each x_ir is formed by taking the two-dimensional matrix of pixel values of one voiceprint spectrogram, concatenating its rows in order into a one-dimensional row vector, and transposing it into a one-dimensional column vector of length d; R^d denotes the d-dimensional real vector space, and L indicates that each sample set contains L voiceprint spectrograms, i.e., L column vectors;
now, from the characteristics of the two sample classes, the voiceprint key extraction matrix W1 ∈ R^(d×dz) is trained so as to optimize a cost function J, where m1 is the positive-sample mean of the training samples and m2 the negative-sample mean; J reflects the distance difference, computed as Euclidean distance, between the training samples projected through the voiceprint key extraction matrix W1 and the projected means of the positive and negative sample sets;
letting H1 and H2 be the matrices defined by the cost function above, solving for the eigenvalues and eigenvectors of the matrix (H1 - H2) yields the voiceprint key extraction matrix W1, namely (H1 - H2)·w = λ·w, where w is an eigenvector of the matrix (H1 - H2) and λ the corresponding eigenvalue;
since {w1, w2, ..., w_dz} are the eigenvectors corresponding to the eigenvalues {λ1, λ2, ..., λ_dz}, where λ1 ≥ λ2 ≥ ... ≥ λ_dz ≥ 0, eigenvectors whose eigenvalues are less than 0 are not included in the construction of W1;
this completes the training of the voiceprint key extraction matrix W1;
Step two, voiceprint key extraction, which comprises the following specific steps:
step 1, the user recording about 3 seconds of his or her text-related speech;
step 2, extracting the voiceprint spectrogram, as in the third step of step one;
step 3, filtering and normalizing the voiceprint spectrogram, converting it to matrix form, and concatenating the rows in order to obtain the voiceprint vector x_t;
step 4, left-multiplying the voiceprint vector x_t from step 3 by the transpose of the voiceprint key extraction matrix W1 trained in step one, i.e., computing W1^T · x_t, to obtain the dz-dimensional voiceprint feature vector x_tz, the stabilized voiceprint feature vector;
step 5, applying the checkerboard operation to each component of x_tz to further stabilize the voiceprint feature vector;
the checkerboard method being as follows:
denote each component of x_tz as x_tzi;
the quantization formula is Λ(x_tzi) = round(x_tzi / D), i.e., each component is mapped to the index of the checkerboard grid point nearest to it, measured from the coordinate origin; D is the checkerboard grid size, a positive number whose specific value the user selects from experience; Λ(x) is an integer, generally between 0 and 63; x_tzi is a component of x_tz, and Λ(x_tzi) is its quantized value;
step 6, taking the first 32 or 64 components of the vector computed in step 5 and concatenating them in order, each component taking an integer value in 0-63 and contributing 4 bits of key material, so as to form a 128-bit or 256-bit voiceprint key; this completes voiceprint key extraction.
2. The text-related voiceprint key generation method according to claim 1, characterized in that: in the fourth step, the normalization means unifying the spectrograms to a fixed length and width and mapping each pixel value into the range 0-255, which can be realized with the imresize function in the MATLAB function library.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811139547.4A CN109326294B (en) | 2018-09-28 | 2018-09-28 | Text-related voiceprint key generation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811139547.4A CN109326294B (en) | 2018-09-28 | 2018-09-28 | Text-related voiceprint key generation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109326294A CN109326294A (en) | 2019-02-12 |
CN109326294B true CN109326294B (en) | 2022-09-20 |
Family
ID=65266096
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811139547.4A Active CN109326294B (en) | 2018-09-28 | 2018-09-28 | Text-related voiceprint key generation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109326294B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110322887B (en) * | 2019-04-28 | 2021-10-15 | 武汉大晟极科技有限公司 | Multi-type audio signal energy feature extraction method |
CN110223699B (en) * | 2019-05-15 | 2021-04-13 | 桂林电子科技大学 | Speaker identity confirmation method, device and storage medium |
CN111161705B (en) * | 2019-12-19 | 2022-11-18 | 寒武纪(西安)集成电路有限公司 | Voice conversion method and device |
CN112908303A (en) * | 2021-01-28 | 2021-06-04 | 广东优碧胜科技有限公司 | Audio signal processing method and device and electronic equipment |
CN113179157B (en) * | 2021-03-31 | 2022-05-17 | 杭州电子科技大学 | Text-related voiceprint biological key generation method based on deep learning |
CN113129897B (en) * | 2021-04-08 | 2024-02-20 | 杭州电子科技大学 | Voiceprint recognition method based on attention mechanism cyclic neural network |
- 2018-09-28: CN application CN201811139547.4A filed, patent CN109326294B (status: Active)
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2001092974A (en) * | 1999-08-06 | 2001-04-06 | Internatl Business Mach Corp <Ibm> | Speaker recognizing method, device for executing the same, method and device for confirming audio generation |
CN103971690A (en) * | 2013-01-28 | 2014-08-06 | 腾讯科技(深圳)有限公司 | Voiceprint recognition method and device |
CN103873254A (en) * | 2014-03-03 | 2014-06-18 | 杭州电子科技大学 | Method for generating human vocal print biometric key |
CN106128465A (en) * | 2016-06-23 | 2016-11-16 | 成都启英泰伦科技有限公司 | A kind of Voiceprint Recognition System and method |
CN107274890A (en) * | 2017-07-04 | 2017-10-20 | 清华大学 | Vocal print composes extracting method and device |
CN108198561A (en) * | 2017-12-13 | 2018-06-22 | 宁波大学 | A kind of pirate recordings speech detection method based on convolutional neural networks |
CN112786059A (en) * | 2021-03-11 | 2021-05-11 | 合肥市清大创新研究院有限公司 | Voiceprint feature extraction method and device based on artificial intelligence |
Non-Patent Citations (3)
Title |
---|
Research on a small-sample voiceprint recognition method under the TL-CNN-GAP model; Ding Dongbing; Computer Knowledge and Technology; 2018-08-25 (No. 24); full text *
Application of PCNN-based spectrogram feature extraction in speaker recognition; Ma Yide et al.; Computer Engineering and Applications; 2006-08-01 (No. 20); full text *
An identity authentication vector recognition method using spectrogram features; Feng Huizong et al.; Journal of Chongqing University; 2017-05-15 (No. 05); full text *
Also Published As
Publication number | Publication date |
---|---|
CN109326294A (en) | 2019-02-12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |