CN113223503B - Core training voice selection method based on test feedback - Google Patents
Core training voice selection method based on test feedback
- Publication number
- CN113223503B (application CN202110473842.9A)
- Authority
- CN
- China
- Prior art keywords
- training
- voice
- speech
- voices
- real
- Prior art date
- Legal status
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
Abstract
The invention discloses a core training voice selection method based on test feedback. A reference model is first trained on the test voices already acquired; likelihood scores of the original training voices are then computed on this reference model, the training voices of each class are ranked by score, and a fixed proportion of the top-ranked voices in each class is selected as the core training voices. With this data selection method, high-quality training voices can be screened according to feedback from test results, and because the resulting core training voices incorporate feedback from the actual application, future recognition performance improves. The method is suitable for voice classification scenarios such as speech recognition, speaker recognition, and fake voice recognition.
Description
Technical Field
The invention belongs to the technical field of voice recognition, and particularly relates to a core training voice selection method based on test feedback.
Background
As a biometric authentication modality, voiceprint authentication offers low acquisition cost, easy collection, and convenient remote authentication, and is widely used in access control systems, financial transactions, judicial forensics, and other fields. The rapid development of speech synthesis technology has, on the one hand, brought people more convenient services and a better user experience, such as lifelike intelligent customer service, intelligent navigation, audiobook narration, and intelligent voice calls; on the other hand, it poses a serious challenge to the security of voiceprint authentication systems: attacks using synthesized speech markedly degrade their performance. Research on synthesized speech detection is therefore of great significance.
The purpose of synthesized speech detection is to distinguish synthesized speech from real speech. Existing experimental research on synthesized speech detection trains on the training set provided by a challenge, usually using a large amount of training data. In practice, however, using more training data can actually degrade performance, because the training data contain redundancy, so data selection is necessary. A scenario often encountered in real engineering problems is that testing proceeds in stages: a small portion of the test data is available at the start, which amounts to prior knowledge about the test environment. How to use this small portion of test data to select training data and obtain a better model, so as to achieve better performance in the subsequent test stages, is a practical problem worth examining.
Disclosure of Invention
After a voice classification system has accumulated a certain amount of test data in actual operation, how should the classification model be updated with those data so that future recognition performance improves? Addressing this problem, the invention provides a core training voice selection method based on test feedback, which uses the available test data to select high-quality core training voices so that the model performs better while using fewer training voices, which both saves training time and energy and improves recognition performance.
A core training voice selection method based on test feedback comprises the following steps:
S1, training with the known portion of the test voices to obtain reference models;
S2, computing the matching score of every training voice on the corresponding reference model;
S3, sorting the training voices within each class set by model score;
S4, selecting a fixed proportion of the top-ranked training voices in each class as the core training voices.
Further, step S1 is implemented as follows: for an N-class voice classification task, the known portion of the test voices is divided into N sets according to class; features are extracted from the test voices in each set in turn and used to train a reference model for each class of voice, yielding N reference models, where N is a natural number greater than 1, namely the preset number of voice classes.
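A minimal Python sketch of step S1 is given below. It is an illustration rather than the patent's own implementation: scikit-learn's GaussianMixture stands in for the GMM training described later, and the input format (a dictionary mapping each class label to a list of per-utterance feature matrices) is an assumption made for this example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_reference_models(features_by_class, n_components=512):
    """Step S1: train one GMM reference model per voice class from the
    frame-level features of the known test voices of that class."""
    models = {}
    for label, feature_list in features_by_class.items():
        # feature_list holds one (num_frames, num_dims) array per utterance;
        # stack all frames of the class into a single training matrix.
        frames = np.vstack(feature_list)
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", max_iter=100)
        gmm.fit(frames)
        models[label] = gmm
    return models
```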
Further, step S2 is implemented as follows: the original training voices are first grouped by voice class to obtain N class-wise training voice sets; the features of the original training voices in each set are then extracted in turn and input into the reference model of the corresponding class, and the matching score of each training voice, i.e., its model score, is computed and output. The feature extraction in this step is identical to that used when training the reference models in step S1.
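Continuing the sketch above, step S2 can be expressed as follows. GaussianMixture.score() returns the average per-frame log-likelihood, which is used here as the matching score; whether scores are summed or averaged over frames is an assumption, since the patent does not specify.

```python
def score_training_voices(train_features_by_class, models):
    """Step S2: matching score of every original training voice on the
    reference model of its own class."""
    scores = {}
    for label, feature_list in train_features_by_class.items():
        gmm = models[label]
        # One average per-frame log-likelihood per training utterance.
        scores[label] = [gmm.score(frames) for frames in feature_list]
    return scores
```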
Further, step S3 is implemented as follows: based on the matching scores of all training voices obtained in step S2, the training voices within each class-wise training voice set are sorted by model score from largest to smallest.
Further, step S4 is implemented as follows: following the orderings within the class-wise training voice sets obtained in step S3, the top-ranked training voices are selected in a preset proportion as the core training voices.
With the core voice selection method provided by the invention, the original training voices can be selected according to feedback from the available test results; because the resulting core training voices incorporate feedback from the actual application, future recognition performance improves.
Drawings
FIG. 1 is a flowchart of the steps of the core training voice selection method in a staged test scenario according to the present invention.
Detailed Description
The invention applies to voice classification scenarios such as speech recognition, speaker recognition, and forged voice recognition. To aid understanding, the following detailed description covers only the specific embodiment of selecting core training speech for synthesized speech detection; these descriptions are intended solely to further illustrate the features and advantages of the invention, not to limit its claims.
The experimental data in this embodiment are the logical access database of the 2019 Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2019-LA), the database of the 2015 challenge (ASVspoof 2015), and a real-scene synthesized speech detection data set (RS-SSD).
The ASVspoof challenge is jointly organized by several world-leading research institutions, including the University of Edinburgh (UK), EURECOM (France), NEC (Japan), and the University of Eastern Finland. The real speech in ASVspoof 2019 comes from 107 speakers (61 female, 46 male). The data set is divided into three parts: a training set (Train), a development set (Dev), and an evaluation set (Eval); the recording environment is quiet, with no obvious channel or environmental noise. The spoofed voices of the training and development sets are generated from the real voices by various algorithms. The training set contains 20 speakers (12 female, 8 male), with 2,580 real utterances and 22,800 spoofed utterances; the development set contains 20 speakers (12 female, 8 male), with 2,548 real utterances and 22,296 spoofed utterances; the evaluation set contains 67 speakers (37 female, 30 male), with 7,355 real utterances and 63,882 spoofed utterances, and is about 4 GB in size.
The real speech in ASVspoof 2015 comes from 106 speakers (61 female, 45 male). The data set is likewise divided into a training set (Train), a development set (Dev), and an evaluation set (Eval); the recording environment is quiet, with no obvious channel or environmental noise. The spoofed voices of the training and development sets are generated from the real voices by various algorithms. The training set contains 25 speakers (15 female, 10 male), with 3,750 real utterances and 12,625 spoofed utterances; the development set contains 35 speakers (20 female, 15 male), with 2,497 real utterances and 49,875 spoofed utterances; the evaluation set contains 46 speakers (26 female, 20 male), with about 200,000 test utterances, and is about 20 GB in size.
The Real-Scene Synthesized Speech Detection data set, abbreviated RS-SSD, contains synthesized speech from Google, Tencent, and Baidu, together with synthesized speech of the artificial intelligence (AI) anchor of Xinhua News Agency, for a total duration of 4.12 hours. An equal duration of real speech comprises real speech from online media videos, real speech from Xinhua News Agency news videos, and real speech from two databases: the Chinese emotional speech corpus MASC released by the CCNT laboratory of Zhejiang University and the open-source Mandarin speech database AISHELL-1 provided by AISHELL. The speech content of each category is varied, covering scenarios such as news reporting, smart homes, autonomous driving, and industrial production.
As shown in FIG. 1, the core training voice selection method based on test feedback of the present invention comprises the following steps:
S1, training with the known portion of the test utterances to obtain reference model parameters;
S2, computing the likelihood score of every training utterance on the reference models;
S3, sorting the likelihood scores of the real utterances in descending order and those of the synthesized utterances in ascending order;
S4, selecting the top-ranked real utterances and synthesized utterances to form the training set.
Step S1 is implemented as follows: in synthesized speech detection, first define the real speech training corpus as X_genuine, the spoofed speech training corpus as X_spoof, the numbers of voices to be selected as M_genuine and M_spoof, the known partial test voices as Q_genuine and Q_spoof, and the selected voice sets as C_genuine and C_spoof.
For a known partial test voice set Q, its feature data are obtained; the 32-order LFCCs of the voice plus first-order and second-order delta features may be used. With these feature data, a GMM with K Gaussian components is trained, and this GMM serves as a reference model for the subsequent selection of training speech, giving the real speech reference model λ_genuine and the synthesized speech reference model λ_spoof.
the training of the GMM is a supervised optimization process, typically using maximum likelihood criteria. The whole process is divided into two parts of parameter initialization and parameter optimization, wherein the parameter initialization usually uses an LBG algorithm, and the parameter optimization usually uses an EM algorithm. Since the GMM training and the speech feature obtaining method are commonly applied to the existing synthesized speech detection system, they are not described in more detail here. For the choice of GMM model order K, typically a power of 2 such as 64, 128, 512, 1024, etc., it was found experimentally that the 512 order GMM synthesized speech detection system performs better for the 96-dimensional LFCC features used.
Step S2 is implemented as follows: feature data are obtained for all training voices, with the feature extraction identical to that used for the data on which the GMM reference models were trained in step S1. Then, for each training voice x, log-likelihood scores are computed on the real speech reference model λ_genuine and the synthesized speech reference model λ_spoof, and their difference is taken to obtain the log-likelihood score ratio y = log p(x | λ_genuine) − log p(x | λ_spoof).
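As a sketch, the score ratio reduces to a one-line difference of the two model scores (again using the mean frame log-likelihood from scikit-learn rather than a summed likelihood, which is an assumption of this example):

```python
def log_likelihood_ratio(frames, gmm_genuine, gmm_spoof):
    """y = log p(x | genuine model) - log p(x | spoof model) for one
    utterance, where frames is its (num_frames, num_dims) feature matrix."""
    return gmm_genuine.score(frames) - gmm_spoof.score(frames)
```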
Step S3 is implemented as follows: the log-likelihood score ratios of all real voices obtained in step S2 are sorted in descending order to give y = {y_1, y_2, …, y_n}, and those of all synthesized voices are sorted in ascending order to give y′ = {y′_1, y′_2, …, y′_n}.
Step S4 is implemented as follows: from the ranked training voices y and y′ obtained in step S3, the top M_genuine real utterances and the top M_spoof synthesized utterances are selected and added to the selected voice sets C_genuine and C_spoof respectively, i.e., C_genuine = {y_1, …, y_M_genuine} and C_spoof = {y′_1, …, y′_M_spoof}.
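Steps S3 and S4 can then be sketched as a rank-and-truncate operation; the (utterance_id, score) pair format is an assumption of this example:

```python
def select_core_training_set(genuine, spoof, m_genuine, m_spoof):
    """Rank genuine voices by descending LLR and spoofed voices by
    ascending LLR, then keep the top M of each as core training voices.
    `genuine` and `spoof` are lists of (utterance_id, llr_score) pairs."""
    ranked_genuine = sorted(genuine, key=lambda p: p[1], reverse=True)
    ranked_spoof = sorted(spoof, key=lambda p: p[1])
    c_genuine = [uid for uid, _ in ranked_genuine[:m_genuine]]
    c_spoof = [uid for uid, _ in ranked_spoof[:m_spoof]]
    return c_genuine, c_spoof
```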
All voices in the evaluation set were then tested. The experiments were based on the GMM system and compared the equal error rate (EER) of the selection algorithm proposed by the invention against using all data and against random selection, as shown in Table 1:
TABLE 1
As Table 1 shows, the invention improves system recognition performance to a certain extent and outperforms the random selection method. Compared with the original approach of training on all data, when only 1/3 of the data are selected for training the EER improves by 0.63, 0.70, and 2.00 percentage points on the three data sets respectively, and when only 1/2 of the data are selected it improves by 0.66, 1.24, and 4.07 percentage points respectively.
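For reference, the equal error rate used in Table 1 can be computed as below; this is the standard EER definition, not the authors' evaluation code:

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(genuine_scores, spoof_scores):
    """EER: the operating point where the false-acceptance rate equals
    the false-rejection rate."""
    scores = np.concatenate([genuine_scores, spoof_scores])
    labels = np.concatenate([np.ones(len(genuine_scores)),
                             np.zeros(len(spoof_scores))])
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0
```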
The foregoing description of the embodiments is provided to enable a person of ordinary skill in the art to make and use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without inventive effort. The invention is therefore not limited to the embodiments above; improvements and modifications made by those skilled in the art based on this disclosure fall within the protection scope of the invention.
Claims (1)
1. A core training voice selection method based on test feedback comprises the following steps:
s1, training with known partial test utterances to obtain reference model parameters, specifically: first, define the real speech training corpus as X_genuine, the synthesized speech training corpus as X_spoof, the number of real voices to be selected as M_genuine, the number of synthesized voices to be selected as M_spoof, the known partial real test voices as Q_genuine, the known partial synthesized test voices as Q_spoof, the selected real voice set as C_genuine, and the selected synthesized voice set as C_spoof;
For Q_genuine, feature data are obtained comprising the 32-order LFCCs of the voice plus first-order and second-order delta features; a GMM with K Gaussian components is trained on these feature data and serves as the real speech reference model λ_genuine for the subsequent selection of training speech;
For Q_spoof, training in the same way yields the synthesized speech reference model λ_spoof for selecting training speech;
s2, computing, for each training voice x, log-likelihood scores on λ_genuine and λ_spoof and taking their difference to obtain the log-likelihood score ratio y = log p(x | λ_genuine) − log p(x | λ_spoof);
s3, sorting all real voices in descending order of their log-likelihood score ratios to obtain the real voice sequence y = {y_1, y_2, …, y_n}, and sorting all synthesized voices in ascending order of their log-likelihood score ratios to obtain the synthesized voice sequence y′ = {y′_1, y′_2, …, y′_n};
s4, selecting the top-ranked M_genuine real voices and M_spoof synthesized voices and adding them to C_genuine and C_spoof respectively to form the core training set.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2020103568572 | 2020-04-29 | ||
CN202010356857 | 2020-04-29 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113223503A CN113223503A (en) | 2021-08-06 |
CN113223503B true CN113223503B (en) | 2022-06-14 |
Family
ID=77090159
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110473842.9A Active CN113223503B (en) | 2020-04-29 | 2021-04-29 | Core training voice selection method based on test feedback |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113223503B (en) |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5477451A (en) * | 1991-07-25 | 1995-12-19 | International Business Machines Corp. | Method and system for natural language translation |
JP5228283B2 (en) * | 2006-04-19 | 2013-07-03 | カシオ計算機株式会社 | Speech synthesis dictionary construction device, speech synthesis dictionary construction method, and program |
JP6148150B2 (en) * | 2013-10-23 | 2017-06-14 | 日本電信電話株式会社 | Acoustic analysis frame reliability calculation device, acoustic model adaptation device, speech recognition device, their program, and acoustic analysis frame reliability calculation method |
BR112018014689A2 (en) * | 2016-01-22 | 2018-12-11 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e. V. | apparatus and method for encoding or decoding a multichannel signal using a broadband alignment parameter and a plurality of narrowband alignment parameters |
CN108304390B (en) * | 2017-12-15 | 2020-10-16 | 腾讯科技(深圳)有限公司 | Translation model-based training method, training device, translation method and storage medium |
- 2021-04-29: application CN202110473842.9A filed; granted as patent CN113223503B (status: active)
Also Published As
Publication number | Publication date |
---|---|
CN113223503A (en) | 2021-08-06 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |