CN111009238A - Spliced voice recognition method, device and equipment
- Publication number
- CN111009238A (application CN202010002558.9A)
- Authority
- CN
- China
- Prior art keywords
- spliced
- voice data
- speech
- convolutional neural
- long
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- Y02T10/40—Engine management systems (under Y02T—Climate change mitigation technologies related to transportation)
Abstract
The invention discloses a method, an apparatus, and a device for recognizing spliced speech. The method comprises the following steps: acquiring normal voice data of a user; cutting the normal voice data into a preset number of segments and splicing the segments in a shuffled, out-of-order sequence to obtain spliced voice data; constructing a binary classification model based on the normal voice data and the spliced voice data; training the binary classification model as a spliced-speech model using a long short-term memory network and a convolutional neural network; and recognizing spliced voice in voice data according to the trained binary classification model. In this way, spliced speech can be recognized and the security of voice verification ensured.
Description
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a method, an apparatus, and a device for recognizing spliced speech.
Background
In many real-life scenarios, users often need to pass voice verification, for example when logging in to a software program or a terminal device. However, an attacker may cut up recordings of another user's voice, splice the pieces into speech with specific audio content, and attempt to pass voice verification by impersonating the real user with the spliced speech, in order to illegally obtain benefits or perform illegal operations.
The prior art, however, cannot recognize such spliced speech, and therefore cannot guarantee the security of voice verification.
Disclosure of Invention
In view of this, the present invention provides a method, an apparatus, and a device for recognizing spliced speech, which can recognize spliced speech and thereby ensure the security of voice verification.
According to one aspect of the present invention, there is provided a method for recognizing spliced speech, including:
acquiring normal voice data of a user;
cutting the normal voice data into a preset number of segments, and splicing the segments in a shuffled, out-of-order sequence to obtain spliced voice data;
constructing a binary classification model based on the normal voice data and the spliced voice data;
training the binary classification model as a spliced-speech model using a long short-term memory network and a convolutional neural network;
and recognizing spliced voice in voice data according to the trained binary classification model.
Wherein constructing the binary classification model based on the normal voice data and the spliced voice data comprises:
extracting linear predictive analysis features and pitch features from the normal voice data and the spliced voice data respectively, performing a difference operation and a normalization operation on the features, and using the processed features as the training inputs of the long short-term memory network and the convolutional neural network, thereby constructing the binary classification model based on the normal voice data and the spliced voice data.
Wherein training the binary classification model as a spliced-speech model using the long short-term memory network and the convolutional neural network comprises:
extracting the acoustic features used by the binary classification model, inputting the extracted acoustic features into the long short-term memory network and the convolutional neural network, and training the spliced-speech model with these networks.
Wherein, after recognizing spliced voice in the voice data according to the trained binary classification model, the method further comprises:
updating the parameters of the long short-term memory network and the convolutional neural network using a cross-entropy loss function and an optimization algorithm, and retraining the binary classification model with the updated networks over a preset number of iterations.
According to another aspect of the present invention, there is provided an apparatus for recognizing spliced voice, comprising:
the system comprises an acquisition module, a splicing module, a construction module, a training module and an identification module;
the acquisition module is used for acquiring normal voice data of a user;
the splicing module is used for cutting the normal voice data into a preset number of segments and splicing the segments in a shuffled, out-of-order sequence to obtain spliced voice data;
the construction module is used for constructing a binary classification model based on the normal voice data and the spliced voice data;
the training module is used for training the binary classification model as a spliced-speech model using a long short-term memory network and a convolutional neural network;
and the recognition module is used for recognizing spliced voice in voice data according to the trained binary classification model.
Wherein the construction module is specifically configured to:
extract linear predictive analysis features and pitch features from the normal voice data and the spliced voice data respectively, perform a difference operation and a normalization operation on the features, and use the processed features as the training inputs of the long short-term memory network and the convolutional neural network, thereby constructing the binary classification model based on the normal voice data and the spliced voice data.
Wherein the training module is specifically configured to:
extract the acoustic features used by the binary classification model, input the extracted acoustic features into the long short-term memory network and the convolutional neural network, and train the spliced-speech model with these networks.
Wherein the apparatus for recognizing spliced voice further comprises:
an updating module;
the updating module is used for updating the parameters of the long short-term memory network and the convolutional neural network using a cross-entropy loss function and an optimization algorithm, and retraining the binary classification model with the updated networks over a preset number of iterations.
According to still another aspect of the present invention, there is provided a device for recognizing spliced voice, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method for recognizing spliced voice described in any one of the above aspects.
According to yet another aspect of the present invention, there is provided a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method for recognizing spliced voice according to any one of the above aspects.
It can be seen that, with the above scheme, normal voice data of a user can be acquired, cut into a preset number of segments, and spliced in a shuffled, out-of-order sequence to obtain spliced voice data; a binary classification model can be constructed based on the normal voice data and the spliced voice data and trained as a spliced-speech model using a long short-term memory network and a convolutional neural network; and spliced voice can then be recognized in voice data according to the trained model. Spliced speech can thus be recognized, and the security of voice verification ensured.
Further, the above scheme may construct the binary classification model by extracting linear predictive analysis features and pitch features from the normal voice data and the spliced voice data respectively, performing a difference operation and a normalization operation on these features, and using the processed features as the training inputs of the long short-term memory network and the convolutional neural network. This is advantageous because the long short-term memory network and the convolutional neural network can retain information about the audio context, which facilitates the recognition of spliced speech.
Further, the above scheme may extract the acoustic features used by the binary classification model, input the extracted acoustic features into the long short-term memory network and the convolutional neural network, and train the spliced-speech model with these networks. This is advantageous because the extracted acoustic features make the characteristics of spliced speech more prominent, improving the accuracy of its recognition.
Further, the above scheme may update the parameters of the long short-term memory network and the convolutional neural network using a cross-entropy loss function and an optimization algorithm, and retrain the binary classification model with the updated networks over a preset number of iterations, which improves the accuracy of spliced-speech recognition.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of an embodiment of a method for recognizing spliced speech according to the present invention;
FIG. 2 is a flow chart of another embodiment of the method for recognizing spliced speech according to the present invention;
FIG. 3 is a schematic structural diagram of an embodiment of an apparatus for recognizing spliced speech according to the present invention;
FIG. 4 is a schematic structural diagram of another embodiment of an apparatus for recognizing spliced speech according to the present invention;
FIG. 5 is a schematic structural diagram of an embodiment of a device for recognizing spliced speech according to the present invention.
Detailed Description
The present invention will be described in further detail below with reference to the accompanying drawings and embodiments. It should be noted that the following embodiments are only illustrative and do not limit the scope of the present invention. Likewise, the following embodiments are only some, not all, of the embodiments of the present invention; all other embodiments obtained by those skilled in the art without creative work fall within the scope of the present invention.
The invention provides a method for recognizing spliced voice, which can recognize spliced speech and thereby ensure the security of voice verification.
Referring to FIG. 1, FIG. 1 is a flowchart of an embodiment of the method for recognizing spliced voice according to the present invention. It should be noted that, provided the results are substantially the same, the method of the present invention is not limited to the flow sequence shown in FIG. 1. As shown in FIG. 1, the method comprises the following steps:
S101: acquire normal voice data of the user.
In this embodiment, the user may be a single user or multiple users, which is not limited by the present invention.
In this embodiment, the normal voice data of multiple users may be acquired all at once, in several batches, or one user at a time.
S102: and cutting the normal voice data into preset segments, and splicing the normal voice data cut into the preset segments according to the voice disorder sequence to obtain spliced voice data.
In this embodiment, the normal voice data may be cut into 2 preset segments, or the normal voice data may be cut into 3 preset segments, or the normal voice data may be cut into other preset segments, which is not limited by the present invention.
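For illustration only, a minimal sketch of this cutting-and-splicing step might look as follows, assuming the audio is held as a NumPy array of samples; the segment count and shuffling policy are illustrative assumptions, not requirements of the patent:

```python
import numpy as np

def make_spliced_speech(samples: np.ndarray, num_segments: int = 3,
                        rng: np.random.Generator | None = None) -> np.ndarray:
    """Cut normal speech into a preset number of segments and re-splice
    them in a shuffled (out-of-order) sequence."""
    rng = rng or np.random.default_rng()
    segments = np.array_split(samples, num_segments)
    order = rng.permutation(num_segments)
    # Re-draw until the order differs from the original, so the
    # spliced result is genuinely out of order.
    while num_segments > 1 and np.array_equal(order, np.arange(num_segments)):
        order = rng.permutation(num_segments)
    return np.concatenate([segments[i] for i in order])
```

Applying such a function to each normal utterance would yield the positive (spliced) class of the training corpus, while the untouched utterances form the negative (normal) class.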
S103: and constructing a binary classification model based on the normal voice data and the spliced voice data.
Wherein constructing the binary classification model based on the normal voice data and the spliced voice data may comprise:
extracting linear predictive coding (LPC) features and pitch features from the normal voice data and the spliced voice data respectively, performing a difference operation and a normalization operation on the LPC and pitch features, and using the processed features as the training inputs of a long short-term memory (LSTM) network and a convolutional neural network (CNN), thereby constructing the binary classification model based on the normal voice data and the spliced voice data.
S104: train the binary classification model as a spliced-speech model using a long short-term memory network and a convolutional neural network.
Wherein training the binary classification model as a spliced-speech model using the long short-term memory network and the convolutional neural network may comprise:
extracting the acoustic features used by the binary classification model, inputting the extracted acoustic features into the long short-term memory network and the convolutional neural network, and training the spliced-speech model with these networks. This has the advantage that the extracted acoustic features make the characteristics of spliced voice more prominent, improving the accuracy of its recognition.
In this embodiment, the long short-term memory network and the convolutional neural network may comprise two long short-term memory layers and two fully-connected layers, three long short-term memory layers and three fully-connected layers, or four long short-term memory layers and four fully-connected layers, which is not limited by the present invention.
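For illustration, a minimal PyTorch sketch of such a classifier is given below, assuming the two-LSTM-layer, two-fully-connected-layer variant; the 1-D convolutional front-end and its channel sizes are assumptions, and the input dimension of 28 matches the feature sketch above:

```python
import torch
import torch.nn as nn

class SpliceDetector(nn.Module):
    """CNN front-end + two stacked LSTM layers + two fully-connected
    layers, classifying an utterance as normal (0) or spliced (1)."""

    def __init__(self, feat_dim: int = 28, hidden: int = 128):
        super().__init__()
        # 1-D convolution over time to pick up local splice artifacts.
        self.cnn = nn.Sequential(
            nn.Conv1d(feat_dim, 64, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Two stacked LSTM layers retain information of the audio context.
        self.lstm = nn.LSTM(input_size=64, hidden_size=hidden,
                            num_layers=2, batch_first=True)
        # Two fully-connected layers map to the two classes.
        self.fc = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim)
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)  # (batch, time, 64)
        out, _ = self.lstm(h)
        return self.fc(out[:, -1])  # logits from the final time step
```

Feeding the model a feature tensor of shape (1, T, 28), as produced by the feature sketch above, yields two logits whose argmax is the predicted class.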
S105: recognize spliced voice in voice data according to the binary classification model trained as a spliced-speech model.
Wherein, after recognizing spliced voice in the voice data according to the trained binary classification model, the method may further comprise:
updating the parameters of the long short-term memory network and the convolutional neural network using a cross-entropy loss function and an optimization algorithm, and retraining the binary classification model with the updated networks over a preset number of iterations. This has the advantage of improving the accuracy of spliced-speech recognition.
It can be seen that, in this embodiment, normal voice data of a user can be acquired, cut into a preset number of segments, and spliced in a shuffled, out-of-order sequence to obtain spliced voice data; a binary classification model can be constructed based on the normal voice data and the spliced voice data and trained as a spliced-speech model using a long short-term memory network and a convolutional neural network; and spliced voice can then be recognized in voice data according to the trained model, so that spliced speech is recognized and the security of voice verification ensured.
Further, in this embodiment, the binary classification model may be constructed by extracting linear predictive analysis features and pitch features from the normal voice data and the spliced voice data respectively, performing a difference operation and a normalization operation on these features, and using the processed features as the training inputs of the long short-term memory network and the convolutional neural network. This is advantageous because these networks can retain information about the audio context, which facilitates the recognition of spliced speech.
Further, in this embodiment, the acoustic features used by the binary classification model may be extracted and input into the long short-term memory network and the convolutional neural network, and the spliced-speech model may be trained with these networks.
Referring to FIG. 2, FIG. 2 is a flowchart of another embodiment of the method for recognizing spliced voice according to the present invention. In this embodiment, the method comprises the following steps:
S201: acquire normal voice data of the user.
As described above in S101; details are not repeated here.
S202: cut the normal voice data into a preset number of segments, and splice the segments in a shuffled, out-of-order sequence to obtain spliced voice data.
As described above in S102; details are not repeated here.
S203: construct a binary classification model based on the normal voice data and the spliced voice data.
As described above in S103; details are not repeated here.
S204: train the binary classification model as a spliced-speech model using a long short-term memory network and a convolutional neural network.
As described above in S104; details are not repeated here.
S205: recognize spliced voice in voice data according to the trained binary classification model.
S206: update the parameters of the long short-term memory network and the convolutional neural network using a cross-entropy loss function and an optimization algorithm, and retrain the binary classification model with the updated networks over a preset number of iterations.
It can be seen that, in this embodiment, the parameters of the long short-term memory network and the convolutional neural network can be updated using a cross-entropy loss function and an optimization algorithm, and the binary classification model can be retrained with the updated networks over a preset number of iterations, which improves the accuracy of spliced-speech recognition.
The invention also provides an apparatus for recognizing spliced voice, which can recognize spliced speech and thereby ensure the security of voice verification.
Referring to FIG. 3, FIG. 3 is a schematic structural diagram of an embodiment of the apparatus for recognizing spliced voice according to the present invention. In this embodiment, the apparatus 30 for recognizing spliced voice includes an acquisition module 31, a splicing module 32, a construction module 33, a training module 34, and a recognition module 35.
The acquisition module 31 is configured to acquire normal voice data of a user.
The splicing module 32 is configured to cut the normal voice data into a preset number of segments and splice the segments in a shuffled, out-of-order sequence to obtain spliced voice data.
The construction module 33 is configured to construct a binary classification model based on the normal voice data and the spliced voice data.
The training module 34 is configured to train the binary classification model as a spliced-speech model using a long short-term memory network and a convolutional neural network.
The recognition module 35 is configured to recognize spliced voice in voice data according to the trained binary classification model.
Optionally, the construction module 33 may be specifically configured to:
extract linear predictive analysis features and pitch features from the normal voice data and the spliced voice data respectively, perform a difference operation and a normalization operation on the features, and use the processed features as the training inputs of the long short-term memory network and the convolutional neural network, thereby constructing the binary classification model based on the normal voice data and the spliced voice data.
Optionally, the training module 34 may be specifically configured to:
extract the acoustic features used by the binary classification model, input the extracted acoustic features into the long short-term memory network and the convolutional neural network, and train the spliced-speech model with these networks.
Referring to FIG. 4, FIG. 4 is a schematic structural diagram of another embodiment of the apparatus for recognizing spliced voice according to the present invention. Different from the previous embodiment, the apparatus 40 for recognizing spliced voice of this embodiment further includes an updating module 41.
The updating module 41 is configured to update the parameters of the long short-term memory network and the convolutional neural network using a cross-entropy loss function and an optimization algorithm, and to retrain the binary classification model with the updated networks over a preset number of iterations.
Each module of the apparatus 30/40 for recognizing spliced voice can execute the corresponding steps in the above method embodiments; details are not repeated here, and reference is made to the description of the corresponding steps above.
The present invention further provides a device for recognizing spliced voice. As shown in FIG. 5, the device includes: at least one processor 51; and a memory 52 communicatively coupled to the at least one processor 51. The memory 52 stores instructions executable by the at least one processor 51, and the instructions are executed by the at least one processor 51 to enable the at least one processor 51 to perform the above-described method for recognizing spliced voice.
The memory 52 and the processor 51 are connected by a bus, which may comprise any number of interconnected buses and bridges coupling various circuits of the processor 51 and the memory 52 together. The bus may also connect various other circuits, such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore not described further herein. A bus interface provides an interface between the bus and a transceiver. The transceiver may be a single element or a plurality of elements, such as multiple receivers and transmitters, providing a unit for communicating with various other apparatus over a transmission medium. Data processed by the processor 51 is transmitted over a wireless medium via an antenna, which also receives incoming data and forwards it to the processor 51.
The processor 51 is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory 52 may be used to store data used by the processor 51 in performing operations.
The present invention further provides a computer-readable storage medium storing a computer program. The computer program realizes the above-described method embodiments when executed by a processor.
In the several embodiments provided by the present invention, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into modules or units is only a logical division, and an actual implementation may divide them differently; multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
Units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be substantially or partially implemented in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description presents only some of the embodiments of the present invention and is not intended to limit its scope. Any equivalent device or process made using the contents of this specification and the drawings, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of the present invention.
Claims (10)
1. A method for recognizing spliced speech, comprising:
acquiring normal voice data of a user;
cutting the normal voice data into a preset number of segments, and splicing the segments in a shuffled, out-of-order sequence to obtain spliced voice data;
constructing a binary classification model based on the normal voice data and the spliced voice data;
training the binary classification model as a spliced-speech model using a long short-term memory network and a convolutional neural network;
and recognizing spliced voice in voice data according to the trained binary classification model.
2. The method for recognizing spliced speech according to claim 1, wherein constructing the binary classification model based on the normal voice data and the spliced voice data comprises:
extracting linear predictive analysis features and pitch features from the normal voice data and the spliced voice data respectively, performing a difference operation and a normalization operation on the features, and using the processed features as the training inputs of the long short-term memory network and the convolutional neural network, thereby constructing the binary classification model based on the normal voice data and the spliced voice data.
3. The method for recognizing spliced speech according to claim 1, wherein training the binary classification model as a spliced-speech model using the long short-term memory network and the convolutional neural network comprises:
extracting the acoustic features used by the binary classification model, inputting the extracted acoustic features into the long short-term memory network and the convolutional neural network, and training the spliced-speech model with these networks.
4. The method for recognizing spliced speech according to claim 1, further comprising, after recognizing spliced voice in the voice data according to the trained binary classification model:
updating the parameters of the long short-term memory network and the convolutional neural network using a cross-entropy loss function and an optimization algorithm, and retraining the binary classification model with the updated networks over a preset number of iterations.
5. An apparatus for recognizing spliced speech, comprising:
an acquisition module, a splicing module, a construction module, a training module, and a recognition module;
the acquisition module is configured to acquire normal voice data of a user;
the splicing module is configured to cut the normal voice data into a preset number of segments and splice the segments in a shuffled, out-of-order sequence to obtain spliced voice data;
the construction module is configured to construct a binary classification model based on the normal voice data and the spliced voice data;
the training module is configured to train the binary classification model as a spliced-speech model using a long short-term memory network and a convolutional neural network;
and the recognition module is configured to recognize spliced voice in voice data according to the trained binary classification model.
6. The apparatus for recognizing spliced speech according to claim 5, wherein the construction module is specifically configured to:
extract linear predictive analysis features and pitch features from the normal voice data and the spliced voice data respectively, perform a difference operation and a normalization operation on the features, and use the processed features as the training inputs of the long short-term memory network and the convolutional neural network, thereby constructing the binary classification model based on the normal voice data and the spliced voice data.
7. The apparatus for recognizing spliced speech according to claim 5, wherein the training module is specifically configured to:
extract the acoustic features used by the binary classification model, input the extracted acoustic features into the long short-term memory network and the convolutional neural network, and train the spliced-speech model with these networks.
8. The apparatus for recognizing spliced speech according to claim 5, further comprising:
an updating module;
the updating module being configured to update the parameters of the long short-term memory network and the convolutional neural network using a cross-entropy loss function and an optimization algorithm, and to retrain the binary classification model with the updated networks over a preset number of iterations.
9. A device for recognizing spliced speech, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method for recognizing spliced speech according to any one of claims 1 to 4.
10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for recognizing spliced speech according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010002558.9A CN111009238B (en) | 2020-01-02 | 2020-01-02 | Method, device and equipment for recognizing spliced voice |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111009238A | 2020-04-14
CN111009238B | 2023-06-23
Family
ID=70120411
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010002558.9A Active CN111009238B (en) | 2020-01-02 | 2020-01-02 | Method, device and equipment for recognizing spliced voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111009238B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111477235A (en) * | 2020-04-15 | 2020-07-31 | 厦门快商通科技股份有限公司 | Voiceprint acquisition method, device and equipment |
CN111583947A (en) * | 2020-04-30 | 2020-08-25 | 厦门快商通科技股份有限公司 | Voice enhancement method, device and equipment |
CN111583946A (en) * | 2020-04-30 | 2020-08-25 | 厦门快商通科技股份有限公司 | Voice signal enhancement method, device and equipment |
CN111599351A (en) * | 2020-04-30 | 2020-08-28 | 厦门快商通科技股份有限公司 | Voice recognition method, device and equipment |
CN113516969A (en) * | 2021-09-14 | 2021-10-19 | 北京远鉴信息技术有限公司 | Spliced voice identification method and device, electronic equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102456345A (en) * | 2010-10-19 | 2012-05-16 | 盛乐信息技术(上海)有限公司 | Concatenated speech detection system and method |
US20160027444A1 (en) * | 2014-07-22 | 2016-01-28 | Nuance Communications, Inc. | Method and apparatus for detecting splicing attacks on a speaker verification system |
CN108288470A (en) * | 2017-01-10 | 2018-07-17 | 富士通株式会社 | Auth method based on vocal print and device |
CN109376264A (en) * | 2018-11-09 | 2019-02-22 | 广州势必可赢网络科技有限公司 | A kind of audio-frequency detection, device, equipment and computer readable storage medium |
CN110491391A (en) * | 2019-07-02 | 2019-11-22 | 厦门大学 | A kind of deception speech detection method based on deep neural network |
Also Published As
Publication number | Publication date |
---|---|
CN111009238B (en) | 2023-06-23 |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |