CN111009238A - Spliced voice recognition method, device and equipment
- Publication number
- CN111009238A (application CN202010002558.9A)
- Authority
- CN
- China
- Prior art keywords
- spliced
- voice data
- speech
- convolutional neural
- long
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- Y02T10/40—Engine management systems (under Y02T—Climate change mitigation technologies related to transportation)
Abstract
The invention discloses a method, an apparatus, and a device for recognizing spliced speech. The method comprises the following steps: acquiring normal voice data of a user; cutting the normal voice data into a preset number of segments and splicing the segments in a shuffled, out-of-order sequence to obtain spliced voice data; constructing a binary classification model based on the normal voice data and the spliced voice data; training the binary classification model as a spliced-speech model using a long short-term memory network and a convolutional neural network; and recognizing spliced voice in voice data according to the trained binary classification model. In this way, spliced speech can be recognized and the security of voice verification ensured.
Description
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a method, an apparatus, and a device for recognizing spliced speech.
Background
In many real-life scenarios, users often need to pass voice verification, for example when logging in to a software program or a terminal device. However, an attacker may cut up recordings of another user's voice, splice the pieces into speech with specific audio content, and attempt to pass voice verification by impersonating the real user with the spliced speech, in order to illegally obtain benefits or perform illegal operations.
The prior art, however, cannot recognize such spliced speech, and therefore cannot guarantee the security of voice verification.
Disclosure of Invention
In view of this, the present invention provides a method, an apparatus, and a device for recognizing spliced speech, which can recognize spliced speech and thereby ensure the security of voice verification.
According to one aspect of the present invention, there is provided a method for recognizing spliced speech, including:
acquiring normal voice data of a user;
cutting the normal voice data into a preset number of segments, and splicing the segments in a shuffled, out-of-order sequence to obtain spliced voice data;
constructing a binary classification model based on the normal voice data and the spliced voice data;
training the binary classification model as a spliced-speech model using a long short-term memory network and a convolutional neural network;
and recognizing spliced voice in voice data according to the trained binary classification model.
Wherein constructing the binary classification model based on the normal voice data and the spliced voice data comprises:
extracting linear predictive analysis features and pitch features from the normal voice data and the spliced voice data respectively, performing a difference operation and a normalization operation on the features, and using the processed features as the training inputs of the long short-term memory network and the convolutional neural network, thereby constructing the binary classification model based on the normal voice data and the spliced voice data.
Wherein training the binary classification model as a spliced-speech model using the long short-term memory network and the convolutional neural network comprises:
extracting the acoustic features used by the binary classification model, inputting the extracted acoustic features into the long short-term memory network and the convolutional neural network, and training the spliced-speech model with these networks.
Wherein, after recognizing spliced voice in the voice data according to the trained binary classification model, the method further comprises:
updating the parameters of the long short-term memory network and the convolutional neural network using a cross-entropy loss function and an optimization algorithm, and retraining the binary classification model with the updated networks over a preset number of iterations.
According to another aspect of the present invention, there is provided an apparatus for recognizing spliced voice, comprising:
the system comprises an acquisition module, a splicing module, a construction module, a training module and an identification module;
the acquisition module is used for acquiring normal voice data of a user;
the splicing module is used for cutting the normal voice data into a preset number of segments and splicing the segments in a shuffled, out-of-order sequence to obtain spliced voice data;
the construction module is used for constructing a binary classification model based on the normal voice data and the spliced voice data;
the training module is used for training the binary classification model as a spliced-speech model using a long short-term memory network and a convolutional neural network;
and the recognition module is used for recognizing spliced voice in voice data according to the trained binary classification model.
Wherein the construction module is specifically configured to:
extract linear predictive analysis features and pitch features from the normal voice data and the spliced voice data respectively, perform a difference operation and a normalization operation on the features, and use the processed features as the training inputs of the long short-term memory network and the convolutional neural network, thereby constructing the binary classification model based on the normal voice data and the spliced voice data.
Wherein the training module is specifically configured to:
extract the acoustic features used by the binary classification model, input the extracted acoustic features into the long short-term memory network and the convolutional neural network, and train the spliced-speech model with these networks.
Wherein the apparatus for recognizing spliced voice further comprises:
an updating module;
the updating module is used for updating the parameters of the long short-term memory network and the convolutional neural network using a cross-entropy loss function and an optimization algorithm, and retraining the binary classification model with the updated networks over a preset number of iterations.
According to still another aspect of the present invention, there is provided a device for recognizing spliced voice, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method for recognizing spliced voice described in any one of the above aspects.
According to yet another aspect of the present invention, there is provided a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method for recognizing spliced voice according to any one of the above aspects.
It can be seen that, with the above scheme, normal voice data of a user can be acquired, cut into a preset number of segments, and spliced in a shuffled, out-of-order sequence to obtain spliced voice data; a binary classification model can be constructed based on the normal voice data and the spliced voice data and trained as a spliced-speech model using a long short-term memory network and a convolutional neural network; and spliced voice can then be recognized in voice data according to the trained model. Spliced speech can thus be recognized, and the security of voice verification ensured.
Further, the above scheme may construct the binary classification model by extracting linear predictive analysis features and pitch features from the normal voice data and the spliced voice data respectively, performing a difference operation and a normalization operation on these features, and using the processed features as the training inputs of the long short-term memory network and the convolutional neural network. This is advantageous because the long short-term memory network and the convolutional neural network can retain information about the audio context, which facilitates the recognition of spliced speech.
Further, the above scheme may extract the acoustic features used by the binary classification model, input the extracted acoustic features into the long short-term memory network and the convolutional neural network, and train the spliced-speech model with these networks. This is advantageous because the extracted acoustic features make the characteristics of spliced speech more prominent, improving the accuracy of its recognition.
Further, the above scheme may update the parameters of the long short-term memory network and the convolutional neural network using a cross-entropy loss function and an optimization algorithm, and retrain the binary classification model with the updated networks over a preset number of iterations, which improves the accuracy of spliced-speech recognition.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of an embodiment of a method for recognizing spliced speech according to the present invention;
FIG. 2 is a flow chart of another embodiment of the method for recognizing spliced speech according to the present invention;
FIG. 3 is a schematic structural diagram of an embodiment of an apparatus for recognizing spliced speech according to the present invention;
FIG. 4 is a schematic structural diagram of another embodiment of an apparatus for recognizing spliced speech according to the present invention;
FIG. 5 is a schematic structural diagram of an embodiment of a device for recognizing spliced speech according to the present invention.
Detailed Description
The present invention will be described in further detail below with reference to the accompanying drawings and embodiments. It should be noted that the following embodiments are only illustrative and do not limit the scope of the present invention. Likewise, the following embodiments are only some, not all, of the embodiments of the present invention; all other embodiments obtained by those skilled in the art without creative work fall within the scope of the present invention.
The invention provides a method for recognizing spliced voice, which can recognize spliced speech and thereby ensure the security of voice verification.
Referring to FIG. 1, FIG. 1 is a flowchart of an embodiment of the method for recognizing spliced voice according to the present invention. It should be noted that, provided the results are substantially the same, the method of the present invention is not limited to the flow sequence shown in FIG. 1. As shown in FIG. 1, the method comprises the following steps:
S101: acquire normal voice data of the user.
In this embodiment, the user may be a single user or multiple users, which is not limited by the present invention.
In this embodiment, the normal voice data of multiple users may be acquired all at once, in several batches, or one user at a time.
S102: and cutting the normal voice data into preset segments, and splicing the normal voice data cut into the preset segments according to the voice disorder sequence to obtain spliced voice data.
In this embodiment, the normal voice data may be cut into 2 preset segments, or the normal voice data may be cut into 3 preset segments, or the normal voice data may be cut into other preset segments, which is not limited by the present invention.
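For illustration only, a minimal sketch of this cutting-and-splicing step might look as follows, assuming the audio is held as a NumPy array of samples; the segment count and shuffling policy are illustrative assumptions, not requirements of the patent:

```python
import numpy as np

def make_spliced_speech(samples: np.ndarray, num_segments: int = 3,
                        rng: np.random.Generator | None = None) -> np.ndarray:
    """Cut normal speech into a preset number of segments and re-splice
    them in a shuffled (out-of-order) sequence."""
    rng = rng or np.random.default_rng()
    segments = np.array_split(samples, num_segments)
    order = rng.permutation(num_segments)
    # Re-draw until the order differs from the original, so the
    # spliced result is genuinely out of order.
    while num_segments > 1 and np.array_equal(order, np.arange(num_segments)):
        order = rng.permutation(num_segments)
    return np.concatenate([segments[i] for i in order])
```

Applying such a function to each normal utterance would yield the positive (spliced) class of the training corpus, while the untouched utterances form the negative (normal) class.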
S103: and constructing a binary classification model based on the normal voice data and the spliced voice data.
Wherein constructing the binary classification model based on the normal voice data and the spliced voice data may comprise:
extracting linear predictive coding (LPC) features and pitch features from the normal voice data and the spliced voice data respectively, performing a difference operation and a normalization operation on the LPC and pitch features, and using the processed features as the training inputs of a long short-term memory (LSTM) network and a convolutional neural network (CNN), thereby constructing the binary classification model based on the normal voice data and the spliced voice data.
S104: train the binary classification model as a spliced-speech model using a long short-term memory network and a convolutional neural network.
Wherein training the binary classification model as a spliced-speech model using the long short-term memory network and the convolutional neural network may comprise:
extracting the acoustic features used by the binary classification model, inputting the extracted acoustic features into the long short-term memory network and the convolutional neural network, and training the spliced-speech model with these networks. This has the advantage that the extracted acoustic features make the characteristics of spliced voice more prominent, improving the accuracy of its recognition.
In this embodiment, the long short-term memory network and the convolutional neural network may comprise two long short-term memory layers and two fully-connected layers, three long short-term memory layers and three fully-connected layers, or four long short-term memory layers and four fully-connected layers, which is not limited by the present invention.
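For illustration, a minimal PyTorch sketch of such a classifier is given below, assuming the two-LSTM-layer, two-fully-connected-layer variant; the 1-D convolutional front-end and its channel sizes are assumptions, and the input dimension of 28 matches the feature sketch above:

```python
import torch
import torch.nn as nn

class SpliceDetector(nn.Module):
    """CNN front-end + two stacked LSTM layers + two fully-connected
    layers, classifying an utterance as normal (0) or spliced (1)."""

    def __init__(self, feat_dim: int = 28, hidden: int = 128):
        super().__init__()
        # 1-D convolution over time to pick up local splice artifacts.
        self.cnn = nn.Sequential(
            nn.Conv1d(feat_dim, 64, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Two stacked LSTM layers retain information of the audio context.
        self.lstm = nn.LSTM(input_size=64, hidden_size=hidden,
                            num_layers=2, batch_first=True)
        # Two fully-connected layers map to the two classes.
        self.fc = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim)
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)  # (batch, time, 64)
        out, _ = self.lstm(h)
        return self.fc(out[:, -1])  # logits from the final time step
```

Feeding the model a feature tensor of shape (1, T, 28), as produced by the feature sketch above, yields two logits whose argmax is the predicted class.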
S105: recognize spliced voice in voice data according to the binary classification model trained as a spliced-speech model.
Wherein, after recognizing spliced voice in the voice data according to the trained binary classification model, the method may further comprise:
updating the parameters of the long short-term memory network and the convolutional neural network using a cross-entropy loss function and an optimization algorithm, and retraining the binary classification model with the updated networks over a preset number of iterations. This has the advantage of improving the accuracy of spliced-speech recognition.
It can be seen that, in this embodiment, normal voice data of a user can be acquired, cut into a preset number of segments, and spliced in a shuffled, out-of-order sequence to obtain spliced voice data; a binary classification model can be constructed based on the normal voice data and the spliced voice data and trained as a spliced-speech model using a long short-term memory network and a convolutional neural network; and spliced voice can then be recognized in voice data according to the trained model, so that spliced speech is recognized and the security of voice verification ensured.
Further, in this embodiment, the binary classification model may be constructed by extracting linear predictive analysis features and pitch features from the normal voice data and the spliced voice data respectively, performing a difference operation and a normalization operation on these features, and using the processed features as the training inputs of the long short-term memory network and the convolutional neural network. This is advantageous because these networks can retain information about the audio context, which facilitates the recognition of spliced speech.
Further, in this embodiment, the acoustic features used by the binary classification model may be extracted and input into the long short-term memory network and the convolutional neural network, and the spliced-speech model may be trained with these networks.
Referring to FIG. 2, FIG. 2 is a flowchart of another embodiment of the method for recognizing spliced voice according to the present invention. In this embodiment, the method comprises the following steps:
S201: acquire normal voice data of the user.
As described above in S101; details are not repeated here.
S202: cut the normal voice data into a preset number of segments, and splice the segments in a shuffled, out-of-order sequence to obtain spliced voice data.
As described above in S102; details are not repeated here.
S203: construct a binary classification model based on the normal voice data and the spliced voice data.
As described above in S103; details are not repeated here.
S204: train the binary classification model as a spliced-speech model using a long short-term memory network and a convolutional neural network.
As described above in S104; details are not repeated here.
S205: recognize spliced voice in voice data according to the trained binary classification model.
S206: update the parameters of the long short-term memory network and the convolutional neural network using a cross-entropy loss function and an optimization algorithm, and retrain the binary classification model with the updated networks over a preset number of iterations.
It can be seen that, in this embodiment, the parameters of the long short-term memory network and the convolutional neural network can be updated using a cross-entropy loss function and an optimization algorithm, and the binary classification model can be retrained with the updated networks over a preset number of iterations, which improves the accuracy of spliced-speech recognition.
The invention also provides an apparatus for recognizing spliced voice, which can recognize spliced speech and thereby ensure the security of voice verification.
Referring to FIG. 3, FIG. 3 is a schematic structural diagram of an embodiment of the apparatus for recognizing spliced voice according to the present invention. In this embodiment, the apparatus 30 for recognizing spliced voice includes an acquisition module 31, a splicing module 32, a construction module 33, a training module 34, and a recognition module 35.
The acquisition module 31 is configured to acquire normal voice data of a user.
The splicing module 32 is configured to cut the normal voice data into a preset number of segments and splice the segments in a shuffled, out-of-order sequence to obtain spliced voice data.
The construction module 33 is configured to construct a binary classification model based on the normal voice data and the spliced voice data.
The training module 34 is configured to train the binary classification model as a spliced-speech model using a long short-term memory network and a convolutional neural network.
The recognition module 35 is configured to recognize spliced voice in voice data according to the trained binary classification model.
Optionally, the construction module 33 may be specifically configured to:
extract linear predictive analysis features and pitch features from the normal voice data and the spliced voice data respectively, perform a difference operation and a normalization operation on the features, and use the processed features as the training inputs of the long short-term memory network and the convolutional neural network, thereby constructing the binary classification model based on the normal voice data and the spliced voice data.
Optionally, the training module 34 may be specifically configured to:
extract the acoustic features used by the binary classification model, input the extracted acoustic features into the long short-term memory network and the convolutional neural network, and train the spliced-speech model with these networks.
Referring to FIG. 4, FIG. 4 is a schematic structural diagram of another embodiment of the apparatus for recognizing spliced voice according to the present invention. Different from the previous embodiment, the apparatus 40 for recognizing spliced voice of this embodiment further includes an updating module 41.
The updating module 41 is configured to update the parameters of the long short-term memory network and the convolutional neural network using a cross-entropy loss function and an optimization algorithm, and to retrain the binary classification model with the updated networks over a preset number of iterations.
Each module of the apparatus 30/40 for recognizing spliced voice can execute the corresponding steps in the above method embodiments; details are not repeated here, and reference is made to the description of the corresponding steps above.
The present invention further provides a device for recognizing spliced voice. As shown in FIG. 5, the device includes: at least one processor 51; and a memory 52 communicatively coupled to the at least one processor 51. The memory 52 stores instructions executable by the at least one processor 51, and the instructions are executed by the at least one processor 51 to enable the at least one processor 51 to perform the above-described method for recognizing spliced voice.
The memory 52 and the processor 51 are connected by a bus, which may comprise any number of interconnected buses and bridges coupling various circuits of the processor 51 and the memory 52 together. The bus may also connect various other circuits, such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore not described further herein. A bus interface provides an interface between the bus and a transceiver. The transceiver may be a single element or a plurality of elements, such as multiple receivers and transmitters, providing a unit for communicating with various other apparatus over a transmission medium. Data processed by the processor 51 is transmitted over a wireless medium via an antenna, which also receives incoming data and forwards it to the processor 51.
The processor 51 is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory 52 may be used to store data used by the processor 51 in performing operations.
The present invention further provides a computer-readable storage medium storing a computer program. The computer program realizes the above-described method embodiments when executed by a processor.
In the several embodiments provided by the present invention, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into modules or units is only a logical division, and an actual implementation may divide them differently; multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
Units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be substantially or partially implemented in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description presents only some of the embodiments of the present invention and is not intended to limit its scope. Any equivalent device or process made using the contents of this specification and the drawings, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of the present invention.
Claims (10)
1. A method for recognizing spliced speech, comprising:
acquiring normal voice data of a user;
cutting the normal voice data into a preset number of segments, and splicing the segments in a shuffled, out-of-order sequence to obtain spliced voice data;
constructing a binary classification model based on the normal voice data and the spliced voice data;
training the binary classification model as a spliced-speech model using a long short-term memory network and a convolutional neural network;
and recognizing spliced voice in voice data according to the trained binary classification model.
2. The method for recognizing spliced speech according to claim 1, wherein constructing the binary classification model based on the normal voice data and the spliced voice data comprises:
extracting linear predictive analysis features and pitch features from the normal voice data and the spliced voice data respectively, performing a difference operation and a normalization operation on the features, and using the processed features as the training inputs of the long short-term memory network and the convolutional neural network, thereby constructing the binary classification model based on the normal voice data and the spliced voice data.
3. The method for recognizing spliced speech according to claim 1, wherein training the binary classification model as a spliced-speech model using the long short-term memory network and the convolutional neural network comprises:
extracting the acoustic features used by the binary classification model, inputting the extracted acoustic features into the long short-term memory network and the convolutional neural network, and training the spliced-speech model with these networks.
4. The method for recognizing spliced speech according to claim 1, further comprising, after recognizing spliced voice in the voice data according to the trained binary classification model:
updating the parameters of the long short-term memory network and the convolutional neural network using a cross-entropy loss function and an optimization algorithm, and retraining the binary classification model with the updated networks over a preset number of iterations.
5. An apparatus for recognizing spliced speech, comprising:
an acquisition module, a splicing module, a construction module, a training module, and a recognition module;
the acquisition module is configured to acquire normal voice data of a user;
the splicing module is configured to cut the normal voice data into a preset number of segments and splice the segments in a shuffled, out-of-order sequence to obtain spliced voice data;
the construction module is configured to construct a binary classification model based on the normal voice data and the spliced voice data;
the training module is configured to train the binary classification model as a spliced-speech model using a long short-term memory network and a convolutional neural network;
and the recognition module is configured to recognize spliced voice in voice data according to the trained binary classification model.
6. The apparatus for recognizing spliced speech according to claim 5, wherein the construction module is specifically configured to:
extract linear predictive analysis features and pitch features from the normal voice data and the spliced voice data respectively, perform a difference operation and a normalization operation on the features, and use the processed features as the training inputs of the long short-term memory network and the convolutional neural network, thereby constructing the binary classification model based on the normal voice data and the spliced voice data.
7. The apparatus for recognizing spliced speech according to claim 5, wherein the training module is specifically configured to:
extract the acoustic features used by the binary classification model, input the extracted acoustic features into the long short-term memory network and the convolutional neural network, and train the spliced-speech model with these networks.
8. The apparatus for recognizing spliced speech according to claim 5, further comprising:
an updating module;
the updating module being configured to update the parameters of the long short-term memory network and the convolutional neural network using a cross-entropy loss function and an optimization algorithm, and to retrain the binary classification model with the updated networks over a preset number of iterations.
9. A device for recognizing spliced speech, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method for recognizing spliced speech according to any one of claims 1 to 4.
10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for recognizing spliced speech according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010002558.9A CN111009238B (en) | 2020-01-02 | 2020-01-02 | Method, device and equipment for recognizing spliced voice |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111009238A | 2020-04-14
CN111009238B | 2023-06-23
Family
ID=70120411
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010002558.9A Active CN111009238B (en) | 2020-01-02 | 2020-01-02 | Method, device and equipment for recognizing spliced voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111009238B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111477235A (en) * | 2020-04-15 | 2020-07-31 | 厦门快商通科技股份有限公司 | Voiceprint acquisition method, device and equipment |
CN111583947A (en) * | 2020-04-30 | 2020-08-25 | 厦门快商通科技股份有限公司 | Voice enhancement method, device and equipment |
CN111583946A (en) * | 2020-04-30 | 2020-08-25 | 厦门快商通科技股份有限公司 | Voice signal enhancement method, device and equipment |
CN111599351A (en) * | 2020-04-30 | 2020-08-28 | 厦门快商通科技股份有限公司 | Voice recognition method, device and equipment |
CN113516969A (en) * | 2021-09-14 | 2021-10-19 | 北京远鉴信息技术有限公司 | Spliced voice identification method and device, electronic equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102456345A (en) * | 2010-10-19 | 2012-05-16 | 盛乐信息技术(上海)有限公司 | Concatenated speech detection system and method |
US20160027444A1 (en) * | 2014-07-22 | 2016-01-28 | Nuance Communications, Inc. | Method and apparatus for detecting splicing attacks on a speaker verification system |
CN108288470A (en) * | 2017-01-10 | 2018-07-17 | 富士通株式会社 | Auth method based on vocal print and device |
CN109376264A (en) * | 2018-11-09 | 2019-02-22 | 广州势必可赢网络科技有限公司 | A kind of audio-frequency detection, device, equipment and computer readable storage medium |
CN110491391A (en) * | 2019-07-02 | 2019-11-22 | 厦门大学 | A kind of deception speech detection method based on deep neural network |
Also Published As
Publication number | Publication date |
---|---|
CN111009238B (en) | 2023-06-23 |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |