CN114627856A - Voice recognition method, voice recognition device, storage medium and electronic equipment - Google Patents

Info

Publication number
CN114627856A
Authority
CN
China
Prior art keywords
data
feature
audio data
voice
recognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210331745.0A
Other languages
Chinese (zh)
Inventor
周立峰
朱浩齐
周森
杨卫强
李雨珂
魏凯峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Netease Zhiqi Technology Co Ltd
Original Assignee
Hangzhou Netease Zhiqi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Netease Zhiqi Technology Co Ltd filed Critical Hangzhou Netease Zhiqi Technology Co Ltd
Priority to CN202210331745.0A priority Critical patent/CN114627856A/en
Publication of CN114627856A publication Critical patent/CN114627856A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
        • G10 MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L 15/00 Speech recognition
                    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
                    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
                        • G10L 15/063 Training
                    • G10L 15/08 Speech classification or search
                • G10L 17/00 Speaker identification or verification techniques
                    • G10L 17/04 Training, enrolment or model building
                    • G10L 17/06 Decision making techniques; Pattern matching strategies
                • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
                    • G10L 25/03 characterised by the type of extracted parameters
                    • G10L 25/27 characterised by the analysis technique
                        • G10L 25/30 using neural networks
                    • G10L 25/48 specially adapted for particular use
                        • G10L 25/51 for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the disclosure relate to a voice recognition method, a voice recognition device, a storage medium and electronic equipment, in the technical field of audio processing. The method comprises the following steps: acquiring audio data to be recognized of a target person; inputting the audio data to be recognized into a first voice feature extraction submodel to obtain initial feature data of the audio data to be recognized; inputting the initial feature data into a second voice feature extraction submodel to obtain depth feature data of the audio data to be recognized; and inputting the depth feature data into a voice counterfeit-discrimination submodel to obtain a classification result of the audio data to be recognized. The method and the device improve the accuracy and efficiency of recognizing forged voice.

Description

Voice recognition method, voice recognition device, storage medium and electronic equipment
Technical Field
Embodiments of the present disclosure relate to the field of audio processing technologies, and in particular, to a speech recognition method, a speech recognition apparatus, a computer-readable storage medium, and an electronic device.
Background
As audio processing technology has matured, voice processing techniques are sometimes used to synthesize or otherwise forge the voice of a target person: for example, text information can be converted into audio data that sounds like the target person, or the audio data of another person can be converted into audio data of the target person. Such forged voice can harm the impersonated person's information security, property and other interests. In particular, for target persons with a certain social influence, such as entrepreneurs, forging their voice can cause even more serious adverse effects. It is therefore necessary to recognize forged voice.
This section is intended to provide a background or context to the embodiments of the disclosure recited in the claims and the description herein is not admitted to be prior art by inclusion in this section.
Disclosure of Invention
However, existing counterfeit-voice recognition schemes rely mainly on manual review, which is error-prone and inefficient.
For this reason, an improved speech recognition scheme that can raise both the efficiency and the accuracy of recognizing forged speech is highly desirable.
In this context, embodiments of the present disclosure are intended to provide a speech recognition method, apparatus, computer-readable storage medium, and electronic device.
According to a first aspect of embodiments of the present disclosure, there is provided a speech recognition method, the method including:
acquiring audio data to be identified of a target person;
inputting the audio data to be recognized into a first voice feature extraction submodel to obtain initial feature data of the audio data to be recognized, wherein the initial feature data comprises phase data of the audio data to be recognized, the first voice feature extraction submodel is a submodel of a pre-trained voice recognition model, and the voice recognition model further comprises a second voice feature extraction submodel and a voice counterfeit-discrimination submodel;
inputting the initial feature data into the second voice feature extraction submodel to obtain depth feature data of the audio data to be recognized, wherein the depth feature data comprises time domain feature data, frequency domain feature data and feature weight data of the frequency domain feature data of the audio data to be recognized;
and inputting the depth feature data into the voice counterfeit-discrimination submodel to obtain a classification result of the audio data to be recognized.
Optionally, the second voice feature extraction submodel includes a plurality of time-frequency feature extraction networks, a time-frequency feature fusion network, and a feature weight construction network, and the inputting the initial feature data into the second voice feature extraction submodel to obtain the depth feature data of the audio data to be recognized includes:
inputting the initial characteristic data into a plurality of time-frequency characteristic extraction networks which are connected in sequence to obtain time-domain characteristic data to be fused and frequency-domain characteristic data to be fused which are output by each time-frequency characteristic extraction network;
inputting a plurality of time domain feature data to be fused and a plurality of frequency domain feature data to be fused into the time-frequency feature fusion network, fusing the plurality of time domain feature data to obtain fused time domain feature data, and fusing the plurality of frequency domain feature data to obtain fused frequency domain feature data;
inputting the fused frequency domain feature data into the feature weight construction network, and determining feature weight data of the fused frequency domain feature data in a time sequence dimension;
determining the fusion time domain feature data as the time domain feature data, determining the fusion frequency domain feature data as the frequency domain feature data, and determining the feature weight data of the fusion frequency domain feature data in time sequence dimension as the feature weight data of the frequency domain feature data to obtain the depth feature data of the audio data to be identified.
Optionally, the audio data to be recognized is multiple frames of audio data to be recognized, the first speech feature extraction submodel is a wav2vec model, the wav2vec model includes a feature extraction layer and a context coding layer, and the inputting the audio data to be recognized into the first speech feature extraction submodel to obtain initial feature data of the audio data to be recognized includes:
inputting the audio data to be identified into the feature extraction layer to obtain shallow feature data of each frame of the audio data to be identified, wherein the shallow feature data comprises phase data of each frame of the audio data to be identified;
and inputting a plurality of shallow feature data into the context coding layer, and extracting associated feature data among all frames of the audio data to be recognized to obtain initial feature data of the audio data to be recognized.
Optionally, the inputting the depth feature data into the voice counterfeit-discrimination submodel to obtain a classification result of the audio data to be recognized includes:
inputting the depth feature data into the voice counterfeit-discrimination submodel to obtain a classification result value;
if the classification result value is larger than or equal to a preset threshold value, determining that the audio data to be recognized is a forged voice of the target person, wherein the forged voice comprises a synthesized voice and/or a converted voice;
and if the classification result value is smaller than a preset threshold value, determining that the audio data to be recognized is the real voice of the target person.
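As an illustration of the threshold rule just described, the following minimal Python sketch maps a classification result value to a label; the function name, the 0.5 default threshold and the convention that a higher value means forged voice are assumptions chosen for illustration, not values taken from the disclosure.

```python
def classify_audio(classification_value: float, threshold: float = 0.5) -> str:
    """Map the counterfeit-discrimination submodel's output value to a label.

    Assumes a higher value indicates forged (synthesized or converted) voice,
    mirroring the rule described above; the 0.5 threshold is illustrative.
    """
    if classification_value >= threshold:
        return "forged voice (synthesized and/or converted)"
    return "real voice of the target person"


# Example usage with an assumed model output value.
print(classify_audio(0.73))  # -> "forged voice (synthesized and/or converted)"
```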
Optionally, training the speech recognition model includes:
acquiring sample initial characteristic data corresponding to sample audio data of a plurality of target persons;
inputting the sample initial characteristic data into the second voice characteristic extraction submodel to be trained to obtain sample depth characteristic data of the sample audio data;
inputting the sample depth feature data into the voice counterfeit-discrimination submodel to obtain a prediction class classification value;
determining a first loss function value according to the prediction class classification value and a class label value of the sample audio data;
and updating the parameters of the second voice feature extraction submodel and the voice counterfeit-discrimination submodel according to the first loss function value.
Optionally, the inputting the sample initial feature data into the second speech feature extraction sub-model to be trained to obtain sample depth feature data of the sample audio data includes:
inputting the initial sample characteristic data into a plurality of time-frequency characteristic extraction networks to be trained which are connected in sequence to obtain time-domain characteristic data to be fused and frequency-domain characteristic data to be fused of the sample output by each time-frequency characteristic extraction network to be trained;
inputting a plurality of time domain feature data to be fused of the samples and a plurality of frequency domain feature data to be fused of the samples into the time-frequency feature fusion network to be trained, fusing the time domain feature data to be fused of the samples to obtain sample fusion time domain feature data, and fusing the frequency domain feature data to be fused of the samples to obtain sample fusion frequency domain feature data;
inputting the sample fusion frequency domain feature data into the feature weight construction network to be trained, and determining sample feature weight data of the sample fusion frequency domain feature data in a time sequence dimension;
determining the sample fusion time domain feature data as the sample time domain feature data, determining the sample fusion frequency domain feature data as the sample frequency domain feature data, and determining the sample feature weight data of the sample fusion frequency domain feature data in the time sequence dimension as the sample feature weight data of the sample frequency domain feature data, to obtain the sample depth feature data of the sample audio data.
Optionally, the determining a first loss function value according to the prediction class classification value and the class label value of the sample audio data includes:
inputting the prediction class classification value and the class label value of the sample audio data into a first loss function, and determining the first loss function value;
the first loss function (presented in the original filing as formula image BDA0003573306010000041) is defined in terms of: N, the number of sample audio data; i, the index of the i-th of the N sample audio data; α, a scale factor; the distance (formula image BDA0003573306010000042) between the center of the real voice feature data and the prediction class classification value; the center of the real voice feature data (formula image BDA0003573306010000043); the prediction class classification value (formula image BDA0003573306010000044); and y_i, the class label value of the sample audio data.
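The exact form of the first loss function survives in the source only as formula images, so the following PyTorch sketch merely assumes one plausible reading of the symbols above: a center-distance loss in which real samples are pulled toward the center of the real-voice feature data and forged samples are pushed at least α away. The label convention, the squared Euclidean distance and the margin interpretation of α are all assumptions, not the patent's formula.

```python
import torch

def center_margin_loss(embeddings: torch.Tensor,
                       labels: torch.Tensor,
                       center: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
    """Assumed stand-in for the first loss function (not the disclosed formula).

    embeddings: (N, D) prediction-side feature / classification vectors.
    labels:     (N,) with 1 = real voice, 0 = forged voice (assumed convention).
    center:     (D,) center of the real-voice feature data.
    alpha:      scale factor, treated here as a margin (assumption).
    """
    dist = torch.sum((embeddings - center) ** 2, dim=1)            # distance to real-voice center
    real_term = labels * dist                                      # pull real samples toward the center
    fake_term = (1 - labels) * torch.clamp(alpha - dist, min=0.0)  # push forged samples beyond the margin
    return (real_term + fake_term).mean()
```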
Optionally, the speech recognition model further includes a language classification submodel and a gradient back-propagation layer, and after the sample initial feature data is input into the second speech feature extraction submodel to be trained to obtain sample depth feature data of the sample audio data, the method further includes:
inputting the sample depth feature data into the gradient reverse transmission layer and the language classification submodel in sequence to obtain a predicted language classification value;
determining a second loss function value according to the predicted language classification value and the language label value of the sample audio data;
the updating the parameters of the second voice feature extraction submodel and the voice counterfeit-discrimination submodel according to the first loss function value includes the following steps:
updating the parameters of the voice counterfeit-discrimination submodel according to the first loss function value, and updating the parameters of the language classification submodel according to the second loss function value;
and updating the parameters of the second voice feature extraction submodel according to the first loss function value, the second loss function value and the gradient reverse transmission layer.
Optionally, the updating the parameter of the second speech feature extraction submodel according to the first loss function value, the second loss function value, and the gradient back-propagation layer includes:
transmitting the second loss function value to the gradient reverse transmission layer, and processing the second loss function value according to a gradient reverse transmission parameter to obtain a second loss function updating value;
processing a second loss function according to the gradient back propagation parameter to obtain an updated second loss function, and determining a combined loss function according to the first loss function and the updated second loss function;
determining a gradient value of the combined loss function with respect to a parameter of the second speech feature extraction submodel;
and obtaining the updated parameters of the second speech feature extraction submodel according to the gradient value, the first loss function value, the second loss function updating value and the model learning rate.
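The gradient back-propagation (reversal) layer described above can be sketched in PyTorch as a custom autograd function that acts as the identity in the forward pass and scales the gradient by a negative coefficient in the backward pass; the class name, helper function and single coefficient lambd are illustrative assumptions rather than identifiers from the disclosure.

```python
import torch

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies the incoming gradient by -lambd
    in the backward pass, so the language classifier's loss pushes the second
    speech feature extraction submodel toward language-invariant features."""

    @staticmethod
    def forward(ctx, x: torch.Tensor, lambd: float) -> torch.Tensor:
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        return -ctx.lambd * grad_output, None


def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradientReversal.apply(x, lambd)


# Sample depth features pass through the reversal layer before the (assumed)
# language classification submodel; the reversed gradient then updates the
# second speech feature extraction submodel as described above.
features = torch.randn(8, 256, requires_grad=True)
reversed_features = grad_reverse(features, lambd=0.5)
```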
Optionally, training the speech recognition model includes:
acquiring sample audio data of a plurality of target persons;
inputting the sample audio data into the first voice feature extraction submodel to be trained to obtain sample initial feature data;
determining a third loss function value according to the sample initial characteristic data;
and updating the parameters of the first voice feature extraction submodel according to the third loss function value.
Optionally, before acquiring the audio data to be identified of the target person, the method further includes:
acquiring audio data to be processed;
extracting acoustic features of the audio data to be processed to obtain acoustic feature data of the audio data to be processed;
and if the voiceprint recognition result of the acoustic feature data corresponds to the voiceprint of the target person, determining the audio data to be processed as the audio data to be recognized of the target person.
According to a second aspect of embodiments of the present disclosure, there is provided a speech recognition apparatus, the apparatus comprising:
the acquisition module is configured to acquire audio data to be identified of a target person;
the first feature extraction module is configured to input the audio data to be recognized into a first voice feature extraction submodel to obtain initial feature data of the audio data to be recognized, the initial feature data comprise phase data of the audio data to be recognized, the first voice feature extraction submodel is a submodel of a pre-trained voice recognition model, and the voice recognition model further comprises a second voice feature extraction submodel and a voice counterfeit recognition submodel;
a second feature extraction module configured to input the initial feature data into the second voice feature extraction submodel to obtain depth feature data of the audio data to be recognized, where the depth feature data includes time domain feature data and frequency domain feature data of the audio data to be recognized, and feature weight data of the frequency domain feature data;
and a counterfeit-discrimination module configured to input the depth feature data into the voice counterfeit-discrimination submodel to obtain a classification result of the audio data to be recognized.
Optionally, the second speech feature extraction submodel includes a plurality of time-frequency feature extraction networks, a time-frequency feature fusion network, and a feature weight construction network, and the second feature extraction module is configured to:
inputting the initial characteristic data into a plurality of time-frequency characteristic extraction networks which are connected in sequence to obtain time-domain characteristic data to be fused and frequency-domain characteristic data to be fused which are output by each time-frequency characteristic extraction network;
inputting a plurality of time domain feature data to be fused and a plurality of frequency domain feature data to be fused into the time-frequency feature fusion network, fusing the time domain feature data to obtain fused time domain feature data, and fusing the frequency domain feature data to obtain fused frequency domain feature data;
inputting the fused frequency domain feature data into the feature weight construction network, and determining feature weight data of the fused frequency domain feature data in a time sequence dimension;
determining the fused time domain feature data as the time domain feature data, determining the fused frequency domain feature data as the frequency domain feature data, and determining the feature weight data of the fused frequency domain feature data in a time sequence dimension as the feature weight data of the frequency domain feature data to obtain the depth feature data of the audio data to be identified.
Optionally, the audio data to be recognized is multiple frames of audio data to be recognized, the first speech feature extraction sub-model is a wav2vec model, the wav2vec model includes a feature extraction layer and a context coding layer, and the first feature extraction module is configured to:
inputting the audio data to be identified into the feature extraction layer to obtain shallow feature data of each frame of the audio data to be identified, wherein the shallow feature data comprises phase data of each frame of the audio data to be identified;
and inputting a plurality of shallow feature data into the context coding layer, and extracting associated feature data among all frames of the audio data to be recognized to obtain initial feature data of the audio data to be recognized.
Optionally, the counterfeit-discrimination module is configured to:
inputting the depth feature data into the voice counterfeit distinguishing submodel to obtain a classification result value;
if the classification result value is larger than or equal to a preset threshold value, determining that the audio data to be recognized is a forged voice of the target person, wherein the forged voice comprises a synthesized voice and/or a converted voice;
and if the classification result value is smaller than a preset threshold value, determining that the audio data to be recognized is the real voice of the target person.
Optionally, the apparatus further includes a first model training module configured to:
acquiring sample initial characteristic data corresponding to sample audio data of a plurality of target persons;
inputting the sample initial characteristic data into the second voice characteristic extraction submodel to be trained to obtain sample depth characteristic data of the sample audio data;
inputting the sample depth characteristic data into the voice identification sub-model to obtain a prediction class classification value;
determining a first loss function value from the prediction class classification value and a class label value of the sample audio data;
and updating the parameters of the second voice feature extraction submodel and the voice identification submodel according to the first loss function value.
Optionally, the first model training module is configured to:
inputting the initial sample characteristic data into a plurality of time-frequency characteristic extraction networks to be trained which are connected in sequence to obtain time-domain characteristic data to be fused and frequency-domain characteristic data to be fused of the sample output by each time-frequency characteristic extraction network to be trained;
inputting a plurality of time domain feature data to be fused of the samples and a plurality of frequency domain feature data to be fused of the samples into the time-frequency feature fusion network to be trained, fusing the time domain feature data to be fused of the samples to obtain sample fusion time domain feature data, and fusing the frequency domain feature data to be fused of the samples to obtain sample fusion frequency domain feature data;
inputting the sample fusion frequency domain feature data into the feature weight construction network to be trained, and determining sample feature weight data of the sample fusion frequency domain feature data in a time sequence dimension;
determining the sample fusion time domain feature data as the sample time domain feature data, determining the sample fusion frequency domain feature data as the sample frequency domain feature data, and determining the sample feature weight data of the sample fusion frequency domain feature data in the time sequence dimension as the sample feature weight data of the sample frequency domain feature data, to obtain the sample depth feature data of the sample audio data.
Optionally, the first model training module is configured to:
inputting the prediction class classification value and the class label value of the sample audio data into a first loss function, and determining the first loss function value;
the first loss function (presented in the original filing as formula image BDA0003573306010000081) is defined in terms of: N, the number of sample audio data; i, the index of the i-th of the N sample audio data; α, a scale factor; the distance (formula image BDA0003573306010000082) between the center of the real voice feature data and the prediction class classification value; the center of the real voice feature data (formula image BDA0003573306010000083); the prediction class classification value (formula image BDA0003573306010000084); and y_i, the class label value of the sample audio data.
Optionally, the speech recognition model further includes a language classification sub-model and a gradient back-propagation layer, and the first model training module is further configured to:
inputting the sample depth feature data into the gradient reverse transmission layer and the language classification submodel in sequence to obtain a predicted language classification value;
determining a second loss function value according to the predicted language classification value and the language label value of the sample audio data;
the first model training module configured to:
updating the parameters of the voice authentication submodel according to the first loss function value, and updating the parameters of the language classification submodel according to the second loss function value;
and updating the parameters of the second voice feature extraction submodel according to the first loss function value, the second loss function value and the gradient reverse transmission layer.
Optionally, the first model training module is configured to:
transmitting the second loss function value to the gradient reverse transmission layer, and processing the second loss function value according to a gradient reverse transmission parameter to obtain a second loss function updating value;
processing a second loss function according to the gradient back propagation parameter to obtain an updated second loss function, and determining a combined loss function according to the first loss function and the updated second loss function;
determining a gradient value of the combined loss function with respect to a parameter of the second speech feature extraction submodel;
and obtaining the updated parameters of the second speech feature extraction submodel according to the gradient value, the first loss function value, the second loss function updating value and the model learning rate.
Optionally, the apparatus further includes a second model training module configured to:
acquiring sample audio data of a plurality of target persons;
inputting the sample audio data into the first voice feature extraction submodel to be trained to obtain sample initial feature data;
determining a third loss function value according to the sample initial characteristic data;
and updating the parameters of the first voice feature extraction submodel according to the third loss function value.
Optionally, the apparatus further comprises an audio recognition module configured to:
acquiring audio data to be processed;
extracting acoustic features of the audio data to be processed to obtain acoustic feature data of the audio data to be processed;
and if the voiceprint recognition result of the acoustic feature data corresponds to the voiceprint of the target person, determining the audio data to be processed as the audio data to be recognized of the target person.
According to a third aspect of embodiments of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of the first aspect.
According to a fourth aspect of the disclosed embodiments, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the method of the first aspect via execution of the executable instructions.
According to the voice recognition method, apparatus, computer-readable storage medium and electronic device in the embodiments of the present disclosure, the phase information and time-frequency feature information of forged voice differ markedly from those of real voice. The voice recognition model can therefore classify whether the audio data to be recognized is forged voice based on the initial feature data, which contains the phase data of the audio data to be recognized, and the depth feature data, which contains the time-domain feature data, the frequency-domain feature data and the feature weight data of the frequency-domain feature data, thereby improving the accuracy of the resulting classification result.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 is a system architecture diagram illustrating a speech recognition method operating environment according to an embodiment of the present disclosure;
FIG. 2 illustrates a flow diagram of a speech recognition method of an embodiment of the present disclosure;
FIG. 3 illustrates a schematic diagram of a speech recognition model according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating a first speech feature extraction submodel according to an embodiment of the disclosure;
FIG. 5 is a schematic diagram illustrating a second speech feature extraction submodel according to an embodiment of the disclosure;
FIG. 6 illustrates a schematic flow chart for training a speech recognition model according to an embodiment of the present disclosure;
FIG. 7 illustrates a flow diagram for another method of training a speech recognition model according to embodiments of the present disclosure;
FIG. 8 illustrates a schematic structural diagram of another speech recognition model of an embodiment of the present disclosure;
FIG. 9 is a schematic flow chart diagram illustrating training a speech recognition model according to another embodiment of the present disclosure;
FIG. 10 is a flow chart illustrating a process of updating parameters of a second speech feature extraction submodel according to an embodiment of the disclosure;
fig. 11 is a schematic diagram illustrating a structure of a speech recognition apparatus according to an embodiment of the present disclosure;
fig. 12 shows a block diagram of the structure of an electronic device according to an embodiment of the present disclosure.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present disclosure will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the present disclosure, and are not intended to limit the scope of the present disclosure in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one of skill in the art, embodiments of the present disclosure may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to an embodiment of the disclosure, a voice recognition method, a voice recognition device, a computer-readable storage medium and an electronic device are provided.
In this document, any number of elements in the drawings is by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.
The principles and spirit of the present disclosure are explained in detail below with reference to several representative embodiments of the present disclosure.
Summary of The Invention
The inventors of the present disclosure have found that, with the development of audio processing technology, voice forgery can be carried out using audio processing techniques, and forged voice can be obtained by voice synthesis or voice conversion. Voice synthesis converts non-audio information (for example, text information, or text information in a picture) into audio data of a target person; the resulting forged voice can mislead the public into believing that the corresponding non-audio information was spoken personally by the target person and reflects the target person's real intent. Voice conversion converts the audio data of another person into audio data of the target person; with such forged voice, lawbreakers can dress up their own speech content in the target person's tone, style and pronunciation, and thereby pretend that the target person said it. For target persons with a certain social influence, the spread of forged voice about them on a streaming media platform can cause even more serious adverse effects. Therefore, before publishing audio files, a streaming media platform needs to authenticate the audio data in those files to prevent lawbreakers from publishing fake audio files concerning target persons.
To this end, before a streaming media platform publishes an audio file, it can examine the file and determine whether the corresponding audio data is forged voice of the target person. At present this examination is usually performed manually. The auditor must be familiar with the speech characteristics of each target person, such as speaking rhythm, style and tone, so auditing is inefficient and requires very high labor cost; moreover, the result depends on the auditor's professionalism and concentration, so forged voice is easily judged to be genuine and the accuracy of the audit result is low.
In view of the above, the basic idea of the present disclosure is: a speech recognition method, a speech recognition device, a computer-readable storage medium and an electronic device are provided, which can acquire audio data to be recognized of a target person; inputting audio data to be recognized into a first voice feature extraction submodel to obtain initial feature data of the audio data to be recognized, wherein the initial feature data comprises phase data of the audio data to be recognized, the first voice feature extraction submodel is a submodel of a pre-trained voice recognition model, and the voice recognition model further comprises a second voice feature extraction submodel and a voice counterfeit discrimination submodel; inputting the initial characteristic data into a second voice characteristic extraction submodel to obtain depth characteristic data of the audio data to be recognized, wherein the depth characteristic data comprises time domain characteristic data, frequency domain characteristic data and characteristic weight data of the frequency domain characteristic data of the audio data to be recognized; and inputting the depth characteristic data into the voice discrimination sub-model to obtain a classification result of the audio data to be recognized. The audio data to be recognized of the target person can be recognized based on the pre-trained voice recognition model, and the recognition efficiency of determining whether the audio data to be recognized is the forged voice is improved; the voice recognition model can improve the accuracy of judging whether the audio data to be recognized is forged data or not based on the initial characteristic data and the depth characteristic data of the audio data to be recognized.
Having described the general principles of the present disclosure, various non-limiting embodiments of the present disclosure are described in detail below.
Application scene overview
It should be noted that the following application scenarios are merely illustrated to facilitate understanding of the spirit and principles of the present disclosure, and embodiments of the present disclosure are not limited in this respect. Rather, embodiments of the present disclosure may be applied to any scenario where applicable.
The present disclosure can be applied to a scenario of recognizing a forged voice, and is particularly applicable to a scenario of recognizing a forged voice obtained by a voice synthesis technology or a voice conversion technology, for example: for an audio file to be issued, the music platform or the short video platform can identify the audio file to be issued by using a pre-trained voice recognition model after determining that the audio file to be issued is an audio file of a target person, so as to determine whether audio data stored in the audio file to be issued is forged audio data of the target person. By adopting the technical scheme of the embodiment of the disclosure, the recognition efficiency of the audio data in the audio file to be issued can be improved, and the accuracy of the recognition result can be improved.
Fig. 1 is a system architecture diagram illustrating an environment in which a speech recognition method provided in an embodiment of the present disclosure operates. As shown in fig. 1, the system architecture 100 may include: a server 110 and a user terminal 120. The server 110 may be a background server of the speech recognition service, for example, a server of a music platform, a server of a short video platform, or a server of an audio-video platform. The user terminal 120 may be a user terminal used by a user, and generally, a network connection may be established between the server 110 and the user terminal 120 for interaction.
In an alternative embodiment, the speech recognition server may pre-train the speech recognition model, and the trained speech recognition model may be deployed in the server 110; the server 110 may receive an audio and video uploading request sent by the user terminal 120, analyze the audio and video uploading request to obtain audio data to be recognized of a target person, and the server 110 may recognize the audio data to be recognized by using a pre-trained speech recognition model to obtain a classification result for the audio data to be recognized.
In an optional implementation manner, the voice recognition service party may pre-train a voice recognition model, the trained voice recognition model is configured in an application program of the voice recognition service party, the server 110 may receive an application program downloading request sent by the user terminal 120, and an application program installation package configured with the voice recognition model is sent to the user terminal 120, and the user terminal 120 may respond to an audio and video file uploading operation of a user, and recognize the audio data to be recognized by using the pre-trained voice recognition model, so as to obtain a classification result for the audio data to be recognized.
Exemplary method
An exemplary embodiment of the present disclosure first provides a speech recognition method, which may be applied to a server, and the embodiments of the present disclosure describe the speech recognition method by taking the example of applying the speech recognition method to the server. As shown in fig. 2, the method may include the following steps S201 to S204:
step S201, acquiring audio data to be identified of a target person;
in the disclosed embodiment, the target person is a particular person who may be voice forged. For example, if the voice of public characters such as leaders, entrepreneurs, stars, etc. is not allowed to be forged on the streaming media platform, the target person may be a public character; alternatively, if the voice of the user is not permitted to be forged in the verification of the bank account and the like relating to information security, property and the like, the target person may be the user to be verified.
Step S202, inputting the audio data to be recognized into a first voice feature extraction submodel to obtain initial feature data of the audio data to be recognized;
in the embodiment of the present disclosure, the initial feature data includes phase data of the audio data to be recognized, the first speech feature extraction submodel is a submodel of a pre-trained speech recognition model, and the speech recognition model further includes a second speech feature extraction submodel and a voice counterfeit-discrimination submodel.
Step S203, inputting the initial feature data into a second voice feature extraction submodel to obtain depth feature data of the audio data to be recognized;
in the embodiment of the present disclosure, the depth feature data includes time domain feature data, frequency domain feature data, and feature weight data of the frequency domain feature data of the audio data to be identified.
And step S204, inputting the depth feature data into the voice counterfeit-discrimination submodel to obtain a classification result of the audio data to be recognized.
To sum up, the voice recognition method provided by the embodiments of the present disclosure acquires audio data to be recognized of a target person, inputs the audio data to be recognized into the first voice feature extraction submodel to obtain initial feature data, inputs the initial feature data into the second voice feature extraction submodel to obtain depth feature data, and inputs the depth feature data into the voice counterfeit-discrimination submodel to obtain a classification result of the audio data to be recognized. Because the phase information and time-frequency feature information of forged voice differ greatly from those of real voice, the voice recognition model can classify whether the audio data to be recognized is forged based on initial feature data containing its phase data and depth feature data containing its time-domain feature data, frequency-domain feature data and the feature weight data of the frequency-domain feature data, which improves the accuracy of the resulting classification result.
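To make the three-submodel data flow of steps S201 to S204 concrete, the sketch below composes a first feature extractor, a second feature extractor and a fully connected counterfeit-discrimination layer into one PyTorch module; the class names, feature dimension, sigmoid output and the stand-in extractors in the usage example are assumptions chosen for illustration, not the disclosed architecture.

```python
import torch
import torch.nn as nn

class SpeechRecognitionModel(nn.Module):
    """Minimal skeleton of the pipeline in steps S201-S204 (assumed shapes)."""

    def __init__(self, first_extractor: nn.Module, second_extractor: nn.Module,
                 feature_dim: int = 256):
        super().__init__()
        self.first_extractor = first_extractor    # e.g. a wav2vec-style encoder
        self.second_extractor = second_extractor  # time-frequency depth features
        # Counterfeit-discrimination submodel: a fully connected layer producing
        # a single classification result value in [0, 1].
        self.discriminator = nn.Sequential(nn.Linear(feature_dim, 1), nn.Sigmoid())

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        initial_features = self.first_extractor(audio)             # step S202
        depth_features = self.second_extractor(initial_features)   # step S203
        return self.discriminator(depth_features)                  # step S204


# Toy usage with stand-in extractors (plain linear layers, not the real submodels),
# purely to show the data flow from raw samples to a classification result value.
first = nn.Sequential(nn.Linear(16000, 256), nn.ReLU())
second = nn.Sequential(nn.Linear(256, 256), nn.ReLU())
model = SpeechRecognitionModel(first, second)
score = model(torch.randn(2, 16000))   # (batch, samples) -> (batch, 1)
```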
In an alternative embodiment, in step S201, the server may obtain the audio data to be identified of the target person.
In the embodiment of the disclosure, when a third-party platform needs to publish an audio file, the audio data stored in that file can be treated as audio data to be processed, and the platform can judge whether the audio data to be processed is audio data of the target person. If it is, the audio data to be processed is taken as the audio data to be recognized of the target person; if it is not, no counterfeit-voice recognition needs to be performed on it, and the flow can end.
In an alternative embodiment, the process of the server determining whether the audio data to be processed is the audio data of the target person may include: acquiring audio data to be processed; extracting acoustic features of the audio data to be processed to obtain acoustic feature data of the audio data to be processed; and performing voiceprint recognition on the acoustic characteristic data, and if the voiceprint recognition result of the acoustic characteristic data corresponds to the voiceprint of the target person, determining the audio data to be processed as the audio data to be recognized of the target person. The acoustic feature refers to a physical quantity representing acoustic characteristics of voice, and may include an energy concentration region representing a tone color, a formant frequency, a formant intensity and a bandwidth, and a duration, a fundamental frequency, an average speech sound power, and the like representing prosodic characteristics of voice. Whether the audio data to be processed is the audio data of the target person or not can be determined based on an acoustic feature extraction technology and a voiceprint recognition technology, and recognition accuracy of the audio data to be processed is improved.
The server may extract the acoustic features of the audio data to be processed based on MFCC (Mel-Frequency Cepstral Coefficients) or FBank (Mel-scale filter bank) features; the embodiment of the present disclosure does not limit which is used. The server's voiceprint recognition of the acoustic feature data can likewise be implemented based on MFCC or FBank features.
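As a hedged sketch of this pre-processing step, torchaudio's MFCC transform can produce the acoustic feature data, after which a toy voiceprint check compares a pooled embedding against a stored voiceprint of the target person; the cosine-similarity comparison, the pooling and the 0.8 threshold are illustrative assumptions and are not specified in the disclosure.

```python
import torch
import torchaudio

def is_target_person(waveform: torch.Tensor, sample_rate: int,
                     target_voiceprint: torch.Tensor, threshold: float = 0.8) -> bool:
    """Extract MFCC acoustic features and compare a pooled embedding against a
    stored voiceprint of the target person (toy cosine-similarity check)."""
    mfcc = torchaudio.transforms.MFCC(sample_rate=sample_rate, n_mfcc=20)(waveform)
    embedding = mfcc.mean(dim=-1).flatten()        # crude utterance-level pooling
    similarity = torch.nn.functional.cosine_similarity(
        embedding, target_voiceprint, dim=0)
    return bool(similarity >= threshold)


# Example with dummy data; real use would load actual audio and a stored voiceprint.
wav = torch.randn(1, 16000)
print(is_target_person(wav, 16000, target_voiceprint=torch.randn(20)))
```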
In an optional implementation manner, when determining that the audio data to be processed is the audio data of the target person, the server determines the audio data to be processed as the audio data to be identified of the target person, and acquires the audio data to be identified of the target person.
In an optional embodiment, in step S202, the server may input the audio data to be recognized into the first speech feature extraction sub-model, so as to obtain initial feature data of the audio data to be recognized.
In the embodiment of the disclosure, for the audio data to be recognized of the target person, the voice data to be recognized may be recognized based on a pre-trained voice recognition model to determine whether the audio data to be recognized is a counterfeit voice of the target person.
In an alternative implementation, as shown in fig. 3, which illustrates the structure of a speech recognition model provided by an embodiment of the present disclosure, the speech recognition model may include a first speech feature extraction submodel 301, a second speech feature extraction submodel 302 and a voice counterfeit-discrimination submodel 303. The first speech feature extraction submodel 301 extracts initial feature data of the speech data to be recognized; the initial feature data includes phase data of the audio data to be recognized, that is, data describing its phase information. The second speech feature extraction submodel 302 performs depth feature extraction on the initial feature data to obtain depth feature data of the audio data to be recognized, which may include time-domain feature data, frequency-domain feature data, and feature weight data of the frequency-domain feature data. The voice counterfeit-discrimination submodel 303, a fully connected network layer, recognizes the speech data to be recognized according to the depth feature data and judges whether it is forged voice of the target person.
It should be noted that the voice recognition model provided in the embodiment of the present disclosure includes two feature extraction submodels, where the first voice feature extraction submodel can extract richer feature data in the audio data to be recognized, and whether the audio data to be recognized is a counterfeit voice is recognized based on the feature data extracted by the first voice feature extraction submodel, so as to improve the accuracy of the voice recognition result; the second voice feature extraction submodel can determine the time domain feature and the frequency domain feature of the audio data to be recognized based on the initial feature data extracted by the first voice feature extraction submodel, and determine the feature weight for the frequency domain feature so as to extract the feature data which can be used for accurately performing voice recognition in the audio data to be recognized, thereby further improving the accuracy of the voice recognition result determined by the voice recognition model.
In an alternative embodiment, since the first speech feature extraction submodel needs to obtain more abundant and comprehensive feature information in the audio data to be recognized, the first speech feature extraction submodel may include a wav2vec model. As shown in fig. 4, the first speech feature extraction submodel may be a wav2vec2.0 model, the wav2vec2.0 model includes a feature extraction layer 401 and a context coding layer 402, and the feature extraction layer is configured to extract shallow feature data in the audio data, where the shallow feature data may include phase data of the audio data, so as to more accurately identify whether the audio data to be identified is a counterfeit audio; and the context coding layer is used for extracting the associated characteristic data among the frames of the audio data so as to acquire richer characteristic data.
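As a hedged illustration of the two stages of the first submodel, the sketch below runs a pretrained wav2vec 2.0 encoder from torchaudio; the convolutional feature extraction layer runs internally before the Transformer context coding layers. The choice of the WAV2VEC2_BASE bundle and of the last layer's output as the initial feature data are assumptions; the disclosure's own model is trained as described later.

```python
import torch
import torchaudio

# Pretrained wav2vec 2.0 bundle from torchaudio, used here as a stand-in for the
# first speech feature extraction submodel described above.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
wav2vec = bundle.get_model()

waveform = torch.randn(1, int(bundle.sample_rate))  # 1 second of dummy audio at 16 kHz

with torch.inference_mode():
    # extract_features returns the outputs of each Transformer (context coding)
    # layer; the convolutional feature extraction layer runs internally first.
    layer_outputs, _ = wav2vec.extract_features(waveform)

initial_feature_data = layer_outputs[-1]   # (batch, frames, feature_dim)
```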
The second speech feature extraction submodel needs to extract depth feature data, and may include a time-frequency feature extraction network and a feature weight construction network. Optionally, to obtain richer depth feature information, there may be several time-frequency feature extraction networks. As shown in fig. 5, the second speech feature extraction submodel may include a plurality of time-frequency feature extraction networks 501, a time-frequency feature fusion network 502 and a feature weight construction network 503. Each time-frequency feature extraction network obtains, from the initial feature data, time-domain feature data to be fused and frequency-domain feature data to be fused of the audio data to be recognized; it may be a CBAM-ResBlock network, composed of a Convolutional Block Attention Module (CBAM) and a residual network (ResBlock). The number of time-frequency feature extraction networks can be chosen according to actual needs and is not limited by the embodiment of the disclosure; for example, with 6 such networks the richness of the obtained depth feature data can be further improved while keeping voice recognition efficient. The time-frequency feature fusion network, which may be a convolutional neural network, fuses the multiple time-domain feature data into fused time-domain feature data and fuses the multiple frequency-domain feature data into fused frequency-domain feature data. The feature weight construction network, which may be a self-attention network, determines the feature weight data of the fused frequency-domain feature data in the time-sequence dimension.
It should be noted that, among the multiple time-frequency feature extraction networks, the feature data extracted by the lower-level networks contains richer local voice information, while the feature data extracted by the higher-level networks contains richer global voice information; performing counterfeit discrimination on the audio data to be recognized according to both local and global voice information improves the accuracy of the discrimination result. The feature weight construction network uses an attention mechanism to determine how important the fused frequency-domain feature data is at different moments for voice recognition and assigns corresponding weights, so that the voice counterfeit-discrimination submodel can weight the frequency-domain features in light of the time-domain features and determine a more accurate recognition result. The audio data to be recognized is usually audio data spanning a period of time, and the different moments refer to different moments within that period.
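A compressed PyTorch sketch of the second submodel follows: stacked residual convolution blocks with a lightweight channel-attention step stand in for CBAM-ResBlock, 1x1 convolutions stand in for the time-frequency feature fusion network, and nn.MultiheadAttention stands in for the feature weight construction network. All layer sizes, the pooling used to separate time-domain and frequency-domain features, and the use of the attention output as the feature weight data are simplifying assumptions, not the disclosed design.

```python
import torch
import torch.nn as nn

class CBAMResBlock(nn.Module):
    """Residual conv block with a lightweight channel-attention step
    (a simplified stand-in for CBAM-ResBlock)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))
        self.channel_attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, x):
        out = self.conv(x)
        out = out * self.channel_attn(out)   # channel-attention weighting
        return torch.relu(out + x)           # residual connection


class SecondFeatureExtractor(nn.Module):
    """Sketch of: stacked time-frequency extraction blocks -> fusion -> self-attention."""
    def __init__(self, channels: int = 16, num_blocks: int = 6, embed_dim: int = 64):
        super().__init__()
        self.stem = nn.Conv2d(1, channels, 3, padding=1)
        self.blocks = nn.ModuleList([CBAMResBlock(channels) for _ in range(num_blocks)])
        # 1x1 convs standing in for the time-frequency feature fusion network.
        self.fuse_time = nn.Conv1d(channels * num_blocks, embed_dim, 1)
        self.fuse_freq = nn.Conv1d(channels * num_blocks, embed_dim, 1)
        # Self-attention standing in for the feature weight construction network.
        self.weight_attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)

    def forward(self, x):                      # x: (batch, frames, feature_dim)
        x = self.stem(x.unsqueeze(1))          # (batch, channels, frames, feature_dim)
        time_feats, freq_feats = [], []
        for block in self.blocks:
            x = block(x)
            time_feats.append(x.mean(dim=3))   # pool over frequency -> (b, c, frames)
            freq_feats.append(x.mean(dim=2))   # pool over time      -> (b, c, feature_dim)
        fused_time = self.fuse_time(torch.cat(time_feats, dim=1))   # (b, e, frames)
        fused_freq = self.fuse_freq(torch.cat(freq_feats, dim=1))   # (b, e, feature_dim)
        freq_seq = fused_freq.transpose(1, 2)                       # (b, feature_dim, e)
        # Attention output over the fused frequency-domain features; used here as a
        # simplified stand-in for the feature weight data.
        weights, _ = self.weight_attn(freq_seq, freq_seq, freq_seq)
        return fused_time, fused_freq, weights
```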
In an embodiment of the present disclosure, training the speech recognition model may include training the first speech feature extraction submodel, and training the second speech feature extraction submodel together with the voice counterfeit identification submodel. These two training processes may be performed simultaneously or separately; when the hardware resources of the server are limited, each submodel may be trained separately to obtain the speech recognition model.
In an optional implementation manner, the embodiment of the present disclosure explains the training of the speech recognition model by describing, separately, the training of the first speech feature extraction submodel and the joint training of the second speech feature extraction submodel and the voice counterfeit identification submodel.
When the first speech feature extraction submodel is the wav2vec model shown in fig. 4, the training process may be as shown in fig. 6 and includes steps S601 to S604:
Step S601, obtaining sample audio data of a plurality of target persons;
In the embodiment of the present disclosure, in order to improve the generalization capability of the speech recognition model, the speech recognition model to be trained may be trained with sample audio data of a plurality of different target persons. The sample audio data may include both real sample audio data and forged sample audio data of each target person. The server may pre-store the sample audio data of the different target persons in a data storage device, which may be, for example, a magnetic disk.
In an alternative embodiment, the server may retrieve sample audio data for a plurality of target persons in the data storage device in response to a load operation on the sample audio data.
Step S602, inputting sample audio data into a first voice feature extraction submodel to be trained to obtain sample initial feature data;
In the disclosed embodiment, the sample audio data of each target person may include a plurality of samples, and each piece of sample audio data may include a plurality of frames.
In an alternative embodiment, the process in which the server inputs the sample audio data into the first speech feature extraction submodel to be trained to obtain the sample initial feature data may include: inputting the sample audio data into the first speech feature extraction submodel to be trained, so that the feature extraction layer to be trained obtains sample shallow feature data of each frame of the sample audio data, the sample shallow feature data including the phase data of each frame; and inputting the plurality of sample shallow feature data into the context coding layer to extract the associated feature data among the frames of the sample audio data, thereby obtaining the sample initial feature data of the sample audio data.
Step S603, determining a third loss function value according to the sample initial feature data;
In the embodiment of the present disclosure, since the first speech feature extraction submodel is the wav2vec model, the third loss function may be composed of two parts, a contrastive loss function (Contrastive Loss) and a diversity loss function (Diversity Loss); the sum of the contrastive loss function and the diversity loss function may be taken as the third loss function.
In an alternative embodiment, the process in which the server determines the third loss function value from the sample initial feature data may include: determining the third loss function value according to the sample initial feature data and the third loss function.
Step S604, updating the parameter of the first speech feature extraction submodel according to the third loss function value.
In this step S604, the process of updating the parameters of the first speech feature extraction submodel according to the third loss function value may include: if the third loss function value is smaller than a preset first loss function threshold, determining that the training of the first speech feature extraction submodel is completed; if the third loss function value is greater than or equal to the preset first loss function threshold, determining a first gradient value of the third loss function with respect to the first speech feature extraction submodel, determining a first product of the model learning rate and the first gradient value, and determining the difference between the parameters of the first speech feature extraction submodel and the first product to obtain the updated parameters of the first speech feature extraction submodel; steps S602 to S604 are then repeated until the training of the first speech feature extraction submodel is completed. The first loss function threshold and the model learning rate may be determined in advance based on actual needs, which is not limited in the embodiments of the present disclosure.
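A minimal sketch of the update rule in step S604 is given below, assuming a generic `model` and `loss_fn`; the threshold, learning rate and the way batches are produced are placeholders rather than values from the embodiment. It only illustrates the plain rule "parameter ← parameter − learning rate × gradient" described above.

```python
import torch

def train_first_submodel(model, loss_fn, batches, lr=1e-4, loss_threshold=0.1):
    """Repeat steps S602-S604 until the loss drops below the threshold."""
    for batch in batches:
        features = model(batch)              # sample initial feature data (step S602)
        loss = loss_fn(features)             # contrastive + diversity loss (step S603)
        if loss.item() < loss_threshold:
            break                            # first submodel considered trained
        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p in model.parameters():
                if p.grad is not None:
                    p -= lr * p.grad         # parameter <- parameter - lr * gradient (step S604)
```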
The model structure of the second speech feature extraction submodel is shown in fig. 5, and the process of training the speech recognition model may be shown in fig. 7, where the process includes steps S701 to S705:
Step S701, acquiring sample initial feature data corresponding to the sample audio data of a plurality of target persons;
In this embodiment of the present disclosure, the sample initial feature data of the plurality of target persons may be stored in the server in advance; this data is determined through the foregoing steps S601 to S602 and is not described again here.
In this step S701, the server may obtain the stored sample initial feature data corresponding to the sample audio data of the plurality of target persons in response to a loading operation on the sample initial feature data.
Step S702, inputting the initial characteristic data of the sample into a second voice characteristic extraction submodel to be trained to obtain the depth characteristic data of the sample of the audio data;
in an alternative embodiment, the server inputs the sample initial feature data into the second speech feature extraction submodel to be trained, and the process of obtaining the sample depth feature data of the sample audio data may include: inputting the initial characteristic data of the sample into a plurality of time-frequency characteristic extraction networks to be trained which are connected in sequence to obtain time-domain characteristic data to be fused and frequency-domain characteristic data to be fused of the sample which are output by each time-frequency characteristic extraction network to be trained; inputting a plurality of samples to-be-fused time domain feature data and a plurality of samples to-be-fused frequency domain feature data into a time-frequency feature fusion network to be trained, fusing the plurality of samples to-be-fused time domain feature data to obtain sample fused time domain feature data, and fusing the plurality of samples to-be-fused frequency domain feature data to obtain sample fused frequency domain feature data; inputting the sample fusion frequency domain characteristic data into a characteristic weight construction network to be trained, and determining sample characteristic weight data of the sample fusion frequency domain characteristic data in a time sequence dimension; determining the sample fusion time domain feature data as sample time domain feature data, determining the sample fusion frequency domain feature data as sample frequency domain feature data, and determining the sample fusion frequency domain feature data in the time sequence dimension as sample feature weight data of the sample frequency domain feature data to obtain sample depth feature data of the sample audio data.
Step S703, inputting the sample depth characteristic data into a voice identification sub-model to obtain a prediction category classification value;
step S704, determining a first loss function value according to the prediction class classification value and the class label value of the sample audio data;
In the embodiment of the present disclosure, the sample audio data cannot cover all types of forged voice (text information converted into audio data, or audio data of one person converted into audio data of another person); if a binary classification loss function (such as a cross-entropy loss function) were adopted, the speech recognition model would over-fit. Therefore, based on the idea of one-class learning, only the center of the real voice feature data of the target person is learned, so that the real voice feature data lies close to this center while forged voice is kept at least a certain margin away from it. To achieve this goal, the first loss function in the embodiment of the present disclosure is the OC-Softmax loss:

$$L_{OCS} = \frac{1}{N}\sum_{i=1}^{N}\log\left(1 + e^{\alpha\left(m_{y_i} - \hat{w}_0^{\top}\hat{x}_i\right)(-1)^{y_i}}\right)$$

where N denotes the number of sample audio data, i denotes the i-th sample audio data among the N sample audio data, α denotes a scale factor, $\hat{w}_0$ denotes the center of the real voice feature data, $\hat{x}_i$ denotes the normalized feature embedding of the i-th sample, $\hat{w}_0^{\top}\hat{x}_i$ is the prediction class classification value (the score of the sample against the center), $m_{y_i}$ is the margin for class $y_i$, so that $(m_{y_i} - \hat{w}_0^{\top}\hat{x}_i)$ measures the distance between the center of the real voice feature data and the prediction class classification value, and $y_i$ denotes the class label value of the sample audio data. The scale factor α and the margins $m_0$ and $m_1$ can be determined based on actual needs, which is not limited in the embodiments of the present disclosure.
In an alternative embodiment, the process of the server determining the first loss function value according to the prediction class classification value and the class label value of the sample audio data may include inputting the prediction class classification value and the class label value of the sample audio data into the first loss function, and determining the first loss function value.
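For illustration, a sketch of an OC-Softmax-style loss matching the formula above is given below. The concrete values of alpha and of the margins m_real / m_fake, and the convention that label 0 denotes real speech and label 1 denotes forged speech, are assumptions for the sketch; the center of the real voice feature data is modeled as a trainable parameter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OCSoftmaxLoss(nn.Module):
    """One-class softmax loss: pull real speech toward a learned center, push fakes away."""
    def __init__(self, embed_dim, alpha=20.0, m_real=0.9, m_fake=0.2):
        super().__init__()
        self.alpha = alpha
        self.register_buffer("margins", torch.tensor([m_real, m_fake]))
        self.center = nn.Parameter(torch.randn(embed_dim))    # center of the real voice feature data

    def forward(self, embeddings, labels):
        # embeddings: (N, embed_dim); labels: (N,) with 0 = real speech, 1 = forged speech
        w = F.normalize(self.center, dim=0)
        x = F.normalize(embeddings, dim=1)
        scores = x @ w                                        # score of each sample against the center
        margin = self.margins[labels]
        sign = torch.where(labels == 0,
                           torch.ones_like(scores),
                           -torch.ones_like(scores))          # (-1)^{y_i}
        return torch.log1p(torch.exp(self.alpha * (margin - scores) * sign)).mean()
```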
Step S705, updating the parameters of the second speech feature extraction submodel and the speech identification submodel according to the first loss function value.
In an alternative embodiment, the process in which the server updates the parameters of the second speech feature extraction submodel and the voice counterfeit identification submodel according to the first loss function value may include: if the first loss function value is smaller than a preset second loss function threshold, determining that the training of the second speech feature extraction submodel and the voice counterfeit identification submodel is completed;
if the first loss function value is greater than or equal to the preset second loss function threshold, determining a second gradient value of the first loss function with respect to the voice counterfeit identification submodel, obtaining a second product from the second gradient value and the model learning rate, and determining the difference between the parameters of the voice counterfeit identification submodel and the second product to obtain the updated parameters of the voice counterfeit identification submodel; further determining a third gradient value of the first loss function with respect to the second speech feature extraction submodel, determining a third product from the third gradient value and the model learning rate, and determining the difference between the parameters of the second speech feature extraction submodel and the third product to obtain the updated parameters of the second speech feature extraction submodel; steps S702 to S705 are then repeated until the training of the second speech feature extraction submodel and the voice counterfeit identification submodel is completed. The second loss function threshold and the model learning rate may be determined based on actual needs, which is not limited in the embodiments of the present disclosure.
In an alternative embodiment, the languages in the sample audio data of the target persons may be unbalanced in number; for example, the amount of sample audio data in language A may be significantly greater than that in language B. Training the speech recognition model with such unbalanced multilingual sample audio data may give the model different recognition capabilities for different languages; for example, its accuracy in recognizing speech in language A may be higher than its accuracy for language B. In other words, the recognition capability of the speech recognition model is strongly interfered with by language information, i.e., the model is highly sensitive to language information.
In order to reduce the sensitivity of the speech recognition model to language information and improve the robustness of the model, the speech recognition model may further include a gradient back-propagation layer and a language classification submodel. The speech recognition model shown in fig. 8 adds, to the structure shown in fig. 3, a gradient back-propagation layer 304 and a language classification submodel 305. During model training, the second speech feature extraction submodel and the voice counterfeit identification submodel have two learning objectives: improving the accuracy of identifying whether the speech to be recognized is forged, and reducing the sensitivity of the speech recognition model to language information. The second speech feature extraction submodel and the language classification submodel, on the other hand, aim to increase the sensitivity to language information and classify the language of the audio data accurately. These objectives for the second speech feature extraction submodel therefore form a pair of adversarial learning targets. During training, the loss function associated with the language classification submodel is updated using the gradient back-propagation parameter of the gradient back-propagation layer; when the model parameters of the second speech feature extraction submodel are updated, the gradient of the updated loss function with respect to those parameters is opposite in sign to the gradient of the non-updated loss function, and updating with the reversed gradient causes the second speech feature extraction submodel to stop attending to language information. This yields a speech recognition model with low sensitivity to the language information of the audio data to be recognized. In practical application, the speech recognition model then does not extract language information from the audio data to be recognized, so it is not interfered with by language information, which improves the accuracy with which the model judges whether the audio data to be recognized is forged.
Optionally, when the speech recognition model has the structure shown in fig. 8, the process of training the speech recognition model may be as shown in fig. 9 and includes steps S901 to S908:
Step S901, acquiring sample initial feature data corresponding to the sample audio data of a plurality of target persons;
In an optional implementation manner, the process in which the server acquires the sample initial feature data corresponding to the sample audio data of the plurality of target persons may refer to step S701, and is not described in detail again in this embodiment of the disclosure.
Step S902, inputting the sample initial feature data into the second speech feature extraction submodel to be trained to obtain sample depth feature data of the sample audio data;
In an optional implementation manner, the process in which the server inputs the sample initial feature data into the second speech feature extraction submodel to be trained to obtain the sample depth feature data may refer to step S702, and is not described in detail again in this embodiment of the disclosure.
Step S903, inputting the sample depth feature data into the voice counterfeit identification submodel to obtain a prediction class classification value;
Step S904, determining a first loss function value according to the prediction class classification value and the class label value of the sample audio data;
In an optional implementation manner, the process in which the server determines the first loss function value according to the prediction class classification value and the class label value of the sample audio data may refer to step S704, and is not described in detail again in this embodiment of the disclosure.
Step S905, sequentially inputting the sample depth characteristic data into a gradient reverse transmission layer and a language classification sub-model to obtain a predicted language classification value;
in the embodiment of the present disclosure, in the training process of the speech recognition model, the adjustment process of the model parameters is implemented based on an error back propagation method, in the forward prediction process of sample data, the gradient back transmission layer plays a role of data transmission, in the process of the back adjustment of the model parameters, the gradient back transmission layer may update the loss function and the loss function value associated with the language classification submodel by using the gradient back transmission parameters, and update the model parameters of the second speech feature extraction submodel by using the updated loss function and the updated loss function update value, so that the depth feature data extracted by the trained second speech feature extraction submodel does not include language information, and a speech recognition model with low sensitivity to the language information of the audio data to be recognized is obtained.
In an alternative embodiment, the step of sequentially inputting the sample depth feature data into the gradient back-propagation layer and the language classification submodel to obtain the predicted language classification value may include: and sequentially inputting the sample depth feature data into a gradient reverse transmission layer so that the gradient reverse transmission layer transmits the sample depth feature data to the language classification submodel, and the language classification submodel can determine a predicted language classification value according to the sample depth feature data.
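A gradient-reversal layer of the kind described here can be sketched as a custom autograd function: it is the identity in the forward pass of step S905 and multiplies the incoming gradient by a negative factor in the backward pass. The helper name and the way the factor `lam` is passed in are illustrative only.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; gradient multiplied by -lam in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)                      # pass the depth feature data through unchanged

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None      # reversed (and scaled) gradient; no gradient for lam

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# usage: the language classification submodel only ever sees reversed gradients, e.g.
# lang_logits = language_classifier(grad_reverse(sample_depth_features, lam))
```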
Step S906, determining a second loss function value according to the predicted language classification value and the language label value of the sample audio data;
in the embodiment of the present disclosure, the second loss function is a loss function associated with the language classification submodel, and the type of the second loss function may be determined based on actual needs, which is not limited in the embodiment of the present disclosure, and for example, the second loss function may be a countervailing loss function La(Adversarial loss)。
In an alternative embodiment, the step of determining, by the server, the second loss function value according to the language classification value of the prediction language and the language tag value of the sample audio data may include: and inputting the predicted language classification value and the language label value of the sample audio data into a second loss function to obtain a second loss function value.
Step S907, updating the parameters of the voice identification sub-model according to the first loss function value, and updating the parameters of the language classification sub-model according to the second loss function value;
in an alternative embodiment, the process of updating the parameters of the voice authentication sub-model according to the first loss function value by the server may include: if the first loss function value is smaller than a preset second loss function threshold value, determining that the voice counterfeit identification submodel is completely trained; and if the first loss function value is larger than or equal to a preset second loss function threshold value, determining a second gradient value of the first loss function relative to the voice counterfeit identification submodel, determining a second product of the model learning rate and the second gradient value, and determining a difference value between the parameter of the voice counterfeit identification submodel and the second product to obtain the updated parameter of the voice counterfeit identification submodel. And continuing to execute the steps S902 to S907 until the training of the speech recognition sub-model is completed.
The process of updating the parameters of the language classification submodel by the server according to the second loss function value may include: if the second loss function value is smaller than a preset third loss function threshold value, determining that the language classification sub-model is completely trained; and if the second loss function value is larger than or equal to a preset third loss function threshold value, determining a fourth gradient value of the second loss function relative to the language classification submodel, determining a fourth product of the model learning rate and the fourth gradient value, and determining a difference value between the parameter of the voice counterfeit identification submodel and the fourth product to obtain the updated parameter of the voice counterfeit identification submodel. And continuing to execute the steps S902 to S907 until the training of the voice identification sub-model is completed. The model learning rate and the preset third loss function threshold may be determined based on actual needs, which is not limited in the embodiments of the present disclosure.
Step S908, updating the parameters of the second speech feature extraction submodel according to the first loss function value, the second loss function value, and the gradient back-propagation layer.
In an alternative embodiment, as shown in fig. 10, the process of updating the parameters of the second speech feature extraction submodel by the server according to the first loss function value, the second loss function value and the gradient back-propagation layer may include steps S1001 to S1004:
Step S1001, transmitting the second loss function value to the gradient back-propagation layer, and processing the second loss function value according to the gradient back-propagation parameter to obtain a second loss function update value;
In embodiments of the present disclosure, the gradient back-propagation parameter may be:

$$\lambda = \frac{2}{1 + e^{-\gamma p}} - 1$$

wherein γ is a preset parameter value and p is the ratio of the current iteration number of the model to the total number of model iterations; the total number of model iterations may be determined based on actual needs, which is not limited in the embodiments of the present disclosure.
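The schedule above can be written as a small helper; the default value gamma = 10 is only an assumption for the sketch, not a value fixed by the embodiment.

```python
import math

def grl_lambda(current_iteration: int, total_iterations: int, gamma: float = 10.0) -> float:
    """Gradient back-propagation parameter: 2 / (1 + exp(-gamma * p)) - 1, with p in [0, 1]."""
    p = current_iteration / total_iterations
    return 2.0 / (1.0 + math.exp(-gamma * p)) - 1.0
```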
In an alternative embodiment, the process in which the server processes the second loss function value according to the gradient back-propagation parameter to obtain the second loss function update value may include: determining the product of the gradient back-propagation parameter and the second loss function value to obtain the second loss function update value.
Step S1002, processing the second loss function according to the gradient back propagation parameter to obtain an updated second loss function, and determining a combined loss function according to the updated second loss function and the first loss function;
in an alternative embodiment, the processing the second loss function according to the gradient back-propagation parameter to obtain an updated second loss function may include: and determining the product of the gradient back propagation parameter and the second loss function to obtain an updated second loss function.
The process of determining, by the server, the combined loss function according to the updated second loss function and the updated first loss function may include: determining the sum of the first loss function and the updated second loss function to obtain a combined loss function L ═ LOCS-λLa
Step S1003, determining the gradient value of the combined loss function relative to the parameter of the second voice feature extraction submodel;
step S1004, obtaining a parameter after the second speech feature extraction submodel is updated according to the gradient value, the first loss function value, the second loss function update value, and the model learning rate.
In an alternative embodiment, the obtaining of the updated parameter of the second speech feature extraction submodel according to the gradient value, the first loss function value, the second loss function update value and the model learning rate may include: determining the sum of the first loss function value and the second loss function updating value to obtain a total loss function value, and determining that the second voice feature extraction sub-model is trained completely if the total loss function value is smaller than a fourth loss function threshold value; and if the total loss function value is greater than or equal to the fourth loss function threshold, determining a fifth product of the gradient value and the model learning rate, and determining a difference value between the parameter of the second voice feature extraction submodel and the fifth product to obtain the updated parameter of the second voice feature extraction submodel. And continuing to execute the steps S902 to S908 until the second speech feature extraction submodel is determined to be trained. The gradient value in step S1004 is a gradient value of the combined loss function with respect to the parameter of the second speech feature extraction submodel, and the fourth loss function threshold and the model learning rate may be determined based on actual needs, which is not limited in the embodiment of the present disclosure.
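As a rough sketch of steps S1002 to S1004 (combined loss $L = L_{OCS} - \lambda L_a$ followed by a plain gradient step on the second feature extractor), the snippet below assumes the counterfeit-identification loss and the language-classification loss have already been computed for the current batch; the function name and learning rate are illustrative. Computing the language loss without a gradient-reversal layer and subtracting λ·L_a here has the same effect on the extractor's gradients as passing the features through the reversal layer.

```python
import torch

def update_second_extractor(extractor, loss_ocs, loss_lang, lam, lr=1e-4):
    """One update of the second feature extractor on L = L_OCS - lam * L_a."""
    combined = loss_ocs - lam * loss_lang      # combined loss function
    extractor.zero_grad()
    combined.backward(retain_graph=True)       # gradient of L w.r.t. the extractor parameters
    with torch.no_grad():
        for p in extractor.parameters():
            if p.grad is not None:
                p -= lr * p.grad               # parameter <- parameter - lr * gradient
```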
It should be noted that, in the embodiment of the present disclosure, in the process of training the speech recognition model, after it is determined that each sub-model in the speech recognition model is trained, it may be determined that the speech recognition model is trained completely.
In an optional implementation manner, the first speech feature extraction submodel in the speech recognition model is the wav2vec model, and in step S202, the process in which the server inputs the audio data to be recognized into the first speech feature extraction submodel to obtain the initial feature data may include: inputting the audio data to be recognized into the feature extraction layer to obtain shallow feature data of each frame of the audio data to be recognized; and inputting the plurality of shallow feature data into the context coding layer to extract the associated feature data among the frames of the audio data to be recognized, thereby obtaining the initial feature data of the audio data to be recognized. The audio data to be recognized consists of multiple frames, and the shallow feature data includes the phase data of each frame. Using the first speech feature extraction submodel in this way yields richer feature information and improves the accuracy of the speech recognition result.
In an alternative embodiment, in step S203, the server may input the initial feature data into the second speech feature extraction sub-model, so as to obtain the depth feature data of the audio data to be recognized.
The process of inputting the initial feature data into the second voice feature extraction submodel by the server to obtain the depth feature data of the audio data to be recognized may include: inputting the initial characteristic data into a plurality of time-frequency characteristic extraction networks which are connected in sequence to obtain time-domain characteristic data to be fused and frequency-domain characteristic data to be fused which are output by each time-frequency characteristic extraction network; inputting a plurality of time domain feature data to be fused and a plurality of frequency domain feature data to be fused into a time-frequency feature fusion network, fusing the plurality of time domain feature data to obtain fused time domain feature data, and fusing the plurality of frequency domain feature data to obtain fused frequency domain feature data; inputting the fused frequency domain feature data into a feature weight construction network, and determining feature weight data of the fused frequency domain feature data in a time sequence dimension; determining the fused time domain feature data as time domain feature data, determining the fused frequency domain feature data as frequency domain feature data, and determining the feature weight data of the fused frequency domain feature data in a time sequence dimension as the feature weight data of the frequency domain feature data to obtain the depth feature data of the audio data to be identified. The depth feature data of the audio data to be recognized can be acquired, so that the accuracy of the recognition result of the audio data to be recognized determined based on the depth feature data is improved.
In an alternative embodiment, in step S204, the server may input the depth feature data into the speech recognition sub-model, and obtain a classification result of the audio data to be recognized.
In the embodiment of the present disclosure, the classification result of the audio data to be recognized may indicate either that the audio data to be recognized is real audio data of the target person, or that it is forged audio data of the target person; the forged audio data may be obtained by converting text information into audio data of the target person, or by converting audio data of another person into audio data of the target person.
In an alternative embodiment, the process in which the server inputs the depth feature data into the voice counterfeit identification submodel to obtain the classification result of the audio data to be recognized may include: inputting the depth feature data into the voice counterfeit identification submodel to obtain a classification result value; if the classification result value is greater than or equal to a preset threshold, determining that the audio data to be recognized is a forged voice of the target person, the forged voice including a synthesized voice and/or a converted voice; and if the classification result value is smaller than the preset threshold, determining that the audio data to be recognized is the real voice of the target person. The preset threshold may be determined based on actual needs, which is not limited in the embodiments of the present disclosure. A synthesized voice is a forged voice obtained by converting non-audio information (for example, text information, or text in a picture) into audio data of the target person; such a forgery may mislead the public into believing that the target person personally said the corresponding content as a genuine expression of intent. A converted voice is a forged voice obtained by converting the audio data of some other person into audio data of the target person; such a forgery allows a lawbreaker to dress up another person's speech content in the tone, style and pronunciation of the target person, so as to impersonate the target person's speech.
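Putting steps S202 to S204 together, inference can be sketched as follows; the submodel objects, the way the classifier consumes the depth features and returns a single scalar, and the threshold value of 0.5 are assumptions of the sketch rather than details fixed by the embodiment.

```python
import torch

@torch.no_grad()
def is_forged_voice(audio, first_submodel, second_submodel, classifier, threshold=0.5):
    initial = first_submodel(audio)                 # initial feature data (incl. phase information)
    depth, weights = second_submodel(initial)       # time/frequency features + feature weights
    score = classifier(depth, weights)              # classification result value (scalar)
    return bool(score.item() >= threshold)          # True -> forged voice of the target person
```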
Exemplary devices
Having described the method of the exemplary embodiment of the present disclosure, the apparatus of the exemplary embodiment of the present disclosure will next be described with reference to fig. 11.
An embodiment of the present disclosure provides a speech recognition apparatus, and as shown in fig. 11, a speech recognition apparatus 1100 includes:
an obtaining module 1101 configured to obtain audio data to be identified of a target person;
the first feature extraction module 1102 is configured to input the audio data to be recognized into a first voice feature extraction submodel to obtain initial feature data of the audio data to be recognized, wherein the initial feature data comprises phase data of the audio data to be recognized, the first voice feature extraction submodel is a submodel of a pre-trained voice recognition model, and the voice recognition model further comprises a second voice feature extraction submodel and a voice recognition submodel;
a second feature extraction module 1103, configured to input the initial feature data into the second speech feature extraction submodel, to obtain depth feature data of the audio data to be recognized, where the depth feature data includes time domain feature data, frequency domain feature data, and feature weight data of the frequency domain feature data of the audio data to be recognized;
and the false distinguishing module 1104 is configured to input the depth feature data into the voice false distinguishing submodel to obtain a classification result of the audio data to be recognized.
To sum up, the voice recognition apparatus provided by the embodiment of the present disclosure may acquire the audio data to be recognized of a target person, input the audio data to be recognized into the first speech feature extraction submodel to obtain initial feature data, input the initial feature data into the second speech feature extraction submodel to obtain depth feature data, and input the depth feature data into the voice counterfeit identification submodel to obtain the classification result of the audio data to be recognized. Because forged voice and real voice differ greatly in phase information and time-frequency feature information, the speech recognition model can classify whether the audio data to be recognized is forged based on initial feature data that contains the phase data of the audio data to be recognized and on depth feature data that contains its time-domain feature data, frequency-domain feature data and the feature weight data of the frequency-domain feature data, which improves the accuracy of the obtained classification result.
Optionally, the second speech feature extraction sub-model includes a plurality of time-frequency feature extraction networks, a time-frequency feature fusion network, and a feature weight construction network, and the second feature extraction module 1103 is configured to:
inputting the initial characteristic data into a plurality of time-frequency characteristic extraction networks which are connected in sequence to obtain time-domain characteristic data to be fused and frequency-domain characteristic data to be fused which are output by each time-frequency characteristic extraction network;
inputting a plurality of time domain feature data to be fused and a plurality of frequency domain feature data to be fused into a time-frequency feature fusion network, fusing the plurality of time domain feature data to obtain fused time domain feature data, and fusing the plurality of frequency domain feature data to obtain fused frequency domain feature data;
inputting the fused frequency domain feature data into a feature weight construction network, and determining feature weight data of the fused frequency domain feature data in a time sequence dimension;
determining the fused time domain feature data as time domain feature data, determining the fused frequency domain feature data as frequency domain feature data, and determining the feature weight data of the fused frequency domain feature data in a time sequence dimension as the feature weight data of the frequency domain feature data to obtain the depth feature data of the audio data to be identified.
Optionally, the audio data to be recognized is multiple frames of audio data to be recognized, the first speech feature extraction sub-model is a wav2vec model, the wav2vec model includes a feature extraction layer and a context coding layer, and the first feature extraction module 1102 is configured to:
inputting the audio data to be identified into a feature extraction layer to obtain shallow feature data of each frame of audio data to be identified, wherein the shallow feature data comprises phase data of each frame of audio data to be identified;
and inputting a plurality of shallow layer feature data into a context coding layer, and extracting associated feature data among all frames of the audio data to be identified to obtain initial feature data of the audio data to be identified.
Optionally, the authentication module 1104 is configured to:
inputting the depth characteristic data into a voice counterfeit distinguishing submodel to obtain a classification result value;
if the classification result value is larger than or equal to a preset threshold value, determining that the audio data to be recognized is a forged voice of the target person, wherein the forged voice comprises a synthesized voice and/or a converted voice;
and if the classification result value is smaller than the preset threshold value, determining that the audio data to be recognized is the real voice of the target person.
Optionally, as shown in fig. 11, the speech recognition apparatus 1100 further includes a first model training module 1105 configured to:
acquiring sample initial characteristic data corresponding to sample audio data of a plurality of target persons;
inputting the initial characteristic data of the sample into a second voice characteristic extraction submodel to be trained to obtain sample depth characteristic data of the audio data of the sample;
inputting the sample depth characteristic data into a voice discrimination sub-model to obtain a prediction category classification value;
determining a first loss function value according to the prediction class classification value and the class label value of the sample audio data;
and updating parameters of the second voice feature extraction submodel and the voice identification submodel according to the first loss function value.
Optionally, the first model training module 1105 is configured to:
inputting the initial characteristic data of the sample into a plurality of time-frequency characteristic extraction networks to be trained which are connected in sequence to obtain time-domain characteristic data to be fused and frequency-domain characteristic data to be fused of the sample which are output by each time-frequency characteristic extraction network to be trained;
inputting a plurality of samples to-be-fused time domain feature data and a plurality of samples to-be-fused frequency domain feature data into a time-frequency feature fusion network to be trained, fusing the plurality of samples to-be-fused time domain feature data to obtain sample fused time domain feature data, and fusing the plurality of samples to-be-fused frequency domain feature data to obtain sample fused frequency domain feature data;
inputting the sample fusion frequency domain characteristic data into a characteristic weight construction network to be trained, and determining sample characteristic weight data of the sample fusion frequency domain characteristic data in a time sequence dimension;
determining the sample fused time domain feature data as the sample time domain feature data, determining the sample fused frequency domain feature data as the sample frequency domain feature data, and determining the sample feature weight data of the sample fused frequency domain feature data in the time sequence dimension as the sample feature weight data of the sample frequency domain feature data, to obtain the sample depth feature data of the sample audio data.
Optionally, the first model training module 1105 is configured to:
inputting the prediction class classification value and the class label value of the sample audio data into a first loss function, and determining a first loss function value;
the first loss function includes:
$$L_{OCS} = \frac{1}{N}\sum_{i=1}^{N}\log\left(1 + e^{\alpha\left(m_{y_i} - \hat{w}_0^{\top}\hat{x}_i\right)(-1)^{y_i}}\right)$$

where N denotes the number of sample audio data, i denotes the i-th sample audio data among the N sample audio data, α denotes a scale factor, $\hat{w}_0$ denotes the center of the real voice feature data, $\hat{x}_i$ denotes the normalized feature embedding whose score $\hat{w}_0^{\top}\hat{x}_i$ is the prediction class classification value, $(m_{y_i} - \hat{w}_0^{\top}\hat{x}_i)$ represents the distance between the center of the real voice feature data and the prediction class classification value, and $y_i$ represents the class label value of the sample audio data.
Optionally, the speech recognition model further includes a language classification sub-model and a gradient back-propagation layer, and the first model training module 1105 is further configured to:
sequentially inputting the sample depth characteristic data into a gradient reverse transmission layer and a language classification sub-model to obtain a predicted language classification value;
determining a second loss function value according to the predicted language classification value and the language label value of the sample audio data;
a first model training module configured to:
updating parameters of the voice identification submodel according to the first loss function value, and updating parameters of the language classification submodel according to the second loss function value;
and updating the parameters of the second voice feature extraction submodel according to the first loss function value, the second loss function value and the gradient back transmission layer.
Optionally, the first model training module 1105 is configured to:
transmitting the second loss function value to a gradient reverse transmission layer, and processing the second loss function value according to the gradient reverse transmission parameter to obtain a second loss function updating value;
processing the second loss function according to the gradient back propagation parameter to obtain an updated second loss function, and determining a combined loss function according to the first loss function and the updated second loss function;
determining gradient values of the combined loss function with respect to parameters of the second speech feature extraction submodel;
and obtaining the updated parameter of the second voice feature extraction submodel according to the gradient value, the first loss function value, the second loss function updating value and the model learning rate.
Optionally, as shown in fig. 11, the speech recognition apparatus 1100 further includes a second model training module 1106 configured to:
acquiring sample audio data of a plurality of target persons;
inputting sample audio data into a first voice feature extraction submodel to be trained to obtain sample initial feature data;
determining a third loss function value according to the sample initial characteristic data;
and updating the parameters of the first voice feature extraction submodel according to the third loss function value.
Optionally, as shown in fig. 11, the speech recognition apparatus 1100 further includes an audio recognition module 1107 configured to:
acquiring audio data to be processed;
extracting acoustic features of the audio data to be processed to obtain acoustic feature data of the audio data to be processed;
and if the voiceprint recognition result of the acoustic feature data corresponds to the voiceprint of the target person, determining the audio data to be processed as the audio data to be recognized of the target person.
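A possible pre-filtering step of this kind is sketched below using MFCC features and cosine similarity against a stored voiceprint of the target person; the use of librosa, the similarity threshold and the simple mean-pooled enrollment representation are all assumptions for illustration, not the voiceprint recognition method of the embodiment.

```python
import numpy as np
import librosa

def is_target_person(audio_path, target_voiceprint, sr=16000, threshold=0.8):
    """Return True if the audio's acoustic features match the target person's voiceprint."""
    waveform, _ = librosa.load(audio_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=20)   # acoustic feature data
    embedding = mfcc.mean(axis=1)                               # crude utterance-level embedding
    cos = np.dot(embedding, target_voiceprint) / (
        np.linalg.norm(embedding) * np.linalg.norm(target_voiceprint) + 1e-8)
    return cos >= threshold
```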
In addition, other specific details of the embodiments of the present disclosure have already been described in detail in the above method embodiments, and are not repeated here.
Exemplary storage Medium
Storage media of exemplary embodiments of the present disclosure are explained below.
In the present exemplary embodiment, the above-described method may be implemented by a program product, such as a portable compact disc read only memory (CD-ROM) and including program code, and may be executed on a device, such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).
Exemplary electronic device
An electronic device of an exemplary embodiment of the present disclosure is explained with reference to fig. 12.
The electronic device 1200 shown in fig. 12 is only an example and should not bring any limitation to the function and the scope of use of the embodiments of the present disclosure.
As shown in fig. 12, electronic device 1200 is embodied in the form of a general-purpose computing device. The components of the electronic device 1200 may include, but are not limited to: at least one processing unit 1210, at least one memory unit 1220, a bus 1230 connecting various system components (including the memory unit 1220 and the processing unit 1210), and a display unit 1240.
Where the memory unit stores program code, the program code may be executed by the processing unit 1210 such that the processing unit 1210 performs the steps according to various exemplary embodiments of the present disclosure described in the above-mentioned "exemplary methods" section of this specification. For example, processing unit 1210 may perform the method steps shown, and the like.
The storage unit 1220 may include volatile storage units such as a random access memory unit (RAM)1221 and/or a cache memory unit 1222, and may further include a read only memory unit (ROM) 1223.
The storage unit 1220 may also include a program/utility 1224 having a set (at least one) of program modules 1225, such program modules 1225 including, but not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
The bus 1230 may include a data bus, an address bus, and a control bus.
The electronic device 1200 may also communicate with one or more external devices 1300 (e.g., keyboard, pointing device, Bluetooth device, etc.) via an input/output (I/O) interface 1250. The electronic device 1200 further comprises a display unit 1240 connected to the input/output (I/O) interface 1250 for displaying. Also, the electronic device 1200 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via the network adapter 1260. As shown, the network adapter 1260 communicates with the other modules of the electronic device 1200 via the bus 1230. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 1200, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
It should be noted that although several modules or sub-modules of the apparatus are mentioned in the above detailed description, this division is only exemplary and not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more of the units/modules described above may be embodied in one unit/module. Conversely, the features and functions of one unit/module described above may be further divided and embodied by multiple units/modules.
Further, while the operations of the disclosed methods are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the spirit and principles of the present disclosure have been described with reference to several particular embodiments, it is to be understood that the present disclosure is not limited to the particular embodiments disclosed, nor does the division into aspects imply that features of those aspects cannot be combined to advantage; this division is merely for convenience of presentation. The disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A method of speech recognition, the method comprising:
acquiring audio data to be identified of a target person;
inputting the audio data to be recognized into a first voice feature extraction submodel to obtain initial feature data of the audio data to be recognized, wherein the initial feature data comprises phase data of the audio data to be recognized, the first voice feature extraction submodel is a submodel of a pre-trained voice recognition model, and the voice recognition model further comprises a second voice feature extraction submodel and a voice recognition submodel;
inputting the initial feature data into the second voice feature extraction submodel to obtain depth feature data of the audio data to be recognized, wherein the depth feature data comprises time domain feature data, frequency domain feature data and feature weight data of the frequency domain feature data of the audio data to be recognized;
and inputting the depth characteristic data into the voice counterfeit distinguishing sub-model to obtain a classification result of the audio data to be recognized.
2. The method of claim 1, wherein the second speech feature extraction submodel comprises a plurality of time-frequency feature extraction networks, a time-frequency feature fusion network and a feature weight construction network, and the inputting the initial feature data into the second speech feature extraction submodel to obtain the depth feature data of the audio data to be recognized comprises:
inputting the initial characteristic data into a plurality of time-frequency characteristic extraction networks which are connected in sequence to obtain time-domain characteristic data to be fused and frequency-domain characteristic data to be fused which are output by each time-frequency characteristic extraction network;
inputting a plurality of time domain feature data to be fused and a plurality of frequency domain feature data to be fused into the time-frequency feature fusion network, fusing the time domain feature data to obtain fused time domain feature data, and fusing the frequency domain feature data to obtain fused frequency domain feature data;
inputting the fused frequency domain feature data into the feature weight construction network, and determining feature weight data of the fused frequency domain feature data in a time sequence dimension;
determining the fused time domain feature data as the time domain feature data, determining the fused frequency domain feature data as the frequency domain feature data, and determining the feature weight data of the fused frequency domain feature data in a time sequence dimension as the feature weight data of the frequency domain feature data to obtain the depth feature data of the audio data to be identified.
3. The method according to claim 1, wherein the audio data to be recognized is a plurality of frames of audio data to be recognized, the first speech feature extraction submodel is a wav2vec model, the wav2vec model includes a feature extraction layer and a context coding layer, and the inputting the audio data to be recognized into the first speech feature extraction submodel to obtain initial feature data of the audio data to be recognized comprises:
inputting the audio data to be identified into the feature extraction layer to obtain shallow feature data of each frame of the audio data to be identified, wherein the shallow feature data comprises phase data of each frame of the audio data to be identified;
and inputting a plurality of shallow feature data into the context coding layer, and extracting associated feature data among all frames of the audio data to be recognized to obtain initial feature data of the audio data to be recognized.
4. The method as claimed in claim 1, wherein the inputting the depth feature data into the voice authentication submodel to obtain the classification result of the audio data to be recognized comprises:
inputting the depth characteristic data into the voice counterfeit distinguishing sub-model to obtain a classification result value;
if the classification result value is larger than or equal to a preset threshold value, determining that the audio data to be recognized is a forged voice of the target person, wherein the forged voice comprises a synthesized voice and/or a converted voice;
and if the classification result value is smaller than a preset threshold value, determining that the audio data to be recognized is the real voice of the target person.
5. The method of claim 1, wherein training the speech recognition model comprises:
acquiring sample initial characteristic data corresponding to sample audio data of a plurality of target persons;
inputting the sample initial characteristic data into the second voice characteristic extraction submodel to be trained to obtain sample depth characteristic data of the sample audio data;
inputting the sample depth characteristic data into the voice identification sub-model to obtain a prediction class classification value;
determining a first loss function value from the prediction class classification value and a class label value of the sample audio data;
and updating the parameters of the second voice feature extraction submodel and the voice identification submodel according to the first loss function value.
6. The method of claim 5, wherein the voice recognition model further comprises a language classification submodel and a gradient reversal layer, and after the sample initial feature data is input into the second voice feature extraction submodel to be trained to obtain the sample depth feature data of the sample audio data, the method further comprises:
inputting the sample depth feature data into the gradient reversal layer and the language classification submodel in sequence to obtain a predicted language class value;
determining a second loss function value according to the predicted language class value and a language label value of the sample audio data;
wherein the updating the parameters of the second voice feature extraction submodel and the voice anti-spoofing submodel according to the first loss function value comprises:
updating parameters of the voice anti-spoofing submodel according to the first loss function value, and updating parameters of the language classification submodel according to the second loss function value;
and updating parameters of the second voice feature extraction submodel according to the first loss function value, the second loss function value and the gradient reversal layer.
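The gradient reversal layer of claim 6 follows the standard domain-adversarial pattern: identity in the forward pass, sign-flipped (and optionally scaled) gradient in the backward pass, so the second voice feature extraction submodel is pushed toward language-invariant features while the language classification submodel is trained normally. A minimal PyTorch sketch; the scaling factor lam is an assumption.

```python
# Gradient reversal layer sketch; lam is an assumed scaling factor.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam: float = 1.0):
        ctx.lam = lam
        return x.view_as(x)              # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Flip and scale the gradient flowing back into the feature extractor.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam: float = 1.0):
    return GradReverse.apply(x, lam)
```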
7. The method of claim 1, wherein prior to acquiring the audio data to be recognized of the target person, the method further comprises:
acquiring audio data to be processed;
extracting acoustic features of the audio data to be processed to obtain acoustic feature data of the audio data to be processed;
and if the voiceprint recognition result of the acoustic feature data corresponds to the voiceprint of the target person, determining the audio data to be processed as the audio data to be recognized of the target person.
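Claim 7 gates the anti-spoofing step on a voiceprint match. One common realization, shown only as an illustrative assumption, compares a speaker embedding of the audio to be processed against an enrolled embedding of the target person by cosine similarity; the embedding vectors and the 0.7 threshold are hypothetical.

```python
# Hypothetical voiceprint gate; embeddings are assumed 1-D speaker vectors.
import torch
import torch.nn.functional as F

def is_target_speaker(utterance_emb: torch.Tensor,
                      enrolled_emb: torch.Tensor,
                      threshold: float = 0.7) -> bool:
    # Cosine similarity between the utterance embedding and the enrolled voiceprint.
    similarity = F.cosine_similarity(utterance_emb, enrolled_emb, dim=-1)
    return bool(similarity >= threshold)
```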
8. A speech recognition apparatus, characterized in that the apparatus comprises:
the acquisition module is configured to acquire audio data to be recognized of a target person;
the first feature extraction module is configured to input the audio data to be recognized into a first voice feature extraction submodel to obtain initial feature data of the audio data to be recognized, wherein the initial feature data comprises phase data of the audio data to be recognized, the first voice feature extraction submodel is a submodel of a pre-trained voice recognition model, and the voice recognition model further comprises a second voice feature extraction submodel and a voice anti-spoofing submodel;
a second feature extraction module configured to input the initial feature data into the second voice feature extraction submodel to obtain depth feature data of the audio data to be recognized, where the depth feature data includes time domain feature data and frequency domain feature data of the audio data to be recognized, and feature weight data of the frequency domain feature data;
and the anti-spoofing module is configured to input the depth feature data into the voice anti-spoofing submodel to obtain a classification result of the audio data to be recognized.
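For orientation, the four modules of claim 8 chain at inference time roughly as below; all callables are hypothetical stand-ins for the submodels described above, not an implementation disclosed by the application.

```python
# Hypothetical end-to-end wiring of the apparatus modules.
def recognize(audio, first_extractor, second_extractor, detector, threshold=0.5):
    initial = first_extractor(audio)    # initial features, including phase data
    depth = second_extractor(initial)   # time/frequency features plus feature weights
    score = detector(depth)             # voice anti-spoofing submodel output
    return "forged voice" if score >= threshold else "real voice"
```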
9. A computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the method of any one of claims 1 to 7.
10. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any of claims 1 to 7 via execution of the executable instructions.
CN202210331745.0A 2022-03-30 2022-03-30 Voice recognition method, voice recognition device, storage medium and electronic equipment Pending CN114627856A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210331745.0A CN114627856A (en) 2022-03-30 2022-03-30 Voice recognition method, voice recognition device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210331745.0A CN114627856A (en) 2022-03-30 2022-03-30 Voice recognition method, voice recognition device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN114627856A true CN114627856A (en) 2022-06-14

Family

ID=81904682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210331745.0A Pending CN114627856A (en) 2022-03-30 2022-03-30 Voice recognition method, voice recognition device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN114627856A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024012284A1 (en) * 2022-07-13 2024-01-18 北京有竹居网络技术有限公司 Audio recognition method and apparatus, and electronic device and computer program product
CN115497481A (en) * 2022-11-17 2022-12-20 北京远鉴信息技术有限公司 False voice recognition method and device, electronic equipment and storage medium
CN115497481B (en) * 2022-11-17 2023-03-03 北京远鉴信息技术有限公司 False voice recognition method and device, electronic equipment and storage medium
CN116229960A (en) * 2023-03-08 2023-06-06 江苏微锐超算科技有限公司 Robust detection method, system, medium and equipment for deceptive voice
CN116229960B (en) * 2023-03-08 2023-10-31 江苏微锐超算科技有限公司 Robust detection method, system, medium and equipment for deceptive voice
CN116092503A (en) * 2023-04-06 2023-05-09 华侨大学 Fake voice detection method, device, equipment and medium combining time domain and frequency domain
CN116386602A (en) * 2023-05-30 2023-07-04 中国科学院自动化研究所 Training method of feature extraction model and voice identification method integrating pronunciation features

Similar Documents

Publication Publication Date Title
US11664020B2 (en) Speech recognition method and apparatus
US10622002B2 (en) System and method for creating timbres
CN111883110B (en) Acoustic model training method, system, equipment and medium for speech recognition
CN114627856A (en) Voice recognition method, voice recognition device, storage medium and electronic equipment
CN112687259B (en) Speech synthesis method, device and readable storage medium
CN110706690A (en) Speech recognition method and device
CN112735373A (en) Speech synthesis method, apparatus, device and storage medium
CN111081230B (en) Speech recognition method and device
Kelly et al. Deep neural network based forensic automatic speaker recognition in VOCALISE using x-vectors
US20220076674A1 (en) Cross-device voiceprint recognition
CN113327575B (en) Speech synthesis method, device, computer equipment and storage medium
EP3989217B1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
CN111710326A (en) English voice synthesis method and system, electronic equipment and storage medium
WO2023245389A1 (en) Song generation method, apparatus, electronic device, and storage medium
CN115376498A (en) Speech recognition method, model training method, device, medium, and electronic apparatus
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
CN113205793A (en) Audio generation method and device, storage medium and electronic equipment
CN112863476A (en) Method and device for constructing personalized speech synthesis model, method and device for speech synthesis and testing
CN112885326A (en) Method and device for creating personalized speech synthesis model, method and device for synthesizing and testing speech
CN113823300B (en) Voice processing method and device, storage medium and electronic equipment
CN116959417A (en) Method, apparatus, device, medium, and program product for detecting dialog rounds
US11605388B1 (en) Speaker conversion for video games
US11790884B1 (en) Generating speech in the voice of a player of a video game
CN117121099A (en) Adaptive visual speech recognition
CN113948062A (en) Data conversion method and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination