CN114444609B - Data processing method, device, electronic equipment and computer-readable storage medium
- Publication number: CN114444609B (application CN202210118785.7A)
- Authority: CN (China)
- Prior art keywords: data, training, standard, loss value, network model
- Legal status: Active
Classifications
- G06F18/22 — Pattern recognition; Analysing; Matching criteria, e.g. proximity measures
- G06F18/2415 — Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F18/28 — Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries
- G06N3/045 — Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
- G06N3/08 — Neural networks; Learning methods
Abstract
The embodiments of the present application provide a data processing method and apparatus, an electronic device, and a computer-readable storage medium, relating to the technical fields of artificial intelligence, multimedia, games, and cloud technology. The method comprises the following steps: acquiring data to be processed, the data to be processed being data of a first modality; extracting a first data feature of the data to be processed; and matching the first data feature against at least one second data feature in a target database, and determining, from candidate standard data, target standard data matching the data to be processed according to the matching result corresponding to each second data feature, wherein the target database comprises at least one item of candidate standard data and the second data feature of each item of candidate standard data, and the candidate standard data is data of a second modality. Based on the method provided by the embodiments of the present application, matching between data of different modalities can be realized simply and quickly.
Description
Technical Field
The present application relates to the fields of artificial intelligence, multimedia technology, games, and cloud technology, and in particular to a data processing method and apparatus, an electronic device, and a computer-readable storage medium.
Background
With the development and popularization of voice recognition technology, voice recognition now appears in a wide variety of application scenarios. For example, most electronic devices currently ship with an artificial intelligence (AI) voice assistant, which can recognize collected voice data based on voice recognition technology to obtain the corresponding text content and execute corresponding functions based on the recognized text.
In the prior art, voice recognition is mostly realized through a complex voice recognition model: the coding features of the voice data are typically obtained through a voice encoder, and the category of the voice data is then predicted through a classification network.
Disclosure of Invention
The embodiments of the present application provide a data processing method and apparatus, an electronic device, and a computer-readable storage medium, based on which matching between data of different modalities can be realized simply and rapidly. The technical solution provided by the embodiments of the present application is as follows:
In one aspect, an embodiment of the present application provides a data processing method, the method comprising:
acquiring data to be processed, wherein the data to be processed is data of a first modality;
extracting a first data feature of the data to be processed;
matching the first data feature against at least one second data feature in a target database to obtain a matching result corresponding to each second data feature, wherein the target database comprises at least one item of candidate standard data and the second data feature of each item of candidate standard data, and the candidate standard data is data of a second modality; and
determining, from the candidate standard data, target standard data matching the data to be processed according to the matching results corresponding to the second data features.
In another aspect, an embodiment of the present application provides a data processing apparatus, comprising:
a data acquisition module configured to acquire data to be processed, wherein the data to be processed is data of a first modality;
a feature acquisition module configured to extract a first data feature of the data to be processed; and
a data identification module configured to match the first data feature against at least one second data feature in a target database to obtain a matching result corresponding to each second data feature, and to determine, from the candidate standard data, target standard data matching the data to be processed according to the matching result corresponding to each second data feature;
wherein the target database comprises at least one item of candidate standard data and the second data feature of each item of candidate standard data, and the candidate standard data is data of a second modality.
Optionally, the data identification module is further configured to determine the data type of the data to be processed according to the first data feature; accordingly, the data identification module may be configured to:
match the first data feature against at least one second data feature in the target database when the data type of the data to be processed is a specified type.
Optionally, the data of the first modality and the data of the second modality are data of different modalities; the data of the first modality comprises at least one of text, voice, video, or images, and the data of the second modality comprises at least one of text, voice, video, or images.
Optionally, each item of candidate standard data is a standard expression matched with an item of first standard data in a standard database, the first standard data being data of the first modality, and each item of first standard data corresponding to at least one standard expression.
Optionally, the feature acquisition module is further configured to: when newly added first standard data exists in the standard database, obtain at least one standard expression corresponding to the newly added first standard data; extract the second data feature of each standard expression corresponding to the newly added first standard data; and store each standard expression corresponding to the newly added first standard data, in association with its second data feature, into the target database.
Optionally, the first data feature is extracted through a first feature extraction network, and the second data features of the candidate standard data are extracted through a second feature extraction network; the first feature extraction network and the second feature extraction network are obtained by a model training module as follows:
acquiring a training data set, wherein the training data set comprises a first training set, and each first sample in the first training set comprises first data of the first modality and second data of the second modality matched with the first data;
iteratively training an initial neural network model based on the training data set until a total training loss value meets a preset training-end condition, wherein the neural network model comprises a first network model and a second network model; the first network model at the time the training-end condition is met serves as the first feature extraction network, and the second network model at that time serves as the second feature extraction network. The training process comprises:
inputting each item of first data into the first network model to obtain the feature of each item of first data, and inputting each item of second data into the second network model to obtain the feature of each item of second data;
determining a first training loss value based on the degree of matching between the feature of the first data and the feature of the second data in each first sample and the degree of matching between the feature of the first data and the feature of the second data in each first negative example, wherein a first negative example comprises the first data of one first sample and the second data of another first sample; and
if the first training loss value does not meet a first preset condition, adjusting the model parameters of the first network model and the second network model, wherein the total training loss value meeting the preset training-end condition includes the first training loss value meeting the first preset condition.
Optionally, when inputting each item of first data into the first network model to obtain its feature, the model training module is configured to:
for each item of first data, perform the following operations on the first data through the first network model to obtain its feature:
divide the first data into at least two sub-data to obtain a sub-data sequence corresponding to the first data; extract the feature of each sub-datum in the sub-data sequence based on a dictionary; and obtain the feature of the first data based on the features of the sub-data, wherein the dictionary comprises a plurality of data elements, the number of feature values in the feature of each sub-datum equals the number of elements in the dictionary, and each feature value characterizes the probability that the sub-datum contains the data element at the corresponding position in the dictionary (see the sketch after this paragraph).
The model training module is further configured to: for each item of second data, determine, based on the dictionary, a data feature of the second data corresponding to the dictionary, the data feature characterizing the probability that the second data corresponds to each data element in the dictionary.
When determining the first training loss value, the model training module is configured to determine it based on the degree of matching between the feature of each sub-datum of the first data in each first sample and the dictionary-aligned feature of the corresponding second data, the degree of matching between the feature of the first data and the feature of the second data in each first sample, and the degree of matching between the feature of the first data and the feature of the second data in each first negative example.
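A minimal sketch of the dictionary-based sub-data features described above, not part of the patent text: each sub-datum of the first (voice) data is represented as a probability vector over the dictionary elements (e.g., phonemes). The shapes and the random logits are illustrative assumptions only.

```python
import torch

num_subdata, num_phonemes = 20, 60           # sub-data sequence length, dictionary size
logits = torch.randn(num_subdata, num_phonemes)   # stand-in for network outputs
subdata_feats = torch.softmax(logits, dim=-1)     # row i: P(dictionary element | sub-datum i)
print(subdata_feats.sum(dim=-1))                  # each row sums to 1, one probability per element
```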
Optionally, when determining the first training loss value, the model training module may be configured to:
determine the degree of difference between the feature of the first data and the feature of the second data of each first sample, to obtain a first loss value;
for each item of first data, determine a first similarity and a second similarity, wherein the first similarity is the similarity between the feature of the first data and the feature of the second data matched with it, and the second similarity is the similarity between the feature of the first data and the feature of the second data in the first negative example containing the first data;
obtain the reference label corresponding to each item of first data, the reference label comprising a similarity label for the first similarity and a similarity label for the second similarity;
determine a second loss value based on the predicted similarities and reference labels corresponding to the items of first data, wherein the predicted similarities comprise the first similarity and the second similarity, and the second loss value characterizes the difference between the predicted similarities and the reference labels; and
determine the first training loss value based on the first loss value and the second loss value.
Optionally, the candidate standard data is a standard expression of the second modality corresponding to first standard data of a specified type, and the initial neural network model further comprises a classification model. The training data set further comprises a second training set; each second sample in the second training set comprises third data of the first modality and fourth data of the second modality matched with the third data, wherein the third data in the second training set comprises third data of the specified type and third data of non-specified types, and each second sample further comprises a type label for its third data. After obtaining a neural network model whose first training loss value satisfies the first preset condition, the model training module is further configured to perform the following training process:
repeat the following training operation on the neural network model based on the second training set until a second training loss value meets a second preset condition, wherein the total training loss value meeting the preset training-end condition further includes the second training loss value meeting the second preset condition. The training operation comprises:
inputting each item of third data into the first network model to obtain its feature, inputting each item of fourth data into the second network model to obtain its feature, and inputting the feature of each item of third data into the classification model to obtain the predicted type of that third data;
determining the second training loss value based on the degree of matching between the features of the third data and the fourth data in each second sample, the degree of matching between the features of the third data and the fourth data in each second negative example, and the degree of matching between the type label and the predicted type of each item of third data; and
if the second training loss value does not meet the second preset condition, adjusting the model parameters of the neural network model.
Optionally, the data of the first modality is voice, the data of the second modality is text, and the data elements are phonemes.
Optionally, the specified type is instruction-type voice.
In another aspect, an embodiment of the present application further provides an electronic device, where the electronic device includes a memory and a processor, and the memory stores a computer program, and the processor executes the computer program to implement the method provided in any of the alternative embodiments of the present application.
In another aspect, embodiments of the present application also provide a computer readable storage medium having stored therein a computer program which, when executed by a processor, implements the method provided in any of the alternative embodiments of the present application.
In another aspect, embodiments of the present application also provide a computer program product comprising a computer program which, when executed by a processor, implements the method provided in any of the alternative embodiments of the present application.
The technical solution provided by the embodiments of the present application has the following beneficial effects:
The data processing method provided by the embodiments of the present application offers a new approach to data processing. When processing the data to be processed, once the data feature of the data to be processed is acquired, complex and cumbersome feature recognition can be omitted; matching between data of different modalities is realized simply and quickly through feature matching, which greatly reduces the amount of computation and improves data processing efficiency. In addition, since candidate standard data are stored in the target database, the standard data of the second modality determined by the method of the embodiments of the present application as matching the data to be processed of the first modality ensures good recognition accuracy.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required for describing the embodiments are briefly introduced below.
FIG. 1 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a data processing system according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a training process of a neural network model according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of a pre-training stage according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of a fine tuning training phase according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a method for constructing an instruction vector library according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a voice command recognition flow according to an embodiment of the present application;
FIGS. 8a and 8b are schematic illustrations of a user interface provided in an example of the present application;
FIG. 9 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the drawings. It should be understood that the embodiments described below are exemplary descriptions for explaining the technical solutions of the embodiments of the present application, and do not limit those technical solutions.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless expressly stated otherwise. It will be further understood that the terms "comprises" and "comprising", when used in this specification, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present; "connected" or "coupled" as used herein may include wireless connection or coupling. The term "and/or" indicates at least one of the items it joins; for example, "A and/or B" may be implemented as "A", as "B", or as "A and B". When a plurality of (two or more) items is described without their relationship being explicitly defined, the plurality may refer to one, more, or all of the items; for example, "the parameter A includes A1, A2, A3" may be implemented such that A includes A1, A2, or A3, or such that A includes at least two of A1, A2, and A3.
It should be noted that, in alternative embodiments of the present application, user-related data (e.g., voice data corresponding to a user) requires the user's permission or consent when the embodiments are applied to a specific product or technology, and the collection, use, and processing of such data must comply with the relevant laws, regulations, and standards of the relevant countries and regions. That is, if data related to a user is involved in the embodiments of the present application, the data must be obtained with the user's approval and in accordance with the relevant laws, regulations, and standards.
Optionally, the data processing method provided by the embodiments of the present application can be implemented based on artificial intelligence (AI) technology. For example, feature extraction of the data to be processed, of the candidate standard data, and of the data in the training data set can be achieved through a trained neural network model. AI is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. With continued research and progress, artificial intelligence technology is being studied and applied across many fields, and it is expected to be applied in ever more fields with ever greater value.
Optionally, the data processing of the embodiments of the present application may be implemented based on cloud technology; for example, the data computation involved in training the neural network model and in processing the data to be processed may be performed using cloud technology. Cloud technology is a hosting technology that unifies hardware, software, network, and other resources in a wide-area or local-area network to realize the computation, storage, processing, and sharing of data. It is the general term for the network, information, integration, management-platform, and application technologies based on the cloud computing business model; it can form a resource pool that is flexible and convenient to use on demand, with cloud computing as an important support. Cloud computing refers to a delivery and usage mode of IT infrastructure in which required resources are obtained over a network in an on-demand, easily scalable manner; in a broader sense, it refers to a delivery and usage mode of services, in which the required services (IT, software, internet-related, or others) are obtained over a network on demand. With the development of the internet, real-time data streams, the diversification of connected devices, and the growing demand for search services, social networks, mobile commerce, and open collaboration, cloud computing has developed rapidly. Unlike earlier parallel distributed computing, cloud computing will drive revolutionary change in the overall internet model and in enterprise management.
In order to better understand and describe the solutions provided by the embodiments of the present application, some related technical terms are described below.
Classification cross-entropy error: i.e., cross-entropy loss, an objective/loss function in deep learning that measures the discrepancy between the predicted distribution (the prediction output by the neural network) and the true label (i.e., the sample label); it is computed from the sample's predicted result (typically a probability between 0 and 1) and the true label (0 or 1). Assuming the predicted probability that the sample is true is y and the true label is y', the corresponding error L can be expressed as: L = -y'·log(y) - (1-y')·log(1-y). (A code sketch of this error and of the MSE below follows this terminology list.)
Voice-translation text pair: a pairing of voice audio (i.e., a voice signal/voice data) and its corresponding transcribed text (text data).
Voice-instruction pair: a pairing of an instruction voice and a natural-language text expressing the instruction intent; for example, the voice "mark item A" and the text "there is item A" can form a pair, and the voice "mark item A" and the text "mark item A" can also form a pair.
MFCC (Mel-Frequency Cepstral Coefficients) features: an audio feature used in speech processing that can serve as neural network model input.
Phoneme: the basic acoustic unit of speech, divided according to the natural attributes of speech and analyzed according to the articulatory actions within a syllable; one articulatory action forms one phoneme.
CNN (Convolutional Neural Network): a deep learning network structure that captures local information of the input.
Transformer network: a deep learning network structure based on the attention mechanism, applicable to sequence inputs such as text and speech.
CTC (Connectionist Temporal Classification) error: also called CTC loss, an objective function used in deep learning that lets a model learn alignments automatically.
MSE (Mean Squared Error): a loss function in deep learning for computing the distance between two vectors.
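As a minimal sketch (not part of the patent text), the two loss formulas above can be computed as follows, assuming y is a predicted probability, y_true is a 0/1 label, and a, b are two feature vectors of equal length:

```python
import math

def cross_entropy(y: float, y_true: float, eps: float = 1e-12) -> float:
    # L = -y'*log(y) - (1 - y')*log(1 - y), clamped for numerical stability
    y = min(max(y, eps), 1.0 - eps)
    return -y_true * math.log(y) - (1.0 - y_true) * math.log(1.0 - y)

def mse(a: list, b: list) -> float:
    # mean squared distance between two vectors of equal length
    return sum((x - z) ** 2 for x, z in zip(a, b)) / len(a)

print(cross_entropy(0.9, 1.0))        # small loss: confident and correct
print(cross_entropy(0.9, 0.0))        # large loss: confident and wrong
print(mse([0.1, 0.5], [0.0, 0.4]))    # squared distance between two vectors
```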
The technical solutions of various alternative embodiments provided by the present application, and the technical effects they produce, are described below. The following embodiments may be referred to or combined with each other, and identical terms, similar features, and similar implementation steps in different embodiments will not be described repeatedly.
Fig. 1 shows a flowchart of a data processing method according to an embodiment of the present application. The method may be performed by any electronic device, such as a user terminal or a server, or through interaction between a user terminal and a server. For example, the data to be processed may be a user's voice instruction; by executing the method provided by the embodiments of the present application, the user terminal can conveniently and quickly identify the specific content of the voice instruction (the target standard data, i.e., the standard text expression of the user's intent) and execute the corresponding operation according to the recognition result. Alternatively, the method may be executed by a server, which receives the user's voice instruction from a user terminal, identifies its specific content, and executes the corresponding operation. User terminals include, but are not limited to, mobile phones, computers, intelligent voice interaction devices, smart home appliances, vehicle-mounted terminals, wearable electronic devices, and AR/VR devices. The server may be a cloud server or a physical server, and may be a single server or a server cluster.
The method provided by the embodiments of the present application can be applied to any scenario in which data of one modality needs to be identified using matching data of another modality. A modality in the embodiments of the present application refers to the form of the data, i.e., the data pattern as presented to people, or in other words the type of data: for example, voice data is data of one modality and text data is data of another. The method can, for instance, be implemented as a functional module/plug-in of an application program such as a game application: a user can issue a voice instruction while playing, the functional module quickly determines the recognition result (target standard data) of the user's voice instruction, and the game server can execute the corresponding operation according to the recognition result and display the result to the user.
The method provided by the embodiments of the present application is applicable to games with cross-modal data matching requirements, including but not limited to action, adventure, simulation, role-playing, and casual games. For example, in tactical competitive games or racing games, players can collect various game resources (such as virtual game props) on the game map in the virtual game scene through their operations. Optionally, players can collect game resources by issuing voice instructions; based on the solution provided by the embodiments of the present application, the features of a player's voice instruction (i.e., the first data features) are matched against the features in the game database (i.e., the features of the standard text expressions corresponding to standard voice instructions, which are the second data features in the embodiments of the present application), so that the operation corresponding to the player's voice instruction can be completed.
The data processing method provided by the embodiment of the present application is described below with reference to the flowchart shown in fig. 1. As shown in fig. 1, the data processing method provided by the embodiment of the present application may include the following steps S110 to S140.
Step S110: acquiring data to be processed, wherein the data to be processed is data of a first modality.
Step S120: extracting a first data feature of the data to be processed.
The embodiments of the present application do not limit what type of data the data of the first modality specifically is. Optionally, the data of the first modality may include, but is not limited to, at least one of text, voice, video, or images. For example, in some application scenarios the data to be processed may be collected voice data of a user; in others, it may be text data input by the user. As another example, the data to be processed may include both text-type and image-type data.
The first data feature of the data to be processed can be extracted through a trained first feature extraction network. The input of the first feature extraction network can be the data to be processed itself, or data that conforms to the network's input format after the data to be processed has been preprocessed. For example, an initial feature of the data to be processed may be extracted through preprocessing and then input into the feature extraction network, which further extracts a first data feature with better expressive capability.
As an example, the data to be processed may be voice data. A time-frequency transformation may first be applied to obtain audio features of the voice data (such as Mel-spectrum features or MFCC features), which serve as input to the first feature extraction network; the network then produces a high-level feature representation of the voice data (also called a voice vector representation or voice representation vector), i.e., the first data feature in this example. A sketch of this pipeline, under assumed tooling, follows.
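Illustrative only: the patent does not prescribe an architecture or toolkit. This sketch assumes librosa for the time-frequency step and a small PyTorch GRU encoder (the name "FirstFeatureNet" and the audio file are hypothetical) that mean-pools MFCC frames into one voice representation vector.

```python
import librosa
import torch
import torch.nn as nn

class FirstFeatureNet(nn.Module):
    def __init__(self, n_mfcc: int = 13, dim: int = 256):
        super().__init__()
        self.encoder = nn.GRU(n_mfcc, dim, batch_first=True)

    def forward(self, mfcc_frames: torch.Tensor) -> torch.Tensor:
        # mfcc_frames: (batch, num_frames, n_mfcc)
        hidden, _ = self.encoder(mfcc_frames)
        return hidden.mean(dim=1)          # (batch, dim) voice representation vector

waveform, sr = librosa.load("utterance.wav", sr=16000)    # hypothetical input file
mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13)  # (13, num_frames)
frames = torch.tensor(mfcc.T, dtype=torch.float32).unsqueeze(0)
first_data_feature = FirstFeatureNet()(frames)             # (1, 256)
```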
Step S130: matching the first data feature against at least one second data feature in the target database to obtain a matching result corresponding to each second data feature.
Step S140: determining, from the candidate standard data, target standard data matching the data to be processed according to the matching results corresponding to the second data features.
In the embodiments of the present application, the target database comprises at least one item of candidate standard data and the second data feature of each item of candidate standard data, where the candidate standard data is data of a second modality.
Likewise, the embodiments of the present application do not limit the specific form of the data of the second modality, which may include, but is not limited to, at least one of text, voice, video, or images. It will be appreciated that the data of the first modality and the data of the second modality are not data of the same modality; for example, the data of the first modality may be voice data and the data of the second modality text data.
It should be noted that, in the embodiments of the present application, the data of the first modality and the data of the second modality may each include only one type of data, or may include two or more types. When at least one of them includes two types of data, "different modalities" means that at least one data type differs between the two; for example, the data of the first modality may comprise text and images while the data of the second modality is voice data. Taking data to be processed that includes two types of data as an example, its first data feature may be obtained by fusing (e.g., concatenating or adding) the features of the two included types: if the data to be processed includes voice data and text data, the feature of the voice data and the feature of the text data can be extracted separately and fused to obtain the data feature of the data to be processed.
The standard data in the embodiments of the present application (the candidate standard data, and the first standard data below) can be understood as reference data: a standard expression of a piece of information, which can be preconfigured according to application requirements.
In the embodiments of the present application, the second data features of the candidate standard data may likewise be extracted with a trained neural network model; specifically, feature extraction may be performed on each item of candidate standard data through a second feature extraction network to obtain its second data feature. As with the first network, the input may be the candidate standard data itself, or the candidate standard data may be preprocessed and the preprocessed data input into the second feature extraction network. For example, the candidate standard data may be text data: an initial feature representation can be obtained by word embedding (Embedding) or one-hot encoding and input into the second feature extraction network to obtain a high-level feature representation of the text, i.e., the second data feature.
As an alternative, the candidate standard data in the target database may be the standard expressions matched with the items of first standard data in a standard database, where the first standard data is data of the first modality and each item of first standard data corresponds to at least one standard expression.
An item of first standard data and its standard expression can be understood as standard descriptions/representations of one piece of information in two different data forms: for example, if the first standard data is data in voice form and the standard expression is data in text form, they can be understood as the voice expression and text expression of the same information. As one example, if the information is "hello", the first standard data is the voice data of "hello" and the corresponding standard expression is the text content "hello".
As an example, a voice instruction library may be preconfigured in a game application, storing the standard text expressions (which can also be understood as the recognition results of voice instructions) corresponding to the standard voice instructions (the first standard data in this application scenario) supported by the game. A game player can input a voice instruction (the data to be processed) at the client of the game application, and the game server, by executing steps S120 to S140 above, finds the voice recognition result matching the instruction currently input by the user from among the standard text expressions corresponding to the standard voice instructions.
Optionally, the matching result corresponding to a second data feature may be the degree of matching between the first data feature and that second data feature, such as their similarity. After the matching results of the first data feature and all second data features are obtained, the candidate standard data corresponding to the second data feature with the highest matching degree may be taken as the target standard data; alternatively, the candidate standard data corresponding to a set number of the highest matching degrees, or all candidate standard data whose matching degree exceeds a set value, may be taken as target standard data. A sketch of these selection strategies follows.
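A hedged sketch of steps S130/S140: cosine similarity between the first data feature and each stored second data feature, then top-1 / top-k / threshold selection. The function names and the in-memory arrays standing in for the target database are illustrative assumptions.

```python
import numpy as np

def match(query: np.ndarray, db_feats: np.ndarray, db_texts: list,
          top_k: int = 1, min_sim=None):
    # Normalize so the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    d = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
    sims = d @ q                            # matching result per candidate
    order = np.argsort(-sims)[:top_k]       # highest matching degree first
    hits = [(db_texts[i], float(sims[i])) for i in order]
    if min_sim is not None:                 # optional threshold variant
        hits = [h for h in hits if h[1] >= min_sim]
    return hits

# Usage: target_standard = match(first_data_feature, second_feats, expressions)
```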
The data processing method provided by the embodiments of the present application is a new mode of data processing. When processing the data to be processed, no complex recognition of the data features is required; matching between data of different modalities is realized conveniently and quickly through feature matching. Furthermore, since candidate standard data are stored in the target database, the standard data of the second modality determined by the method of the embodiments of the present application as matching the data to be processed of the first modality ensures good recognition accuracy.
In addition, in practical applications, when a corresponding operation needs to be executed based on the data to be processed, the operation can be executed directly based on the matched standard data. Compared with the existing approach of further recognizing the data features to obtain a recognition result and then processing that result into normalized data, no further processing is required, which better meets practical application needs.
In an alternative embodiment of the present application, the data processing method may further include:
when newly added first standard data exists in the standard database, obtaining at least one standard expression corresponding to the newly added first standard data;
extracting the second data feature of each standard expression corresponding to the newly added first standard data; and
storing each standard expression corresponding to the newly added first standard data, in association with its second data feature, into the target database.
When new first standard data appears in the standard database (for example, as a game application is continuously updated and optimized, it gains more functions and can support more voice instructions), the standard expressions corresponding to the new standard data can be obtained, their second data features extracted, and the standard expressions (i.e., the newly added candidate standard data) stored in the target database in association with their second data features, thereby expanding the target database.
If the prior art were used after obtaining the first data feature of the data to be processed (for example, processing the first data feature with a neural network model to obtain the matching data of the second modality), then whenever new first standard data appeared, the neural network model would need to be retrained so that it could recognize data to be processed expressed by the standard expressions of the newly added data; such a scheme is complex and costly. With the solution provided by the embodiments of the present application, when newly added first standard data appears, only the corresponding standard expressions and their second data features need to be added to the target database; target standard data for the data to be processed can then be determined from the candidate standard data by matching the first data feature against the second data features in the updated target database, which is simple to implement and low in cost. A sketch of this incremental update follows.
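An illustrative sketch of expanding the target database without retraining: only the new standard expressions are encoded and stored. Here "encode_text" stands for the trained second feature extraction network and the dict-based database is an assumption, not a structure mandated by the patent.

```python
import numpy as np

def add_standard_expressions(target_db: dict, expressions: list,
                             encode_text) -> None:
    """Encode each new standard expression and store it with its feature."""
    for expr in expressions:
        feature = encode_text(expr)   # second data feature of the expression
        target_db[expr] = feature     # store expression/feature association

# Usage: add_standard_expressions(target_db, ["mark item A"], second_net_encode)
```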
In an alternative embodiment of the present application, the data processing method may further include: determining the data type of the data to be processed according to the first data feature. In this case, matching the first data feature against at least one second data feature in the target database in step S130 may include:
matching the first data feature against at least one second data feature in the target database when the data type of the data to be processed is a specified type.
In some application scenarios, only certain specified types of data may require processing. To meet this application requirement, as an alternative, the data type of the data to be processed may be determined before matching its data feature against the data features of the candidate standard data, and the matching performed only when the data type is the specified type, thereby avoiding unnecessary data processing and saving computing resources. In this alternative, the candidate standard data in the target database may be the standard expressions corresponding to the items of first standard data of the specified type. In practice, there may be one specified type or at least two, configured according to actual requirements.
It should be noted that, when actually implementing this alternative, the data type may be judged first and the subsequent matching performed only when the type is the specified type. Alternatively, both the type judgment and the matching may be executed, and the subsequent processing then determined from the matching result together with the type judgment result. For example, after extracting the first data feature of the data to be processed, its data type may be determined based on that feature (while the first data feature is matched against the second data features in the target database), and the target standard data then determined according to the judged data type and the matching result for each second data feature; it may be determined that no target standard data matches the data to be processed when the data type is not the specified type, or when the matching degree corresponding to each second data feature is smaller than a set value.
In the embodiments of the present application, determining the data type of the data to be processed from its first data feature can also be realized through a neural network model, for example a classification model. Specifically, the first data feature can be input into a trained classification model, which outputs the probability that the data type of the data to be processed belongs to the specified type, and whether the data is of the specified type is judged from that probability. The classification model may be a binary classification model, i.e., its classes are the specified type and the non-specified type; its output may include a first probability and a second probability that the data type belongs to the specified type and the non-specified type respectively, and the judgment is made from these two probabilities. For example, if the first probability is greater than a set probability, the data type of the data to be processed is judged to be the specified type.
As an alternative, there may be at least two specified types, in which case the classification model may be a multi-class model whose classes are the non-specified type and each specified type; if there are two specified types, denoted the first type and the second type, the classification model has three classes. The model predicts the probabilities that the data to be processed belongs to the non-specified type, the first type, and the second type respectively, and the type with the maximum probability is taken as the data type of the data to be processed. Optionally, for this scheme the target database may comprise multiple sub-databases, each corresponding to the standard expressions (data of the second modality) of the first standard data (data of the first modality) of one specified type. The classification model can identify whether the data to be processed is of a specified type and, if so, which one; accordingly, the first data feature only needs to be matched against the second data features of the candidate standard data in the sub-database corresponding to that specified type, rather than against the second data features in every sub-database, further reducing the amount of data processing. A sketch of such a type gate follows.
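A hedged sketch of the optional type gate: a small classifier over the first data feature decides whether matching should run at all (binary case shown). The name "TypeGate", the linear head, and the 0.5 threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TypeGate(nn.Module):
    def __init__(self, dim: int = 256, num_types: int = 2):
        super().__init__()
        self.fc = nn.Linear(dim, num_types)   # class 0: specified, class 1: non-specified

    def forward(self, feature: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.fc(feature), dim=-1)  # type probabilities

# gate = TypeGate()
# probs = gate(first_data_feature)
# Run database matching only if probs[..., 0] > 0.5 (the set probability).
```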
As can be seen from the foregoing description, in the embodiments of the present application, the data features of data of the first modality (such as the data to be processed) may be extracted through the first feature extraction network, and the data features of data of the second modality (such as each item of candidate standard data) through the second feature extraction network, where both networks are obtained by training a neural network model based on a training data set.
In the embodiments of the present application, the neural network model comprises a first network model and a second network model, which can be iteratively trained based on the training data set; the trained first network model serves as the first feature extraction network and the trained second network model serves as the second feature extraction network. The embodiments of the present application do not limit the model structures of the first network model and the second network model, which can be configured according to application requirements; for example, both may adopt CNN-based models. Alternatively, the structure may be chosen according to the form of data the model processes: if the data of the first modality is voice data, the first network model may be a structure that extracts voice features well, such as the Wav2vec model; if the data of the second modality is text data, the second network model may be a structure effective on text data, such as one based on the Transformer network, e.g., a structure based on the BERT (Bidirectional Encoder Representations from Transformers) model.
Optionally, the neural network model comprising the first network model and the second network model in the embodiments of the present application may be trained as follows:
obtaining a training data set, wherein the training data set comprises a first training set and each first sample in the first training set comprises first data of the first modality and second data of the second modality matched with the first data;
iteratively training the initial neural network model based on the training data set until the total training loss value meets the preset training-end condition, taking the first network model at that point as the first feature extraction network and the second network model at that point as the second feature extraction network. The training process may comprise the following steps:
inputting each item of first data into the first network model to obtain its feature, and inputting each item of second data into the second network model to obtain its feature;
determining a first training loss value based on the degree of matching between the features of the first data and the second data in each first sample and the degree of matching between the features of the first data and the second data in each first negative example, where a first negative example comprises the first data of one first sample and the second data of another first sample; and
if the first training loss value does not meet the first preset condition, adjusting the model parameters of the first network model and the second network model, where the total training loss value meeting the preset training-end condition includes the first training loss value meeting the first preset condition.
When training the neural network model, the first data and the second data of each first sample in the first training set are mutually matched data of the two modalities; a first sample may also be called a positive example, i.e., a positive sample, while a first negative example (negative sample) pairs first data and second data from different first samples, i.e., data of the two modalities that do not match. For any item of first data, negative examples can be formed with multiple other items of second data (second data other than that matched with the first data). During training, the training loss value is determined based on the degree of matching between the data features of positive samples and the degree of matching between the data features of negative samples.
The embodiment of the application does not limit the specific form of the loss function selected in the training process; the purpose of model training is to make the similarity between the features of first data and second data that match each other as large as possible, and the similarity between the features of first data and second data that do not match each other as small as possible.
For the positive samples, the degree of difference between the features of the first data learned by the first network model and the features of the second data learned by the second network model (for example, 1 minus the similarity, or the mean square error between the two features) may be calculated to obtain a corresponding training loss. For the negative samples, an alternative way is to calculate the degree of matching between the features of the first data learned by the first network model and the features of the second data learned by the second network model to obtain a corresponding training loss. Through continuous training, the degree of matching between the data features of the positive samples learned by the model becomes higher (i.e., the difference becomes smaller), while that of the negative samples becomes lower. The manner of calculating the degree of matching or the degree of difference differs for different loss functions.
In an optional embodiment of the present application, the determining the first training loss value based on the matching degree of the feature of the first data in each first sample and the feature of the second data and the matching degree of the feature of the first data in each first negative example and the feature of the second data may include:
Determining the degree of difference between the characteristics of the first data and the characteristics of the second data of each first sample to obtain a first loss value;
For each first data, determining a first similarity corresponding to the first data and a second similarity corresponding to the first data, wherein the first similarity is the similarity between the characteristic of the first data and the characteristic of the second data matched with the first data, and the second similarity is the similarity between the characteristic of the first data and the characteristic of the second data in the first negative example where the first data is located;
obtaining reference labels corresponding to the first data, wherein the reference labels comprise similarity labels corresponding to the first similarity and similarity labels corresponding to the second similarity;
Determining a second loss value based on the predicted similarity and the reference label corresponding to each first data, wherein the predicted similarity comprises the first similarity and the second similarity, and the second loss value characterizes the difference between the predicted similarity and the reference label corresponding to each first data;
a first training loss value is determined based on the first loss value and the second loss value.
Alternatively, the first loss value may be the sum of the mean square errors between the features of the first data and the features of the second data in each positive sample, or may be the sum of the degrees of difference corresponding to the positive samples, where the degree of difference for a positive sample is obtained by calculating the similarity between the features of its first data and its second data and subtracting that similarity from 1. The first loss value drives the features of the data of the two modalities in a positive sample learned by the model to be as close as possible.
The second loss value may also be referred to as a matching error, and is used to constrain the similarity between the features of the two data in a positive sample learned by the model to be higher than the similarity between the features of the two data in a negative sample. When calculating this part of the loss, the reference label is the real label during training, that is, the result that the model is expected to learn. Specifically, for each first data, the similarity label corresponding to the first similarity in its real label is the ideal similarity between the first data and the second data matched with it, for example 1 or another high similarity; the similarity label corresponding to a second similarity is the ideal similarity between the first data and second data not matched with it, for example 0 or another small similarity. The reference labels may be pre-configured. Based on the features of the first data and the features of the second data output by the model, the first similarity and each second similarity corresponding to each first data can be calculated and formed into a similarity vector, and the second loss value is obtained by calculating the difference between the similarity vector and the reference label; for example, the similarity vector can be used as the probability distribution predicted by the model and the reference label as the true probability distribution, i.e. the label, and the second loss value is obtained by calculating the cross entropy loss between the two.
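As an illustration of how the first and second loss values can be computed, the following PyTorch-style sketch treats each row pair of two feature batches as a positive sample and the cross-row combinations as first negative examples; the function name and the use of cosine similarity with cross entropy are illustrative assumptions, not the only form the embodiment permits.

```python
import torch
import torch.nn.functional as F

def first_and_second_loss(feat_a: torch.Tensor, feat_b: torch.Tensor):
    """feat_a: features of the first data, shape (N, d);
    feat_b: features of the matched second data, shape (N, d).
    Row i of feat_a and row i of feat_b form a positive pair (first sample);
    cross-row combinations act as first negative examples."""
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)

    # First loss value: degree of difference between matched features,
    # here 1 - cosine similarity per positive pair (MSE would also work).
    first_loss = (1.0 - (a * b).sum(dim=-1)).mean()

    # Similarity vector per first data: one first similarity (the diagonal)
    # plus the second similarities against the other second data.
    sim = a @ b.t()  # (N, N) predicted similarities

    # Reference label: the matched second data should have the highest
    # similarity, i.e. class index i for row i (ideal similarity 1 vs 0).
    target = torch.arange(a.size(0), device=a.device)
    second_loss = F.cross_entropy(sim, target)

    return first_loss, second_loss
```

Here the reference label is implicit in the target index: the ideal similarity is 1 for the matched second data and 0 for all other second data, matching the description above.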
In an optional embodiment of the present application, the inputting each first data into the first network model to obtain the feature of each first data may include:
For each first data, performing the following operations on the first data through a first network model to obtain the characteristics of the first data:
Dividing the first data into at least two sub-data to obtain a sub-data sequence corresponding to the first data; extracting features of each sub-data in the sub-data sequence based on a dictionary, wherein the dictionary comprises a plurality of data elements, the number of feature values included in the features of each sub-data is equal to the number of elements in the dictionary, and one feature value characterizes the probability that the data element at the corresponding position in the dictionary is contained in the sub-data; and obtaining the features of the first data based on the features of each sub-data;
in this aspect, the data processing method may further include:
For each second data, determining, based on the dictionary, a data characteristic of the second data corresponding to the dictionary, the data characteristic characterizing a probability that the second data corresponds to a respective data element in the dictionary;
Accordingly, the determining the first training loss value may include:
The first training loss value is determined based on a degree of matching between the feature of each sub-data of the first data in each first sample and the feature of the second data corresponding to the dictionary, a degree of matching between the feature of the first data in each first sample and the feature of the second data, and a degree of matching between the feature of the first data and the feature of the second data in each first negative example.
It can be seen that in this alternative, the first training loss value further includes a loss (which may be referred to as a third loss value, or third loss part) corresponding to the degree of matching between the features of each sub-data of the first data in each first sample and the data feature of the second data corresponding to the dictionary. This loss maximizes the probability of predicting the second data matched with the first data; that is, the third loss value constrains the first network model so that the second data can be predicted from the features of each sub-data of the first data learned by the model.
Alternatively, the third loss part may employ the CTC error (also called CTC loss), which allows the model to automatically learn the alignment between data of different modalities. In the embodiment of the present application, the data elements in the dictionary are data units that can be used to represent each sub-data of the first data and the second data; the form of the data elements may be configured according to requirements, and optionally may include but is not limited to pinyin or phonemes. Taking phonemes as an example, the data elements in the dictionary include each phoneme and a blank (a placeholder symbol in the CTC loss, inserted between data elements to represent "no output"). For each sub-data of the first data, the dimension of its feature (i.e., the length of the feature vector) equals the number of data elements in the dictionary, the positions of the data elements in the dictionary are fixed, and each feature value characterizes the probability that the data element at the corresponding position is contained in the sub-data. As a schematic illustration, assuming there are three elements a, b and c in the dictionary, the feature of one sub-data has length 3 and can be expressed as (p1, p2, p3), where p1, p2 and p3 respectively represent the probability that a (at the first position), b (at the second position) and c (at the third position) occur in the sub-data. For the second data, the data feature corresponding to the dictionary characterizes the probability that the second data corresponds to each data element in the dictionary. When calculating the third loss value corresponding to each positive sample, the probability of obtaining the data feature of the second data can be determined from the feature sequence of the sub-data of the first data (i.e., the feature vectors formed by the feature values), and the constraint of the third loss value maximizes this probability, so that the features of each sub-data of the first data learned by the first network model contain the semantic information of the second data.
Alternatively, the first data may be speech data, the second data may be text data, and the data elements in the dictionary may be phonemes. The data feature of the second data corresponding to the dictionary may be the phoneme sequence corresponding to the second data, that is, the sequence of phonemes composing the second data. When performing feature extraction on the speech data, features may first be extracted for each speech frame (i.e., sub-data) based on the dictionary to obtain a feature representation for each frame. When calculating the CTC loss, the phoneme sequence corresponding to the text data is used as the label, and the CTC loss is calculated from the feature representations of the speech frames and the label; the value of this loss reflects the probability of predicting the phoneme sequence of the text data from the feature representations of the speech frames, and the larger the probability, the smaller the loss.
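As a minimal sketch of this third loss part, assuming PyTorch and a phoneme dictionary whose position 0 is the blank symbol, the CTC loss can be computed from the frame-level features and the phoneme-sequence label as follows (sizes and names are illustrative assumptions):

```python
import torch

# Hypothetical dictionary size: 60 phonemes plus the blank symbol at index 0.
NUM_PHONEMES = 60
ctc_loss = torch.nn.CTCLoss(blank=0, zero_infinity=True)

def third_loss_value(frame_logits, phoneme_targets, frame_lens, target_lens):
    """frame_logits: (T, N, NUM_PHONEMES + 1) per-frame scores over the
    dictionary, one value per data element per sub-data (speech frame).
    phoneme_targets: concatenated phoneme index sequences of the text data,
    with frame_lens / target_lens giving per-sample lengths."""
    log_probs = frame_logits.log_softmax(dim=-1)
    # CTC sums over all legal alignment paths that collapse to the target
    # phoneme sequence; minimizing it maximizes the probability of the text.
    return ctc_loss(log_probs, phoneme_targets, frame_lens, target_lens)
```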
In an optional embodiment of the present application, the candidate standard data may be a standard expression of the second modality corresponding to first standard data of a specified type, and the initial neural network model further comprises a classification model. In this case, the training data set further comprises a second training set; each second sample in the second training set comprises third data of the first modality, fourth data of the second modality matched with the third data, and a type tag of the third data, wherein the third data in the second training set includes third data of the specified type and third data of non-specified types. After obtaining the neural network model whose first training loss value satisfies the first preset condition, the training process of the model may further include:
repeating the training operation on the neural network model based on the second training set until the second training loss value meets a second preset condition, wherein the training total loss value meets a preset training ending condition further comprises that the second training loss value meets the second preset condition; the training operation may include:
inputting each third data into the first network model to obtain the characteristics of each third data, inputting each fourth data into the second network model to obtain the characteristics of each fourth data, and inputting the characteristics of each third data into the classification model to obtain the prediction type corresponding to each third data;
Determining a second training loss value based on the degree of matching of the features of the third data and the features of the fourth data in each second sample, the degree of matching of the features of the third data and the features of the fourth data in each second negative example, and the degree of matching between the type label and the predicted type of each third data;
and if the second training loss value does not meet the second preset condition, adjusting the model parameters of the neural network model.
As can be seen from the foregoing description, in some application scenarios it is necessary to determine the data type of the data to be processed, and further processing is performed only when the data type is the specified type. To meet this requirement, in this alternative embodiment the neural network model may further include, in addition to the first network model and the second network model, a classification model cascaded with the first network model and configured to determine the type of the data input to the first network model according to the features output by the first network model. In this alternative embodiment, the foregoing process of training the neural network model based on the first training set may be referred to as pre-training, through which a first network model and a second network model that substantially meet the application requirement are obtained; after the neural network model meeting the first preset condition is obtained through pre-training (for convenience of description, this model is referred to as the intermediate model), fine-tuning training may be performed on the intermediate model based on the second training set, so as to obtain a model that better meets the specific task requirement.
In the fine-tuning training process, one part of the training loss (the matching loss) may be calculated according to the degree of matching of the features of the third data and the features of the fourth data in each second sample and in each second negative example, and another part (the classification loss) may be calculated according to the type label and the predicted type of each third data; the training of the model is then constrained based on both parts of the training loss. The manner of calculating the loss from the degree of matching of the features of the third data and the fourth data in each second sample and each second negative example may be the manner of calculating the matching loss (i.e., the second loss value) described above, or may be the manner of calculating both the first loss value and the second loss value described above.
For the classification loss, this part of the loss value represents the difference between the type of the third data predicted by the classification model and the real type of the third data, i.e., its type label. Alternatively, the type label of the third data may be 1 or 0, where 1 represents that the third data is of the specified type and 0 represents that it is not. The output of the classification model may comprise a first probability that the third data is of the specified type and a second probability that it is not; the training loss part corresponding to the classification model may then be calculated from the type label of each third data and the two probabilities output by the classification model. Alternatively, this loss part may be calculated using the two-class cross entropy error, where a smaller error value represents that the predicted type is closer to the real type.
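A minimal sketch of this classification loss, assuming the classification model outputs two logits per third data (one per class) and PyTorch's cross entropy serves as the two-class cross entropy error (names are illustrative):

```python
import torch
import torch.nn.functional as F

def classification_loss(class_logits: torch.Tensor,
                        type_labels: torch.Tensor) -> torch.Tensor:
    """class_logits: (N, 2) classifier outputs, corresponding to the
    probabilities that each third data is / is not of the specified type.
    type_labels: (N,) long tensor with 1 for specified-type data, else 0."""
    # Two-class cross entropy; smaller values mean the predicted type
    # distribution is closer to the real type label.
    return F.cross_entropy(class_logits, type_labels)
```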
After the neural network model meeting the training ending condition is obtained through this alternative embodiment, the data type of the data to be processed can be identified through the trained classification model when the model is applied. Specifically, the data to be processed can be input into the trained first network model (namely the first feature extraction network) to obtain the first data feature of the data to be processed; the first data feature is input into the trained classification model to obtain the first probability that the data to be processed belongs to the specified type and the second probability that it does not, and whether the data to be processed is of the specified type can be determined according to these two probabilities. For example, if the data to be processed is voice data and the specified type is the instruction type, that is, voice data of the specified type is instruction-type voice, and the classification model determines that the data to be processed is instruction-type voice data, then the features of the data to be processed can be matched with the features of candidate standard data (such as text) in the target database, so that the text expression matched with the voice data can be found and the recognition result of the voice data understood.
The data processing scheme provided by the embodiment of the application provides a data matching method based on cross-modal retrieval, which can quickly find, from the target database, the data of another modality that matches the data to be processed, namely the target standard data, using only the data features of the data to be processed, without further deep recognition of those features. Compared with the prior art, the scheme of the application can effectively reduce the amount of data calculation, and the accuracy is obviously improved.
The method provided by the embodiment of the application is suitable for any scenario requiring processing across modalities. For example, it can be applied to the instruction recognition scenario of an AI voice assistant, where semantic instructions of a user can be recognized accurately and rapidly; it can also be applied to the voice question answering of an AI robot, where the text expression matched with the voice input by the user can be found, so that answer information corresponding to that text expression can be provided to the user; and it can be applied to cross-modal data retrieval scenarios, for example in a search engine or in various applications, where matching audio data can be found based on text data input by a user, for example finding corresponding music according to the user's search text and providing it to the user. In addition, the first feature extraction network and the second feature extraction network provided by the embodiment of the application can be applied to various scenarios requiring data feature extraction, and can extract data features with better semantic expression capability.
In practical implementation, both the data to be processed and the candidate standard data may include at least one type of data; for example, the data of the first modality is voice and the data of the second modality is text. It should be understood that when the data to be processed or the data of the second modality includes more than one type, then when training the first feature extraction network and the second feature extraction network, the data of the first modality (first data and third data) and the data of the second modality (second data and fourth data) in the training data set should at least include the data types corresponding to the data to be processed and the candidate standard data. For example, if the data to be processed includes data of a first type and data of a second type, and the candidate standard data is data of a third type, then at least part of the first data and third data in the training data set should also include data of the first type and the second type, and the corresponding second data should also be data of the third type. That is, the types of sample data in the training data set used when training the feature extraction networks should correspond to the types of data processed after training.
In order to better understand the method provided by the embodiment of the present application and the practical value of the method, the method provided by the embodiment of the present application is described below with reference to specific scene embodiments.
The application scene corresponding to this scene embodiment is a game scene. The method provided by the embodiment of the application can be applied to an AI voice assistant in a game application, where the AI voice assistant can recognize a voice instruction input by a user. In a game scenario, a user may interact with the AI voice assistant through voice during game play; for example, the user says "mark P city" while playing a game on his user terminal, intending the AI voice assistant to mark "P city" at the location of "P city" on the map of the game's virtual scene, and the voice command "mark item A" intends the AI voice assistant to mark "item A" in the virtual game scene.
Fig. 2 shows a schematic structural diagram of a data processing system applicable to this scene embodiment of the present application. As shown in fig. 2, the data processing system may include a user terminal 10, a game server 20 and a training server 30, where the user terminal 10 may be the user terminal of any game player and the game server 20 provides the game service for the players. The embodiment of the present application does not limit the type of game application, which may be a game application that the user downloads and installs, a cloud game application, or a game application in an applet. The training server 30 may be communicatively coupled to the game server 20 via a network, and may be configured to perform the training of the neural network model and provide the trained neural network model to the game server 20.
The above-mentioned AI voice assistant may be deployed in the game server 20 or in the user terminal 10; to reduce the computing load on the user terminal 10, deployment on the game server 20 side is taken as the example.
In this application scenario, the data to be processed is the voice data of a user, the first standard data are pre-configured standard voice instructions, that is, voice instructions supported by the game application, and the standard expressions corresponding to the first standard data are text expressions (that is, the candidate standard data). An alternative flow of implementing the method provided by the present application in a game scenario is described below in connection with the data processing system shown in fig. 2. The data processing flow in this embodiment may include the following steps:
step S1: training a neural network model.
This step may be performed by the training server 30. A schematic diagram of the training principle of the neural network model in this scenario is shown in fig. 3. As shown in fig. 3, the training process may include two phases, pre-training and fine-tuning; fig. 4 shows a schematic diagram of the pre-training phase, and fig. 5 shows a schematic diagram of the fine-tuning phase.
As shown in fig. 3, the neural network model in this scene embodiment includes a speech encoding module (the first network model), a text encoding module (the second network model), and a classification module. Speech data passes through the speech coding module to obtain the corresponding speech representation vector (i.e., the feature of the speech data, which may also be referred to as a vector representation), and text data passes through the text coding module to obtain the corresponding text representation vector (i.e., the feature of the text data). The application does not limit the specific network structures of these modules, which can be configured according to actual needs.
Alternatively, the speech coding module may employ a structure based on the Wav2vec2 model; for example, a pooling module (the pooling operation shown in fig. 4) may be connected after the Wav2vec2 model (Wav2vec2 shown in fig. 4) to form the speech coding module, where the Wav2vec2 model is composed of a multi-layer CNN and a multi-layer Transformer. When the speech coding module performs feature extraction, the speech data is first encoded by the Wav2vec2 model to obtain a vector sequence, each element of which is a vector (which can be understood as the feature vector of a speech segment); the average of the vector sequence can then be calculated by the pooling operation to obtain a single vector, namely the speech representation vector. For the text encoding module, alternatively, a Bert model-based structure may be employed, the model being composed of multiple layers of Transformers. The training process of the two stages of pre-training and fine-tuning is described below with reference to figs. 3 to 5.
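The following sketch shows one possible realization of the two encoding modules using the Hugging Face transformers library; the checkpoint names are placeholders rather than those used in the patent, and mean pooling is one of several valid pooling choices:

```python
import torch
from transformers import Wav2Vec2Model, BertModel

# Placeholder checkpoints, not necessarily those used in the embodiment.
speech_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
text_encoder = BertModel.from_pretrained("bert-base-chinese")

def encode_speech(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (batch, samples). Returns one speech representation
    vector per utterance via mean pooling over the frame sequence."""
    frames = speech_encoder(waveform).last_hidden_state  # (B, T, d)
    return frames.mean(dim=1)  # pooling operation -> (B, d)

def encode_text(input_ids: torch.Tensor,
                attention_mask: torch.Tensor) -> torch.Tensor:
    """Returns one text representation vector per sentence."""
    out = text_encoder(input_ids=input_ids, attention_mask=attention_mask)
    return out.last_hidden_state.mean(dim=1)  # (B, d)
```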
Pre-training process
The pre-training is based on the first training set, which includes a large amount of speech–translation text pair data (i.e., first samples each comprising speech data and the text data matched with that speech data). For convenience of description, the speech data in the training data set (the first data in fig. 3) will be referred to as sample speech, and the text data corresponding to the speech data (the second data in fig. 3) as translation text, and the pre-training process is described on this basis.
As shown in figs. 3 and 4, one training pass may include: inputting each sample voice into the speech coding module to obtain its speech representation vector, and inputting each translation text into the text coding module to obtain its text representation vector; the training loss of the pre-training stage (the first training loss value), that is, the total error shown in fig. 4, is then calculated based on the speech representation vectors and the text representation vectors. If the total error of this pass meets the first preset condition, the pre-training stage ends and the intermediate model is obtained; if not, the model parameters of the speech coding module and the text coding module are adjusted (the parameter update shown in fig. 4) and the training process is repeated until the total error meets the first preset condition.
As shown in fig. 4, in this scene embodiment the pre-training uses multiple training targets, that is, multiple training errors (training losses), specifically including the CTC error (the third loss part), the matching error (corresponding to the second loss value above), and the distillation error (corresponding to the first loss value above); the first training loss value of the pre-training stage is the sum of these three parts, whose meanings are as follows:
(1) CTC error: this error is calculated based on the intermediate vectors of the speech coding module (the features of each sub-data) and a label (the data feature of the translation text corresponding to the dictionary), and characterizes the match between the sample speech and the translation text in a positive sample. The label is generated from the translation text and is the pinyin sequence of the corresponding text; pinyin is used instead of words in order to reduce the size of the dictionary, i.e., the number of data elements in the dictionary and hence the length of the label. The vector sequence output by the Wav2vec2 model, i.e., the features of each speech segment of the speech data, serves together with the label as the input for calculating the CTC error. The CTC error L_ctc corresponding to one positive sample can be expressed as:

L_ctc = −log p(y | x) = −log Σ_{π ∈ B⁻¹(y)} p(π | c_1, c_2, …, c_T)

where y represents the label, x represents the sample voice, c_1, c_2, …, c_T represent the representation vectors of the speech segments (speech frames) output after the sample voice is encoded by the Wav2vec2 model of the speech encoding module, and π represents a legal sequence (path) corresponding to the sample voice x, i.e., a path that collapses to the label y (the pinyin sequence) under the mapping B. L_ctc thus aggregates the probabilities of all paths that yield the label from the representation vectors of the speech segments, that is, the probability of predicting the label from those vectors.

For a speech–translation text pair, this error represents the likelihood of predicting the label y from the dictionary-aligned feature vectors of the speech segments output by the Wav2vec2 model. In the training process, the purpose of the CTC error is to enable the vector sequence output by the Wav2vec2 model to predict the pinyin of the translation text, so that these vectors contain semantic information.
(2) Matching error: the representation vectors corresponding to the sample speech and the translation text serve as the input of this error, whose purpose is to make the similarity of the two vector representations of a speech–translation text pair (positive example) higher than the similarity of the two vector representations of a speech and a non-matching translation text (negative example); alternatively, cosine similarity may be used for the similarity calculation. During training, one batch of speech–translation text data is processed at a time; each speech–translation text pair in the batch is a positive example, and each speech forms negative examples with the other translation texts in the batch.
Alternatively, the matching error may be a multi-class cross entropy loss. Specifically, after obtaining the speech representation vector of each sample speech and the text representation vector of each translation text through the model, for each sample speech, a first similarity between the two vector representations of the positive example to which the speech belongs is calculated, as well as a second similarity for each negative example to which it belongs, that is, the similarity between the speech representation vector of that speech and the text representation vector of each other translation text (the texts other than the translation text matched with the speech). As a schematic description, assuming there are 10 speech–translation text pairs in a batch, then for each sample speech there are nine second similarities and one first similarity; the 10 similarities may be used as the predicted distribution, with the first similarity as its first value, so that the predicted distribution can be represented as [p1, p2, …, p10], and the true label (reference label) is the distribution [1, 0, …, 0], in which the value corresponding to the first similarity is 1 and the values corresponding to the second similarities are 0; the matching error is obtained by calculating the cross entropy between the predicted distribution and the true label.
In the training process, the constraint of the matching error makes the similarity between the vector representations of the positive examples learned by the model higher than the similarity between the vector representations of the negative examples.
(3) Distillation error: the vector representations corresponding to a positive example serve as the input of this error, and the mean square error (MSE) of the two vectors is calculated; the purpose of this error is to make the two vector representations of a speech–translation text pair close in distance.
The three errors are calculated and their values summed (or averaged) to obtain the overall error; if the overall error does not meet the first preset condition, the model parameters of the speech coding module and the text coding module are updated and training is repeated.
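As a trivial sketch of this combination (assuming the three per-part errors have already been computed as tensors, e.g., with the loss helpers sketched earlier), the overall error is just their sum or mean:

```python
import torch

def pretrain_total_error(l_ctc: torch.Tensor, l_match: torch.Tensor,
                         l_mse: torch.Tensor,
                         average: bool = False) -> torch.Tensor:
    """Overall error of one pre-training pass: the CTC, matching and
    distillation errors are summed (or averaged, as mentioned above)."""
    total = l_ctc + l_match + l_mse
    return total / 3.0 if average else total
```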
After the intermediate model satisfying the first preset condition is obtained, training may be continued on each part of the intermediate model based on the second training set, i.e. fine-tuning training.
Fine tuning training process
Fig. 5 shows a schematic diagram of the fine-tuning training process. As shown in figs. 3 and 5, in the fine-tuning stage, besides the speech coding module and the text coding module, the classification model (the rejection classification module shown in fig. 5) must also be trained; through fine-tuning training, a neural network model meeting the second preset condition is obtained. The speech coding module and the classification module at this point may be deployed in the AI voice assistant of the game server 20, to identify the class of the voice data input by a game player and to extract the speech representation vector of that voice data. Optionally, the text coding module may also be deployed in the game server to extract the text representation vectors of the standard text expressions corresponding to the standard voice instructions; alternatively, the text coding module may be deployed in another device, which extracts the text representation vector of the standard text expression corresponding to each standard voice instruction and then provides the vectors to the game server. The application does not limit the structure of the classification module; for example, the rejection classification module may be composed of a fully-connected network. The process of fine-tuning training is described below in conjunction with figs. 3 and 5.
In the fine-tuning training phase, since the rejection classification module determines whether the sample speech input to the speech coding module is an instruction, each second sample in the second training set includes, in addition to the sample speech and its translation text, a type tag of the sample speech; moreover, besides samples composed of instruction speech and the translation text of that instruction speech, the second training set also includes samples composed of non-instruction speech and the translation text of that non-instruction speech. For example, a type tag of 1 indicates instruction speech, and 0 indicates non-instruction speech.
As shown in fig. 5, in the fine-tuning stage, each sample voice (instruction voice and non-instruction voice) similarly passes through the speech coding module to obtain its speech representation vector, and the translation text of each sample voice passes through the text coding module to obtain the corresponding text representation vector. The speech representation vector of each sample voice then passes through the rejection classification module to obtain the probability that the sample voice is instruction speech. Alternatively, the objective function (i.e., loss function) of this stage may have two parts, namely the rejection classification error and the matching error: the rejection classification error may use the two-class cross entropy error, calculated from the difference between the probability that the sample speech is instruction speech as predicted by the classification model and the type label of the sample speech, while the matching error may be the same as in the pre-training stage. The average or sum of the two errors serves as the second training loss value (the total error in fig. 5); if the total error meets the second preset condition, the trained model is obtained, and if not, the model parameters of the speech coding module, the text coding module and the classification model are updated and the fine-tuning process repeated.
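A hedged sketch of this fine-tuning objective, combining the rejection classification error with the same in-batch matching error as the pre-training stage (all names and the equal weighting are assumptions):

```python
import torch
import torch.nn.functional as F

def finetune_total_error(speech_vec: torch.Tensor, text_vec: torch.Tensor,
                         reject_logits: torch.Tensor,
                         type_labels: torch.Tensor) -> torch.Tensor:
    """speech_vec / text_vec: (N, d) matched representation vectors;
    reject_logits: (N, 2) rejection classification module outputs;
    type_labels: (N,) with 1 for instruction speech, 0 otherwise."""
    # Matching error: same in-batch cross entropy form as pre-training.
    sim = F.normalize(speech_vec, dim=-1) @ F.normalize(text_vec, dim=-1).t()
    match_error = F.cross_entropy(
        sim, torch.arange(sim.size(0), device=sim.device))

    # Rejection classification error: two-class cross entropy.
    reject_error = F.cross_entropy(reject_logits, type_labels)

    return match_error + reject_error  # an average works equally well
```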
Step S2: and constructing a target database.
Before applying the model and processing the data to be processed, the target database needs to be built; in this application scenario the target database comprises the instruction library and the instruction vector library shown in fig. 2. The instruction library stores the standard expressions (candidate standard data) in text form corresponding to each standard voice instruction, and the instruction vector library stores the text vector representations (second data features) of each standard expression in the instruction library. The instruction library may be constructed according to instruction intention categories: the library contains a plurality of specific intentions (corresponding to the standard voice instructions, each of which may be understood as one intention of the user), each intention being a category and given a category id, while for each intention one or more natural language texts expressing the intention may be constructed as standard expressions of that intention. For example, the voice command "mark item A" can be a category, and the corresponding standard expressions can be various expressions such as "mark item A", "item A is here", and the like.
For the instruction vector library, the text vector representations may be extracted by the trained text encoding module. As shown in fig. 6, each standard expression in the instruction library may be input into the text encoding module to obtain the corresponding text vector representation, and the text vector representation of the standard expression for each intent is stored into the instruction vector library.
The scheme provided by the embodiment of the application makes updating and expanding the instruction library convenient and fast: for example, when a new voice instruction is added, only the text expression corresponding to the new voice instruction needs to be added to the instruction library, with its text vector representation extracted by the text coding module and stored into the instruction vector library; no retraining of the model is needed. This well meets the application requirement where the instruction categories of the target database are not fixed (one standard voice instruction can be regarded as one category).
Alternatively, in the fine-tuning training phase, the instruction speech–translation text pairs in the second training set may include pairs composed of a standard voice instruction in the instruction library and a standard expression corresponding to that standard voice instruction.
Step S3: and processing the data to be processed.
In the application stage, each voice input of the user (i.e., a potential voice instruction, that is, the data to be processed in this application scenario) may be processed with the flow shown in fig. 7. Specifically, the trained speech encoding module obtains the speech representation vector (i.e., the first data feature) of the voice input; this vector serves as the query vector, and the similarity (i.e., matching score) between the query vector and each vector representation in the instruction vector library is calculated, from which the maximum matching score and its corresponding text vector representation can be selected (the instruction and matching score shown in fig. 7). In addition, the speech representation vector is input to the rejection classification module to obtain a rejection score (e.g., the probability that the voice input is a voice instruction). Then, according to the matching score and the rejection score, a rule judgment module may judge, according to a preconfigured judgment rule, whether the standard expression corresponding to the text vector representation with the maximum similarity is the target standard data matching the voice input. For example, the judgment rule may be that the matching score is greater than a first threshold and the rejection score is greater than a second threshold; if the condition is satisfied, the standard expression corresponding to the maximum similarity is determined as the target standard data, that is, the user's voice input is considered to be that voice instruction (the instruction finally output in fig. 7), and the action corresponding to the standard voice instruction of that standard expression can be executed. For example, if the standard expression is "mark item A", the game server can execute the corresponding marking action and display the marking result to the player through the user interface of the user terminal (i.e., the interactive interface of the game application).
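The processing flow of fig. 7 could look roughly like the following sketch; the rejection head, both thresholds and all names are assumptions for illustration rather than values from the patent:

```python
import torch
import torch.nn.functional as F

def handle_voice_input(speech_vec, instruction_vectors, standard_expressions,
                       reject_head, match_threshold=0.6, reject_threshold=0.5):
    """speech_vec: (1, d) speech representation vector of the voice input
    (the query vector); instruction_vectors: (M, d) text vector library;
    standard_expressions: list of M standard expression strings."""
    query = F.normalize(speech_vec, dim=-1)
    scores = query @ F.normalize(instruction_vectors, dim=-1).t()  # (1, M)
    match_score, idx = scores.max(dim=-1)          # best matching score

    # Rejection score: probability that the input is a voice instruction.
    reject_score = reject_head(speech_vec).softmax(dim=-1)[:, 1]

    # Rule judgment: both scores must exceed their preconfigured thresholds.
    if (match_score.item() > match_threshold
            and reject_score.item() > reject_threshold):
        return standard_expressions[idx.item()]    # target standard data
    return None  # rejected: not treated as a supported voice instruction
```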
As an example, figs. 8a and 8b show schematic diagrams of the user interface of a game scenario. During play, a player may operate the game by initiating voice commands or by manual operation; specifically, the player may control his player character through an input control device on or external to the terminal device. As shown in fig. 8a, during the game the player may click the "voice assistant" control (control 81 in fig. 8a) to turn the AI voice assistant on or off, and may initiate voice commands while the assistant is on. In this game scenario, if the player wants to mark the virtual item A in the scene, he can control the player character (the virtual object controlled by the player) to aim at item A; after aiming, he can either issue a voice command for marking item A, or mark it by clicking the "mark" control shown in fig. 8a. In the voice mode, the AI voice assistant can implement the provided data processing method (any optional embodiment corresponding to steps S120 to S140), or send the voice command to the game server, which by executing the method determines that the real intention of the user (i.e., the target standard data) is to mark "item A"; the AI voice assistant can then mark item A in the game scene. As shown in fig. 8b, after item A is marked, corresponding marking information may be displayed, including but not limited to attribute information of item A (such as its category), marking prompt information (such as "abc marked item A" in fig. 8b and the marker 83 suspended above item A, abc being the player's in-game name, that is, a nickname), and the current distance of the player character from the item in the game scene (5 meters in the figure).
In practical applications, if the player is currently in a team game, then after the player marks item A, other players in the player's team can also see corresponding prompt information on their user interfaces, such as "abc marked item A" and information on the relative positions of their player characters and item A in the game scene.
As an alternative, to accelerate retrieval, when determining the standard expression with the highest matching score by calculating the similarity between the query vector and the vector representations in the instruction vector library, the Faiss retrieval method (a fast similarity-search method) may be adopted; correspondingly, after obtaining the text vector representation of each standard expression through the text encoding module, the instruction vector library may be built with the feature-vector-index construction of the Faiss method, so as to improve retrieval efficiency.
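A minimal sketch of building and querying the instruction vector library with Faiss, assuming the text vectors have already been extracted by the text encoding module; the exact index type is a choice, and an exact inner-product index over L2-normalized vectors yields cosine-similarity matching scores:

```python
import faiss
import numpy as np

# Placeholder for the (M, d) float32 matrix of text vector representations
# produced by the trained text encoding module for the standard expressions.
text_vectors = np.random.rand(100, 768).astype("float32")

faiss.normalize_L2(text_vectors)       # normalize so inner product = cosine
index = faiss.IndexFlatIP(text_vectors.shape[1])
index.add(text_vectors)                # the instruction vector library

def top_match(query_vec: np.ndarray, k: int = 1):
    """Returns the top-k matching scores and expression ids for a query
    (speech representation) vector of shape (1, d)."""
    query = np.ascontiguousarray(query_vec, dtype=np.float32)
    faiss.normalize_L2(query)
    scores, ids = index.search(query, k)
    return scores, ids
```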
Therefore, when the method provided by the embodiment of the application is adopted for recognizing voice data, no speech recognition model is needed; the text expression matched with the voice data can be identified quickly and accurately based on the similarity between the speech representation vector of the voice data and the text representation vectors of the text data, the standard voice instruction corresponding to that text expression can be taken as the true intention of the voice data, and the corresponding operation executed according to that intention. This method effectively reduces computational cost and improves processing efficiency. In addition, the cross-modal retrieval mode of retrieving text with speech addresses well the problem of poor recognition accuracy when there are many instruction categories and meets practical application requirements: when a new instruction category appears, the model need not be retrained, and only the new instruction text must be added to the retrieval library (namely the instruction vector library). Furthermore, through the training mode of pre-training plus fine-tuning combined with the multi-target optimization scheme, the speech coding module learns the semantic information of speech well, which improves recognition accuracy.
It should be noted that, in practical application, besides the staged training method provided by the embodiment of the present application, the two stages may also be combined into one, or performed alternately.
To verify the effect of the method provided by the embodiment of the application, the method and the existing scheme were comparatively tested in a game scene. The test used 1,469 manually annotated data items. The existing scheme adopts a pipeline of automatic speech recognition (ASR, Automatic Speech Recognition) followed by natural language understanding (NLU, Natural Language Understanding), and accuracy, recall and false recall are used as evaluation metrics, where false recall refers to the proportion of non-instruction speech for which the model gives an instruction recognition result. Higher accuracy and recall are better, and lower false recall is better. Table 1 below shows the results for the scheme of the embodiment of the present application and the existing scheme; it should be noted that the test data were screened so that the non-instruction voice data closely resemble voice instructions, which is why the false recall rate of both schemes is high.
Table 1. Comparison on manually annotated data

Model                | Accuracy | Recall | False recall rate
Existing scheme      | 38.02%   | 61.97% | 64.20%
Present application  | 42.86%   | 76.40% | 60.34%
As can be seen, compared with the existing scheme, the accuracy and recall rate of the neural network model provided by the embodiment of the application are obviously improved, and the false recall rate is obviously reduced.
Based on the same principle as the method provided by the embodiment of the present application, the embodiment of the present application further provides a data processing apparatus, and as shown in fig. 9, the data processing apparatus 100 may include a data acquisition module 110 to be processed, a feature acquisition module 120, and a data identification module 130.
The data to be processed obtaining module 110 is configured to obtain data to be processed, where the data to be processed is data of a first modality;
the feature acquisition module 120 is configured to extract a first data feature of the data to be processed;
the data identification module 130 is configured to match the first data feature with at least one second data feature in the target database to obtain a matching result corresponding to each second data feature, and to determine, according to the matching result corresponding to each second data feature, the target standard data matching the data to be processed from the candidate standard data;
The target database comprises at least one candidate standard data and a second data characteristic of each candidate standard data, and the candidate standard data is data of a second modality.
Optionally, the data identification module is further configured to: determining the data type of the data to be processed according to the first data characteristics; accordingly, the data identification module is configured to, when matching the first data feature with at least one second data feature in the target database:
and when the data type of the data to be processed is a specified type, matching the first data characteristic with at least one second data characteristic in the target database.
Optionally, the data of the first modality and the data of the second modality are data of different modalities, the data of the first modality includes at least one of text, voice, video or image, and the data of the second modality includes at least one of text, voice, video or image.
Optionally, the candidate standard data is a standard expression matched with first standard data in a standard database, the first standard data is data of a first modality, and one first standard data corresponds to at least one standard expression.
Optionally, the feature acquisition module is further configured to: when newly added first standard data exists in the standard database, at least one standard expression corresponding to the newly added first standard data is obtained; extracting second data characteristics of each standard expression corresponding to the newly added first standard data; and storing each standard expression corresponding to the newly added first standard data and the second data characteristic association corresponding to each standard expression into a target database.
Optionally, the first data feature is extracted through a first feature extraction network; the second data features of the candidate standard data are extracted through a second feature extraction network; the first feature extraction network and the second feature extraction network are obtained by training a model training module through the following modes:
acquiring a training data set, wherein the training data set comprises a first training set, and each first sample in the first training set comprises first data of a first mode and second data of a second mode matched with the first data;
Performing iterative training on an initial neural network model based on a training data set until a training total loss value meets a preset training ending condition, wherein the neural network model comprises a first network model and a second network model, the first network model when the training ending condition is met is used as a first feature extraction network, and the second network model when the training ending condition is met is used as a second feature extraction network; the training process comprises the following steps:
Inputting each first data into a first network model to obtain the characteristics of each first data, and inputting each second data into a second network model to obtain the characteristics of each second data;
Determining a first training loss value based on the degree of matching of the features of the first data and the features of the second data in each first sample and the degree of matching of the features of the first data and the features of the second data in each first negative example; wherein the first negative example comprises first data of one first sample and second data of another first sample;
If the first training loss value does not meet the first preset condition, model parameters of the first network model and the second network model are adjusted, and the total training loss value meets the preset training ending condition, including that the first training loss value meets the first preset condition.
Optionally, the model training module is configured to perform an operation when inputting each first data into the first network model to obtain a feature of each first data:
for each first data, performing the following operations on the first data through the first network model to obtain the features of the first data: dividing the first data into at least two sub-data to obtain a sub-data sequence corresponding to the first data; extracting features of each sub-data in the sub-data sequence based on a dictionary, wherein the dictionary comprises a plurality of data elements, the number of feature values included in the features of each sub-data is equal to the number of elements in the dictionary, and one feature value characterizes the probability that the data element at the corresponding position in the dictionary is contained in the sub-data; and obtaining the features of the first data based on the features of each sub-data;
The model training module may also be used to: for each second data, determining, based on the dictionary, a data characteristic of the second data corresponding to the dictionary, the data characteristic characterizing a probability that the second data corresponds to a respective data element in the dictionary;
The model training module may be configured to, when determining the first training loss value: the first training loss value is determined based on a degree of matching between the feature of each sub-data of the first data in each first sample and the feature of the second data corresponding to the dictionary, a degree of matching between the feature of the first data in each first sample and the feature of the second data, and a degree of matching between the feature of the first data and the feature of the second data in each first negative example.
Optionally, the model training module may be configured to, when determining the first training loss value based on a degree of matching between the feature of the first data and the feature of the second data in each first sample and a degree of matching between the feature of the first data and the feature of the second data in each first negative example:
Determining the degree of difference between the characteristics of the first data and the characteristics of the second data of each first sample to obtain a first loss value;
For each first data, determining a first similarity corresponding to the first data and a second similarity corresponding to the first data, wherein the first similarity is the similarity between the features of the first data and the features of the second data matched with the first data, and the second similarity is the similarity between the features of the first data and the features of the second data in the first negative example where the first data is located;
obtaining reference labels corresponding to the first data, wherein the reference labels comprise similarity labels corresponding to the first similarity and similarity labels corresponding to the second similarity;
Determining a second loss value based on the predicted similarity and the reference label corresponding to each first data, wherein the predicted similarity comprises the first similarity and the second similarity, and the second loss value characterizes the difference between the predicted similarity and the reference label corresponding to each first data;
a first training loss value is determined based on the first loss value and the second loss value.
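A minimal sketch of how the first loss value and the second loss value could be combined into the first training loss value is given below, assuming PyTorch, in-batch negatives, mean-squared error as the degree of difference, and one-hot reference labels; the loss weights and temperature are hypothetical.

```python
import torch
import torch.nn.functional as F

def first_training_loss(first_feats, second_feats, alpha=1.0, beta=1.0, tau=0.07):
    """Minimal sketch. Row i of each (B, D) tensor is a matched first/second data
    pair, so off-diagonal pairs act as the first negative examples (an assumption)."""
    # First loss: degree of difference between matched features (MSE, assumed).
    loss1 = F.mse_loss(first_feats, second_feats)

    # Predicted similarities: the diagonal holds the first similarity, the
    # off-diagonal entries the second similarities; tau is a hypothetical temperature.
    f = F.normalize(first_feats, dim=-1)
    s = F.normalize(second_feats, dim=-1)
    sims = f @ s.t() / tau                                   # (B, B)

    # Reference labels as the real probability distribution: similarity label 1
    # for the matched pair, 0 for negatives (one-hot per first data).
    targets = torch.arange(first_feats.size(0), device=first_feats.device)
    loss2 = F.cross_entropy(sims, targets)

    # First training loss from the first and second loss values (weights assumed).
    return alpha * loss1 + beta * loss2
```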
Optionally, the candidate standard data is a standard expression of the second modality corresponding to first standard data of a specified type; the initial neural network model also includes a classification model; the training data set further comprises a second training set, each second sample in the second training set comprising third data of the first modality, fourth data of the second modality matched with the third data, and a type label of the third data, wherein the third data in the second training set includes third data of the specified type and third data of a non-specified type; after obtaining the neural network model whose first training loss value satisfies the first preset condition, the model training module is further configured to perform the following training process:
Repeating the training operation on the neural network model based on the second training set until the second training loss value meets a second preset condition, wherein the training total loss value meeting the preset training ending condition further comprises the second training loss value meeting the second preset condition; the training operation includes:
inputting each third data into the first network model to obtain the characteristics of each third data, inputting each fourth data into the second network model to obtain the characteristics of each fourth data, and inputting the characteristics of each third data into the classification model to obtain the prediction type corresponding to each third data;
Determining a second training loss value based on the degree of matching of the features of the third data and the features of the fourth data in each second sample, the degree of matching of the features of the third data and the features of the fourth data in each second negative example, and the degree of matching between the type label and the predicted type of each third data;
and if the second training loss value does not meet the second preset condition, adjusting the model parameters of the neural network model.
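For concreteness, a minimal sketch of this second training stage follows, assuming PyTorch and the same in-batch contrastive form as the first stage; the classification model architecture, feature dimension, and loss weight gamma are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TypeClassifier(nn.Module):
    """Hypothetical classification model: predicts from a third-data feature
    whether the input is of the specified type (e.g. instruction-type speech)."""
    def __init__(self, feat_dim=256, num_types=2):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_types)

    def forward(self, feats):                    # feats: (B, feat_dim)
        return self.fc(feats)                    # predicted type logits

def second_training_loss(third_feats, fourth_feats, type_logits, type_labels,
                         tau=0.07, gamma=1.0):
    # Matching of third/fourth features over second samples and in-batch
    # second negative examples (same contrastive form as the first stage).
    t = F.normalize(third_feats, dim=-1)
    f = F.normalize(fourth_feats, dim=-1)
    sims = t @ f.t() / tau
    targets = torch.arange(third_feats.size(0), device=third_feats.device)
    loss_match = F.cross_entropy(sims, targets)
    # Matching between the type label and the predicted type.
    loss_type = F.cross_entropy(type_logits, type_labels)
    return loss_match + gamma * loss_type        # weight gamma is assumed
```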
Optionally, the data of the first mode is voice, the data of the second mode is text, and the data elements are phonemes.
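As an illustration of these choices, a toy phoneme dictionary might look as follows (the inventory below is hypothetical and truncated):

```python
# Toy dictionary of data elements; a real system would use a full phone set.
PHONEME_DICT = ["<blank>", "AA", "AE", "B", "D", "IY", "K", "S", "T"]
# A sub-data (frame) feature then holds len(PHONEME_DICT) values; the value at
# index i is the probability that phoneme PHONEME_DICT[i] occurs in that frame.
```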
Optionally, the specified type is instruction-type speech.
The apparatus of the embodiments of the present application can perform the method provided by the embodiments of the present application, and its implementation principles are similar; the actions performed by each module of the apparatus correspond to the steps of the method of the embodiments of the present application, and for detailed functional descriptions and beneficial effects of each module, reference may be made to the descriptions of the corresponding methods shown above, which are not repeated here.
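To make the end-to-end flow concrete, the following is a minimal sketch of inference against the target database and of the database-update steps described above, assuming PyTorch and pre-encoded model inputs; the function names and the (text, feature) layout of `target_db` are illustrative assumptions, not the patented implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def recognize(speech, speech_encoder, target_db):
    """target_db is assumed to be a list of (candidate_standard_text, text_feature)
    pairs built offline with the second feature extraction network; speech is the
    model-ready input of the data to be processed."""
    query = F.normalize(speech_encoder(speech), dim=-1)  # first data feature, (D,)
    texts, feats = zip(*target_db)
    feats = F.normalize(torch.stack(feats), dim=-1)      # second data features, (N, D)
    scores = feats @ query                               # matching results, (N,)
    return texts[scores.argmax().item()]                 # target standard data

@torch.no_grad()
def add_standard_expressions(target_db, expressions, text_encoder):
    """When new first standard data is added to the standard database, store each
    of its standard expressions with the corresponding second data feature."""
    for expr_input, expr_text in expressions:            # expr_input: model-ready
        target_db.append((expr_text, text_encoder(expr_input)))
```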
An electronic device is also provided in an embodiment of the present application, where the electronic device includes a memory, a processor, and a computer program stored on the memory, where the processor executes the computer program to implement the steps of the method provided in any of the alternative embodiments of the present application.
Fig. 10 shows a schematic structural diagram of an electronic device to which an embodiment of the present application is applied. As shown in Fig. 10, the electronic device 4000 includes a processor 4001 and a memory 4003, where the processor 4001 is connected to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, which may be used for data interaction between this electronic device and other electronic devices, such as transmitting and/or receiving data. It should be noted that, in practical applications, the transceiver 4004 is not limited to one, and the structure of the electronic device 4000 does not constitute a limitation on the embodiments of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various exemplary logic blocks, modules, and circuits described in connection with this disclosure. The processor 4001 may also be a combination that implements computing functionality, for example, a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 4002 may include a path to transfer information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 can be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in Fig. 10, but this does not mean that there is only one bus or one type of bus.
The memory 4003 may be, but is not limited to, a ROM (Read-Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media, other magnetic storage devices, or any other medium that can be used to carry or store a computer program and that can be read by a computer.
The memory 4003 stores a computer program for executing the method provided by the embodiments of the present application, and its execution is controlled by the processor 4001. When executing the above computer program stored in the memory 4003, the processor 4001 can implement the steps shown in any of the foregoing method embodiments of the present application.
An embodiment of the present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, can implement the steps and corresponding contents of any of the foregoing method embodiments of the present application.
An embodiment of the present application also provides a computer program product comprising a computer program which, when executed by a processor, can implement the steps and corresponding contents of any of the foregoing method embodiments of the present application.
It should be noted that the terms "first," "second," "third," "fourth," "1," "2," and the like in the description and claims of the present application and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, such that the embodiments of the application described herein may be implemented in other sequences than those illustrated or otherwise described.
It should be understood that, although the flowcharts of the embodiments of the present application indicate the operation steps with arrows, the order in which these steps are performed is not limited to the order indicated by the arrows. Unless explicitly stated herein, in some implementation scenarios of the embodiments of the present application, the steps in the flowcharts may be performed in other orders as required. Furthermore, depending on the actual implementation scenario, some or all of the steps in the flowcharts may include multiple sub-steps or multiple stages. Some or all of these sub-steps or stages may be performed at the same time, or each may be performed at a different time. When performed at different times, the execution order of the sub-steps or stages can be flexibly configured as required, which is not limited by the embodiments of the present application.
The foregoing are merely optional implementations of some implementation scenarios of the present application. It should be noted that, for those skilled in the art, other similar implementations based on the technical ideas of the present application, adopted without departing from the technical concept of the solution of the present application, also fall within the protection scope of the embodiments of the present application.
Claims (17)
1. A method of data processing, comprising:
Acquiring data to be processed, wherein the data to be processed is data of a first mode, and the data of the first mode is voice data;
extracting first data features of the data to be processed through a first feature extraction network, wherein the first data features are voice feature vectors;
Determining the similarity between the first data feature and at least one second data feature in a target database to obtain a matching result corresponding to each second data feature, wherein the target database comprises at least one candidate standard data and the second data feature of each candidate standard data, the candidate standard data is data of a second mode, the data of the second mode is text data, and the second data feature is a text feature vector obtained by feature extraction of the candidate standard data through a second feature extraction network;
Determining target standard data matched with the data to be processed from the candidate standard data according to the matching result corresponding to each second data characteristic, wherein the target standard data is a voice recognition result of the data to be processed;
Wherein the first feature extraction network and the second feature extraction network are trained by:
acquiring a training data set comprising a first training set, each first sample in the first training set comprising first data of a first modality and second data of a second modality matched with the first data;
Performing iterative training on an initial neural network model comprising a first network model and a second network model until the total training loss value meets a preset training ending condition, and taking the first network model and the second network model when the condition is met as the first feature extraction network and the second feature extraction network, respectively; the training process comprises the following steps:
Inputting the first data into a first network model to obtain the characteristics of the first data, and inputting the second data into a second network model to obtain the characteristics of the second data;
determining the difference degree of the characteristic of the first data and the characteristic of the second data in each first sample to obtain a first loss value;
For each first data, determining a first similarity between the characteristic of the first data and the characteristic of second data matched with the first data and a second similarity between the characteristic of the first data and the characteristic of the second data in a first negative example where the first data is located, and obtaining a prediction probability distribution based on the first similarity and each second similarity; the first negative example comprises first data of one first sample and second data of another first sample;
Acquiring reference labels corresponding to the first data, wherein the reference labels comprise similarity labels corresponding to the first similarity and similarity labels corresponding to the second similarity;
Taking the reference tag as a real probability distribution, and determining a second loss value based on the difference between the predicted probability distribution corresponding to each piece of first data and the real probability distribution;
Determining a first training loss value according to the first loss value and the second loss value;
and if the first training loss value does not meet a first preset condition, adjusting model parameters of the first network model and the second network model, wherein the training total loss value meeting the preset training ending condition comprises the first training loss value meeting the first preset condition.
2. The method as recited in claim 1, further comprising:
Determining the data type of the data to be processed according to the first data characteristics;
the matching the first data feature with at least one second data feature in a target database includes:
And when the data type of the data to be processed is a specified type, matching the first data characteristic with at least one second data characteristic in a target database.
3. The method of claim 2, wherein the specified type is instruction-type speech.
4. A method according to any one of claims 1 to 3, wherein the candidate standard data is a standard expression matching first standard data in a standard database, the first standard data being data of the first modality, and one piece of first standard data corresponding to at least one standard expression.
5. The method as recited in claim 4, further comprising:
When newly added first standard data exist in the standard database, at least one standard expression corresponding to the newly added first standard data is obtained;
Extracting second data characteristics of each standard expression corresponding to the newly added first standard data;
and storing each standard expression corresponding to the newly added first standard data and the second data characteristic corresponding to each standard expression in the target database in a correlated manner.
6. The method of claim 1, wherein said inputting each of said first data into a first network model results in a characteristic of each of said first data, comprising:
for each first data, performing the following operations on the first data through the first network model to obtain the characteristics of the first data:
dividing the first data into at least two sub-data to obtain a sub-data sequence corresponding to the first data;
extracting features of each sub-data in the sub-data sequence based on a dictionary, wherein the dictionary comprises a plurality of data elements, the number of feature values included in the features of each sub-data is equal to the number of elements in the dictionary, and one feature value characterizes the probability that the data element corresponding to the position of the feature value in the dictionary is included in the sub-data;
based on the characteristics of each sub data, obtaining the characteristics of the first data;
the training process further comprises:
For each of the second data, determining, based on the dictionary, a data characteristic of the second data corresponding to the dictionary, the data characteristic characterizing a probability that the second data corresponds to a respective data element in the dictionary;
the determining a first training loss value includes:
Determining a third loss value based on a degree of matching between features of respective sub-data of the first data in each of the first samples and data features of the second data corresponding to the dictionary;
And determining a first training loss value according to the first loss value, the second loss value and the third loss value.
7. The method according to claim 1 or 6, wherein the candidate standard data is a standard expression of a second modality corresponding to a first standard data of a specified type; the initial neural network model further comprises a classification model;
The training data set further comprises a second training set, each second sample in the second training set comprises third data of a first mode, fourth data of a second mode matched with the third data and a type label of the third data, wherein the third data in the second training set comprises third data of a specified type and third data of a non-specified type;
After obtaining the neural network model that the first training loss value meets the first preset condition, the training process further includes:
Repeating the training operation on the neural network model based on the second training set until a second training loss value meets a second preset condition, wherein the training total loss value meeting the preset training ending condition further comprises the second training loss value meeting the second preset condition; the training operation includes:
inputting each third data into a first network model to obtain the characteristics of each third data, inputting each fourth data into a second network model to obtain the characteristics of each fourth data, and inputting the characteristics of each third data into a classification model to obtain the prediction type corresponding to each third data;
Determining a second training loss value based on the degree of matching of the features of the third data and the features of the fourth data in each second sample, the degree of matching of the features of the third data and the features of the fourth data in each second negative example, and the degree of matching between the type label and the prediction type of each third data;
and if the second training loss value does not meet the second preset condition, adjusting the model parameters of the neural network model.
8. A data processing apparatus, the apparatus comprising:
The device comprises a data acquisition module to be processed, a data processing module and a data processing module, wherein the data acquisition module is used for acquiring data to be processed, the data to be processed is data of a first mode, and the data of the first mode is voice data;
the feature acquisition module is used for extracting first data features of the data to be processed through a first feature extraction network, wherein the first data features are voice feature vectors;
The data identification module is used for determining the similarity between the first data characteristic and at least one second data characteristic in a target database, obtaining a matching result corresponding to each second data characteristic, and determining target standard data matched with the data to be processed from candidate standard data according to the matching result corresponding to each second data characteristic;
The target database comprises at least one candidate standard data and a second data feature of each candidate standard data, wherein the candidate standard data is data of a second mode, the data of the second mode is text data, and the second data feature is a text feature vector obtained by feature extraction of the candidate standard data through a second feature extraction network;
the target standard data is a voice recognition result of the data to be processed;
Wherein the first feature extraction network and the second feature extraction network are trained by:
acquiring a training data set comprising a first training set, each first sample in the first training set comprising first data of a first modality and second data of a second modality matched with the first data;
Performing iterative training on an initial neural network model comprising a first network model and a second network model until the total training loss value meets a preset training ending condition, and taking the first network model and the second network model when the condition is met as the first feature extraction network and the second feature extraction network, respectively; the training process comprises the following steps:
Inputting the first data into a first network model to obtain the characteristics of the first data, and inputting the second data into a second network model to obtain the characteristics of the second data;
determining the difference degree of the characteristic of the first data and the characteristic of the second data in each first sample to obtain a first loss value;
For each first data, determining a first similarity between the characteristic of the first data and the characteristic of second data matched with the first data and a second similarity between the characteristic of the first data and the characteristic of the second data in a first negative example where the first data is located, and obtaining a prediction probability distribution based on the first similarity and each second similarity; the first negative example comprises first data of one first sample and second data of another first sample;
Acquiring reference labels corresponding to the first data, wherein the reference labels comprise similarity labels corresponding to the first similarity and similarity labels corresponding to the second similarity;
Taking the reference tag as a real probability distribution, and determining a second loss value based on the difference between the predicted probability distribution corresponding to each piece of first data and the real probability distribution;
Determining a first training loss value according to the first loss value and the second loss value;
and if the first training loss value does not meet a first preset condition, adjusting model parameters of the first network model and the second network model, wherein the training total loss value meeting the preset training ending condition comprises the first training loss value meeting the first preset condition.
9. The apparatus of claim 8, wherein the data identification module, when matching the first data characteristic with at least one second data characteristic in a target database, is to:
Determining the data type of the data to be processed according to the first data characteristics;
And when the data type of the data to be processed is a specified type, matching the first data characteristic with at least one second data characteristic in a target database.
10. The apparatus of claim 9, wherein the specified type is instruction-type speech.
11. The apparatus according to any one of claims 8 to 10, wherein the candidate standard data is a standard expression matching first standard data in a standard database, the first standard data being data of the first modality, and one piece of first standard data corresponding to at least one standard expression.
12. The apparatus of claim 11, wherein the feature acquisition module is further configured to:
When newly added first standard data exist in the standard database, at least one standard expression corresponding to the newly added first standard data is obtained;
Extracting second data characteristics of each standard expression corresponding to the newly added first standard data;
and storing each standard expression corresponding to the newly added first standard data and the second data characteristic corresponding to each standard expression in the target database in a correlated manner.
13. The apparatus of claim 8, wherein the characteristics of each of the first data are obtained by:
for each first data, performing the following operations on the first data through the first network model to obtain the characteristics of the first data:
dividing the first data into at least two sub-data to obtain a sub-data sequence corresponding to the first data;
extracting features of each sub-data in the sub-data sequence based on a dictionary, wherein the dictionary comprises a plurality of data elements, the number of feature values included in the features of each sub-data is equal to the number of elements in the dictionary, and one feature value characterizes the probability that the data element corresponding to the position of the feature value in the dictionary is included in the sub-data;
based on the characteristics of each sub data, obtaining the characteristics of the first data;
the training process further comprises:
For each of the second data, determining, based on the dictionary, a data characteristic of the second data corresponding to the dictionary, the data characteristic characterizing a probability that the second data corresponds to a respective data element in the dictionary;
The first training loss value is determined by:
Determining a third loss value based on a degree of matching between features of respective sub-data of the first data in each of the first samples and data features of the second data corresponding to the dictionary;
And determining a first training loss value according to the first loss value, the second loss value and the third loss value.
14. The device according to claim 8 or 13, wherein,
The candidate standard data is a standard expression of a second mode corresponding to the first standard data of the appointed type; the initial neural network model further comprises a classification model;
The training data set further comprises a second training set, each second sample in the second training set comprises third data of a first mode, fourth data of a second mode matched with the third data and a type label of the third data, wherein the third data in the second training set comprises third data of a specified type and third data of a non-specified type;
After obtaining the neural network model that the first training loss value meets the first preset condition, the training process further includes:
Repeating the training operation on the neural network model based on the second training set until a second training loss value meets a second preset condition, wherein the training total loss value meeting the preset training ending condition further comprises the second training loss value meeting the second preset condition; the training operation includes:
inputting each third data into a first network model to obtain the characteristics of each third data, inputting each fourth data into a second network model to obtain the characteristics of each fourth data, and inputting the characteristics of each third data into a classification model to obtain the prediction type corresponding to each third data;
Determining a second training loss value based on the degree of matching of the features of the third data and the features of the fourth data in each second sample, the degree of matching of the features of the third data and the features of the fourth data in each second negative example, and the degree of matching between the type label and the prediction type of each third data;
and if the second training loss value does not meet the second preset condition, adjusting the model parameters of the neural network model.
15. An electronic device comprising a memory having a computer program stored therein and a processor executing the computer program to implement the method of any of claims 1 to 7.
16. A computer-readable storage medium, characterized in that the storage medium has stored therein a computer program which, when executed by a processor, implements the method of any one of claims 1 to 7.
17. A computer program product, characterized in that the computer program product comprises a computer program which, when executed by a processor, implements the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210118785.7A CN114444609B (en) | 2022-02-08 | 2022-02-08 | Data processing method, device, electronic equipment and computer readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210118785.7A CN114444609B (en) | 2022-02-08 | 2022-02-08 | Data processing method, device, electronic equipment and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114444609A CN114444609A (en) | 2022-05-06 |
CN114444609B true CN114444609B (en) | 2024-10-01 |
Family
ID=81371679
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210118785.7A Active CN114444609B (en) | 2022-02-08 | 2022-02-08 | Data processing method, device, electronic equipment and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114444609B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117725248A (en) * | 2023-04-17 | 2024-03-19 | 书行科技(北京)有限公司 | Image-based text matching method and device, computer equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103077714A (en) * | 2013-01-29 | 2013-05-01 | 华为终端有限公司 | Information identification method and apparatus |
CN111986661A (en) * | 2020-08-28 | 2020-11-24 | 西安电子科技大学 | Deep neural network speech recognition method based on speech enhancement in complex environment |
CN113392341A (en) * | 2020-09-30 | 2021-09-14 | 腾讯科技(深圳)有限公司 | Cover selection method, model training method, device, equipment and storage medium |
CN113590850A (en) * | 2021-01-29 | 2021-11-02 | 腾讯科技(深圳)有限公司 | Multimedia data searching method, device, equipment and storage medium |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
SG11202009556XA (en) * | 2018-03-28 | 2020-10-29 | Telepathy Labs Inc | Text-to-speech synthesis system and method |
CN109582775B (en) * | 2018-12-04 | 2024-03-26 | 平安科技(深圳)有限公司 | Information input method, device, computer equipment and storage medium |
CN113421551B (en) * | 2020-11-16 | 2023-12-19 | 腾讯科技(深圳)有限公司 | Speech recognition method, speech recognition device, computer readable medium and electronic equipment |
CN112699213A (en) * | 2020-12-23 | 2021-04-23 | 平安普惠企业管理有限公司 | Speech intention recognition method and device, computer equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN114444609A (en) | 2022-05-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109101537B (en) | Multi-turn dialogue data classification method and device based on deep learning and electronic equipment | |
CN111090727B (en) | Language conversion processing method and device and dialect voice interaction system | |
CN111046132A (en) | Customer service question and answer processing method and system for retrieving multiple rounds of conversations | |
CN111444340A (en) | Text classification and recommendation method, device, equipment and storage medium | |
CN115329127A (en) | Multi-mode short video tag recommendation method integrating emotional information | |
CN114596844B (en) | Training method of acoustic model, voice recognition method and related equipment | |
CN113239169A (en) | Artificial intelligence-based answer generation method, device, equipment and storage medium | |
CN110414004A (en) | A kind of method and system that core information extracts | |
CN111898379B (en) | Slot filling model training method, electronic equipment and storage medium | |
Liu et al. | Open intent discovery through unsupervised semantic clustering and dependency parsing | |
CN117689963B (en) | Visual entity linking method based on multi-mode pre-training model | |
CN117132923A (en) | Video classification method, device, electronic equipment and storage medium | |
CN116258147A (en) | Multimode comment emotion analysis method and system based on heterogram convolution | |
CN114444609B (en) | Data processing method, device, electronic equipment and computer readable storage medium | |
CN115858756A (en) | Shared emotion man-machine conversation system based on perception emotional tendency | |
CN115186071A (en) | Intention recognition method and device, electronic equipment and readable storage medium | |
CN113420136A (en) | Dialogue method, system, electronic equipment, storage medium and program product | |
CN115376547B (en) | Pronunciation evaluation method, pronunciation evaluation device, computer equipment and storage medium | |
CN115512692B (en) | Voice recognition method, device, equipment and storage medium | |
CN116956915A (en) | Entity recognition model training method, device, equipment, storage medium and product | |
CN116186255A (en) | Method for training unknown intention detection model, unknown intention detection method and device | |
CN116186220A (en) | Information retrieval method, question and answer processing method, information retrieval device and system | |
CN114330367A (en) | Semantic similarity obtaining method, device and equipment based on sentences | |
CN112925961A (en) | Intelligent question and answer method and device based on enterprise entity | |
CN115658935B (en) | Personalized comment generation method and device |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40070386; Country of ref document: HK |
| GR01 | Patent grant | |