CN108335693B - Language identification method and language identification equipment - Google Patents
Language identification method and language identification equipment
- Publication number
- CN108335693B (application CN201710035625.5A / CN201710035625A)
- Authority
- CN
- China
- Prior art keywords
- target
- audio
- training
- video data
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Links
- 238000000034 method Methods 0.000 title claims abstract description 87
- 238000012549 training Methods 0.000 claims abstract description 171
- 230000007787 long-term memory Effects 0.000 claims abstract description 19
- 230000008569 process Effects 0.000 claims abstract description 16
- 230000006403 short-term memory Effects 0.000 claims abstract description 11
- 230000015654 memory Effects 0.000 claims description 19
- 238000000605 extraction Methods 0.000 claims description 15
- 238000001914 filtration Methods 0.000 claims description 15
- 238000001514 detection method Methods 0.000 claims description 13
- 230000000694 effects Effects 0.000 claims description 9
- 238000012545 processing Methods 0.000 claims description 8
- 230000004913 activation Effects 0.000 claims description 4
- 238000004590 computer program Methods 0.000 claims 1
- 239000010410 layer Substances 0.000 description 22
- 238000013528 artificial neural network Methods 0.000 description 20
- 238000010586 diagram Methods 0.000 description 9
- 238000005516 engineering process Methods 0.000 description 8
- 230000007774 longterm Effects 0.000 description 7
- 238000004891 communication Methods 0.000 description 6
- 230000006870 function Effects 0.000 description 4
- 230000000306 recurrent effect Effects 0.000 description 4
- 238000012706 support-vector machine Methods 0.000 description 4
- 238000012360 testing method Methods 0.000 description 4
- 230000008878 coupling Effects 0.000 description 3
- 238000010168 coupling process Methods 0.000 description 3
- 238000005859 coupling reaction Methods 0.000 description 3
- 238000011161 development Methods 0.000 description 3
- 238000004422 calculation algorithm Methods 0.000 description 2
- 238000012544 monitoring process Methods 0.000 description 2
- 238000003058 natural language processing Methods 0.000 description 2
- 230000002085 persistent effect Effects 0.000 description 2
- 230000003595 spectral effect Effects 0.000 description 2
- 230000004888 barrier function Effects 0.000 description 1
- 230000006399 behavior Effects 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 238000007635 classification algorithm Methods 0.000 description 1
- 238000011840 criminal investigation Methods 0.000 description 1
- 238000003066 decision tree Methods 0.000 description 1
- 230000003111 delayed effect Effects 0.000 description 1
- 239000002355 dual-layer Substances 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 239000000463 material Substances 0.000 description 1
- 239000000203 mixture Substances 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 239000002356 single layer Substances 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 230000001052 transient effect Effects 0.000 description 1
- 238000013519 translation Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/005—Language recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The embodiment of the invention discloses a language identification method and a language identification device. The method comprises the following steps: performing feature extraction on target audio and video data used for offline training to obtain feature data corresponding to the target audio and video data, and sequentially performing iterative training on the feature data through N hierarchically ordered layers of long short-term memory (LSTM) networks included in a training network, so as to obtain a target training model for language identification. The method can be applied to large data sets; language recognition performed with the target training model shown in the embodiment is highly accurate and fast, and can meet current speed requirements for language identification.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a language identification method and language identification equipment.
Background
With increasingly close international communication, the requirements on the speed of language identification are rising in many fields, such as information query services, alarm systems, banks, stock exchanges, and emergency hotline services. Taking information query services as an example, many information query systems can provide multilingual service, but only after the system determines the user's language can it provide service in the corresponding language in a targeted manner. Typical examples of such services include travel information, emergency services, and shopping.
Most language identification schemes currently on the market adopt traditional shallow-model methods such as the Gaussian Mixture Model (GMM) or the Support Vector Machine (SVM).
However, the language identification scheme adopted in the prior art cannot be practically used on a large data set, and has low accuracy and low speed, so that the current speed requirement for language identification cannot be met.
Disclosure of Invention
The embodiment of the invention provides a language identification method and language identification equipment, which can be applied to a large data set for language identification and have high identification accuracy and high speed.
A first aspect of an embodiment of the present invention provides a language identification method, including:
acquiring target audio and video data for offline training;
extracting the characteristics of the target audio and video data to obtain characteristic data corresponding to the target audio and video data;
and sequentially carrying out iterative training on the characteristic data through N layers of long-term and short-term memory networks (LSTM) which are arranged in a hierarchy and included in the training network to obtain a target training model, wherein the target training model is used for language identification.
A second aspect of the embodiments of the present invention provides a language identification method, including:
acquiring first target audio and video data for online identification;
extracting the characteristics of the first target audio and video data to obtain first characteristic data corresponding to the first target audio and video data;
determining a target training model, wherein the target training model is obtained by training second target audio and video data by using a training network, the training network comprises N layers of long-term memory networks (LSTM) which are ordered according to levels, and N is a positive integer greater than or equal to 2;
acquiring a target score according to the target training model and the first characteristic data;
and determining language identification result information corresponding to the target score, wherein the language identification result information is used for indicating the language to which the first target audio and video data belongs.
A third aspect of the embodiments of the present invention provides a language identification device, including:
the first acquisition unit is used for acquiring target audio and video data for offline training;
the second acquisition unit is used for extracting the characteristics of the target audio and video data to acquire characteristic data corresponding to the target audio and video data;
the training unit is further configured to perform iterative training on the feature data sequentially through N layers of long-term and short-term memory networks LSTM included in the training network and ordered according to a hierarchy, so as to obtain a target training model, where the target training model is used for language identification.
A fourth aspect of the embodiments of the present invention provides a language identification device, including:
the first acquisition unit is used for acquiring first target audio and video data for online identification;
the first identification unit is used for extracting the characteristics of the first target audio and video data to acquire first characteristic data corresponding to the first target audio and video data;
the first determining unit is used for determining a target training model, wherein the target training model is obtained by training second target audio and video data by using a training network, the training network comprises N layers of long-time memory networks (LSTMs) which are ordered according to levels, and N is a positive integer greater than or equal to 2;
the second acquisition unit is used for acquiring a target score according to the target training model and the first characteristic data;
and the second determining unit is used for determining language identification result information corresponding to the target score, and the language identification result information is used for indicating the language to which the first target audio and video data belongs.
The method shown in this embodiment can perform feature extraction on target audio and video data used for offline training to obtain feature data corresponding to the target audio and video data, and then sequentially perform iterative training on the feature data through N hierarchically ordered layers of long short-term memory (LSTM) networks included in a training network, so as to obtain a target training model for language recognition. The method can be applied to large data sets; language recognition performed with the target training model shown in this embodiment is highly accurate and fast, and can meet current speed requirements for language identification.
Drawings
FIG. 1 is a schematic structural diagram of a language identification device according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a language identification method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a cycle of a recurrent neural network provided by the present invention;
FIG. 4 is a schematic structural diagram of an LSTM network provided by the present invention;
FIG. 5 is a schematic diagram of a training network according to the present invention;
FIG. 6 is a flowchart illustrating a language identification method according to another embodiment of the present invention;
FIG. 7 is a schematic structural diagram of another embodiment of a language identification device according to the present invention;
fig. 8 is a schematic structural diagram of another embodiment of the language identification device according to the present invention.
Detailed Description
The language identification method provided by the embodiment of the present invention can be applied to a language identification device with a computing function, and in order to better understand the language identification method provided by the embodiment of the present invention, an entity structure of the language identification device provided by the embodiment of the present invention is first described below with reference to fig. 1.
It should be understood that the following description of the entity structure of the language identification device provided in the embodiment of the present invention is an optional example, and is not limited to this, as long as the language identification method provided in the embodiment of the present invention can be implemented.
As shown in fig. 1, fig. 1 is a schematic structural diagram of a language identification device 100 according to an embodiment of the present invention, whose structure may vary considerably with configuration or performance. It may include one or more Central Processing Units (CPUs) 122 (e.g., one or more processors), a memory 132, and one or more storage media 130 (e.g., one or more mass storage devices) for storing applications 142 or data 144. The memory 132 and the storage medium 130 may be transient or persistent storage. The program stored on the storage medium 130 may include one or more modules (not shown), each of which may include a series of instruction operations on the language identification device. Still further, the central processor 122 may be configured to communicate with the storage medium 130 and execute, on the language identification device 100, a series of instruction operations in the storage medium 130.
The language identification device 100 may also include one or more power supplies 126, one or more wired or wireless network interfaces 150, one or more input-output interfaces 158, and/or one or more operating systems 141, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The language identification device 100 shown in fig. 1 can implement language identification (LID) for speech.
LID is the process by which the language identification device 100 automatically identifies the language to which a speech signal belongs.
Automatic language identification technology has important applications in information retrieval, criminal investigation, and the military. With the development of internet technology, language identification plays an increasingly important role; as the technology improves, the barriers to human communication may eventually be broken down, and language identification will play an important part in that. One day, people of different nationalities and skin colors who speak different languages may communicate freely by technical means, with language identification serving as an important front-end processor. In information services, for example, many information query systems may provide multilingual service: after determining the language category of a user, the system provides service in the corresponding language. Typical examples of such services include travel information, emergency services, shopping, banking, and stock trading.
Automatic language identification technology can also be used in front-end processing of multilingual machine translation systems, as well as communication systems that directly convert one language to another.
In addition, the method can be used in military applications to monitor speakers or to distinguish a speaker's identity and nationality. With the advent of the information age and the development of the internet, language identification shows more and more application value.
Based on the language identification device shown in fig. 1, the following describes in detail the specific execution step flow of the language identification method provided by the embodiment of the present invention with reference to fig. 2, wherein fig. 2 is a flow chart of the steps of the language identification method provided by the embodiment of the present invention.
First, steps 201 to 207 shown in this embodiment are the specific execution flow of the offline training part:
Step 201, acquiring a second audio/video file for offline training.
In the process of executing the offline training part, the language identification device may first acquire the second audio/video file for offline training.
The number of the audio-video data included in the second audio-video file is not limited in this embodiment.
And 202, decoding the second audio and video file through a decoder to generate second audio and video data.
The decoder used in this embodiment of the present invention may be the multimedia video processing tool FFmpeg (full name: Fast Forward MPEG).
It should be clear that, in this embodiment, the description of the decoder is an optional example, and is not limited, as long as the decoder can decode the second audio/video file to generate the second audio/video data capable of performing language identification.
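Purely as an illustrative sketch (the patent does not prescribe any particular invocation), decoding an audio/video file with FFmpeg into audio data suitable for later processing could look as follows; the output format of 16 kHz mono PCM WAV and the file-naming scheme are assumptions.

```python
import subprocess
from pathlib import Path

def decode_to_wav(av_file: str, out_dir: str = "decoded", sample_rate: int = 16000) -> str:
    """Decode an audio/video file with FFmpeg into mono PCM WAV.

    The mono, 16 kHz, WAV output format is an assumption chosen to suit
    typical speech feature extraction; the embodiment does not fix these values.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    out_path = str(Path(out_dir) / (Path(av_file).stem + ".wav"))
    cmd = [
        "ffmpeg", "-y",           # overwrite any existing output
        "-i", av_file,            # input audio/video file
        "-vn",                    # drop the video stream, keep audio only
        "-ac", "1",               # mono
        "-ar", str(sample_rate),  # resample to the target rate
        "-f", "wav",
        out_path,
    ]
    subprocess.run(cmd, check=True)
    return out_path
```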
And 203, filtering the second audio and video data to generate second target audio and video data.
In order to reduce the duration of the execution of the offline training part, improve the language identification efficiency, and improve the language identification accuracy, the language identification device shown in this embodiment may filter the second audio/video data.
Specifically, the language identification device shown in this embodiment performs voice activity detection (VAD) so as to filter the invalid silence segments in the second audio/video data and generate the second target audio/video data.
As can be seen, the data included in the second target audio/video data obtained in step 203 shown in this embodiment is all valid data, so that the time length and system resources wasted by the language identification device processing useless data are avoided.
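A minimal sketch of this filtering step, assuming the open-source webrtcvad package and 16 kHz, 16-bit mono PCM input; the 30 ms frame length and the aggressiveness level are illustrative choices rather than values fixed by the embodiment.

```python
import webrtcvad

def drop_silence(pcm16: bytes, sample_rate: int = 16000, frame_ms: int = 30) -> bytes:
    """Remove frames that VAD labels as non-speech (invalid silence segments)."""
    vad = webrtcvad.Vad(2)                                 # aggressiveness 0-3; 2 is a middle ground
    bytes_per_frame = sample_rate * frame_ms // 1000 * 2   # 16-bit samples -> 2 bytes each
    voiced = bytearray()
    for start in range(0, len(pcm16) - bytes_per_frame + 1, bytes_per_frame):
        frame = pcm16[start:start + bytes_per_frame]
        if vad.is_speech(frame, sample_rate):              # keep only active-speech frames
            voiced.extend(frame)
    return bytes(voiced)
```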
And 204, performing feature extraction on the second target audio/video data to acquire second feature data.
Specifically, in this embodiment, the language identification device may perform feature extraction on the second target audio/video data, so as to obtain second feature data corresponding to the second target audio/video data.
The feature extraction method capable of extracting features of the second target audio/video data shown in this embodiment may be a spectral envelope method, a cepstrum method, an LPC interpolation method, an LPC root method, a Hilbert transform method, a formant tracking algorithm, or the like.
The feature extraction method is not limited in this embodiment, as long as the second feature data of the second target audio data can be extracted.
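As one possible illustration of the feature-extraction step (the embodiment leaves the exact method open), MFCC features could be computed with librosa; the number of coefficients and the frame settings below are assumptions.

```python
import librosa
import numpy as np

def extract_features(wav_path: str, n_mfcc: int = 40) -> np.ndarray:
    """Return a (time, n_mfcc) matrix of MFCC features for one utterance."""
    y, sr = librosa.load(wav_path, sr=16000)                  # mono, 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)    # 25 ms window, 10 ms hop
    return mfcc.T.astype(np.float32)                          # frames along the first axis
```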
And step 205, setting a target classification label in the second characteristic data.
The target classification tag is a tag for indicating a language of the target audio data.
The target classification label shown in this embodiment is a label corresponding to the second feature data.
In this embodiment, the target classification tag is set in the second feature data, so that the second feature data is classified according to different languages.
In brief, the classification shown in this embodiment assigns the second feature data to existing categories according to its language features or attributes.
As in natural language processing NLP, text classification is a classification problem, and general pattern classification methods can be used for text classification research.
Commonly used classification algorithms include: decision tree classification, the naive Bayesian classifier, classifiers based on a Support Vector Machine (SVM), neural network methods, the k-nearest neighbor method (kNN), fuzzy classification methods, and the like.
For example, taking a Tibetan language identification scenario as an illustration, the target classification label for Tibetan may be predetermined to be 1; if the target classification label is used to distinguish Tibetan from other languages, then in step 205 shown in this embodiment the target classification label 1 is set in the second feature data.
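A toy sketch of attaching the target classification label to the feature data, following the Tibetan-versus-other example above (label 1 for Tibetan, 0 otherwise); the container type and the `label_sample` helper are hypothetical.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class LabelledFeatures:
    features: np.ndarray   # (time, feature_dim) matrix from feature extraction
    label: int             # target classification label, e.g. 1 = Tibetan, 0 = other

def label_sample(features: np.ndarray, language: str) -> LabelledFeatures:
    """Set the target classification label according to the sample's language."""
    return LabelledFeatures(features=features, label=1 if language == "Tibetan" else 0)
```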
And step 206, inputting the second feature data provided with the target classification label into the training network.
And step 207, performing iterative training on the second feature data provided with the target classification label through the training network to obtain the target training model.
Specifically, in this embodiment, the second feature data provided with the target classification label is sequentially iteratively trained through the N layers of long-term and short-term memory networks LSTM included in the training network, so as to obtain the target training model.
More specifically, in step 206 shown in this embodiment, the language identification device has sent the second feature data with the target classification label to the training network, so that the training network can perform iterative training on the second feature data with the target classification label in sequence through the N layers of long-and-short term memory networks LSTM included in the training network to obtain the target training model.
The following describes the N layers of long-term memory networks LSTM ordered in a hierarchical manner included in the training network shown in this embodiment:
Human beings do not start thinking from scratch every second; we understand each word based on the preceding words rather than discarding everything and starting over, because human thinking is persistent. Conventional neural networks cannot do this, which is one of their major drawbacks. For example, to classify what is happening at each time point within a movie, a conventional neural network cannot apply its reasoning about previous events to later events.
Recurrent Neural Networks (RNNs) solve this problem. They are networks with loops, with the ability to hold information. The RNN can be thought of as multiple copies of the same neural network, each neural network module passing a message to the next.
The following describes the cycle of the recurrent neural network RNN with reference to fig. 3, in which the neural network 301 in fig. 3 is a schematic diagram of the neural network in which the cycle is not yet expanded, and the neural network 302 in fig. 3 is a schematic diagram of the neural network in which the cycle is expanded.
It can be seen that the expanded neural network 302 includes a plurality of sequentially connected neural network modules a.
Specifically, in the neural network 301 and the neural network 302, the input of the neural network module A is x_t and the output is h_t.
In the neural network 302, the loop structure of each neural network module A causes information to pass from one step of the network to the next. A recurrent neural network can therefore be thought of as multiple copies of the same network, each passing a message to its successor.
The RNN can learn to use past information and concatenate previous information into the present task, e.g., information from a previous frame of video data can be used to understand information from a current frame of video data.
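In the standard textbook formulation (not reproduced in the patent text itself), the recurrence that passes information from one step of the unrolled network to the next can be written as:

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h), \qquad y_t = W_{hy} h_t + b_y$$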
Consider a language model that attempts to predict the next word based on the current words. If we try to predict the last word of "the clouds are in the sky", we do not need any additional information: it is clear that the next word is "sky". In this case, the interval between the point of the target prediction and its related information is small, and the earlier context information can be forgotten.
But sometimes we need more context. Consider predicting the last word of this sentence: "I grew up in France … I speak fluent French." The recent information suggests that the next word is the name of a language, but to narrow it down to a specific language we need the earlier context "France". Here the interval between the point to be predicted and its associated information becomes very large, and we need to remember and rely on the context information.
That is, depending on the situation, sometimes the context information needs to be forgotten and sometimes it needs to be remembered. A conventional RNN cannot solve this long-term dependency problem, where long-term dependency as shown in this embodiment means long-term memorization of, and dependence on, context information. The LSTM shown in this embodiment, however, can solve the problem of long-term dependence on context information.
The LSTM network is a special RNN that is able to learn long-term dependencies. LSTM is specifically designed to avoid the long-term dependency problem; remembering information over long periods is its default behavior rather than something it struggles to learn. The LSTM has strong temporal correlation, can make good use of context, and obtains better results on tasks with sequential input, such as speech and language.
The specific structure of the LSTM network is described below with reference to fig. 4:
the LSTM network has a forget gate structure 401 and can choose to rely on long-term context when the context needs to be memorized, and choose to forget when the context needs to be forgotten. In this way, the long-term dependency problem can be solved well.
Specifically, the LSTM network is provided with a block structure containing three gates: an input gate 402, an output gate 403, and a forget gate 401.
The input gate 402 filters the input and stores the filtered input in the memory cell 404, so that the memory cell 404 holds both the previous state and the current state.
The cooperation of the three gates allows the LSTM network to store long-term information; for example, information stored in the memory cell 404 is not overwritten by a later input as long as the input gate 402 remains closed.
In the LSTM network, the memory cell 404 can be used to keep a record as the error propagates back from the output layer, so the LSTM can remember information for a relatively long time.
More specifically, the input gate 402 controls the input information; its inputs are the output of the hidden node at the previous time point and the current input, and multiplying the output of the input gate 402 by the output of the input node controls the amount of information passed on.
The forget gate 401 controls the internal state information; its inputs are the output of the hidden node at the previous time point and the current input.
The output gate 403 controls the output information; its inputs are the output of the hidden node at the previous time point and the current input. Its activation function is sigmoid, and because the sigmoid output lies between 0 and 1, multiplying the output of the output gate 403 by the output of the internal state node controls the amount of information passed on.
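For reference, the standard LSTM gate equations, consistent with the sigmoid-gated multiplications described above (σ is the sigmoid function and ⊙ denotes element-wise multiplication), are:

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate 402)}\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate 401)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate 403)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(memory cell 404)}\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$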
The specific structure of the training network shown in this embodiment is described in detail below with reference to fig. 5:
as shown in fig. 5, the training network shown in this embodiment includes N layers of long-term memory networks LSTM sorted in a hierarchical manner, and the specific number of N is not limited in this embodiment as long as N is a positive integer greater than or equal to 2.
In this embodiment, the case where N is equal to 2 is taken as an optional example for description; that is, this embodiment takes a training network including two layers of LSTM as an example.
Specifically, in the two layers of LSTM, the output of the LSTM of the previous layer serves as the input of the next layer; it can be seen that data flows through the multiple layers of LSTM.
The two-layer LSTM shown in this embodiment offers better performance than a single-layer LSTM and can use the LSTM parameters more efficiently.
Because the training network shown in this embodiment includes a plurality of LSTM layers, the LSTM at a lower layer can correct the iterative parameters passed in by the LSTM at the upper layer, so using multiple LSTM layers can effectively improve the accuracy of language identification.
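A minimal sketch of such a hierarchically stacked two-layer LSTM classifier and its iterative training loop, written in PyTorch purely as an assumption (the patent does not name a framework); the layer sizes, learning rate, and number of epochs are illustrative.

```python
import torch
import torch.nn as nn

class LanguageIdNet(nn.Module):
    """N hierarchically stacked LSTM layers followed by a classification layer."""
    def __init__(self, feature_dim: int = 40, hidden_dim: int = 256,
                 num_layers: int = 2, num_languages: int = 2):
        super().__init__()
        # The output of each LSTM layer is fed as the input of the next layer.
        self.lstm = nn.LSTM(feature_dim, hidden_dim,
                            num_layers=num_layers, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_languages)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feature_dim) feature data
        out, _ = self.lstm(x)          # (batch, time, hidden_dim)
        last = out[:, -1, :]           # hidden state after the final frame
        return self.classifier(last)   # unnormalised scores per language

def train(model: nn.Module, loader, epochs: int = 10, lr: float = 1e-3) -> None:
    """Iteratively train on labelled feature batches (illustrative loop only)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, labels in loader:   # feature data with target classification labels
            opt.zero_grad()
            loss = loss_fn(model(feats), labels)
            loss.backward()
            opt.step()
```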
Optionally, in this embodiment, M iterations of the second feature data in the training network may be performed, and the training model generated in each iteration may be set as a candidate training model.
The language identification device shown in this embodiment can select the target training model from M candidate training models.
The specific manner of determining the target training model is not limited in this embodiment; for example, the language identification device shown in this embodiment may select the target training model from the M candidate training models according to coverage, false-kill rate (false positive rate), average identification speed, accuracy, and the like.
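A sketch of one way such a selection could be made, assuming a hypothetical `eval_fn` that measures each candidate on a validation set; the ranking criterion is only an example and is not prescribed by the embodiment.

```python
def select_target_model(candidates, eval_fn):
    """Pick the target training model from the M candidate training models.

    `eval_fn` is assumed to return a dict of metrics such as coverage and
    false-kill rate for one candidate; ranking by coverage minus false-kill
    rate is only one possible criterion.
    """
    def score(metrics: dict) -> float:
        return metrics["coverage"] - metrics["false_kill_rate"]
    return max(candidates, key=lambda model: score(eval_fn(model)))
```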
The following steps 208 to 214 are the specific implementation steps of the online identification part of the language identification method according to the embodiment of the present invention:
and step 208, acquiring a first audio/video file.
In this embodiment, the first audio/video file that needs to be language-recognized may be input to the language recognition device shown in this embodiment.
For example, the first audio/video file shown in this embodiment may include 4654 videos, and 4654 videos are input to the language identification device.
And 209, decoding the first audio and video file through a decoder to generate first audio and video data.
The decoder used in this embodiment of the present invention may be the multimedia video processing tool FFmpeg (full name: Fast Forward MPEG).
It should be clear that, in this embodiment, the description of the decoder is an optional example, and is not limited, as long as the decoder can decode the first audio/video file to generate the first audio/video data capable of performing language identification.
And 210, filtering the first audio and video data to generate first target audio and video data.
In order to reduce the execution duration of the online identification part, improve the language identification efficiency, and improve the language identification accuracy, the language identification device shown in this embodiment may filter the first audio/video data.
Specifically, the language identification device shown in this embodiment performs voice activity detection (VAD) so as to filter the invalid silence segments in the first audio/video data and generate the first target audio/video data.
As can be seen, the data included in the first target audio/video data obtained in step 210 shown in this embodiment is all valid data, which avoids the time and system resources the language identification device would otherwise waste on processing useless data.
And step 211, performing feature extraction on the first target audio and video data to obtain first feature data.
Specifically, in this embodiment, the language identification device may perform feature extraction on the first target audio/video data, so as to obtain first feature data corresponding to the first target audio/video data.
The feature extraction method capable of extracting features of the first target audio/video data shown in this embodiment may be a spectral envelope method, a cepstrum method, an LPC interpolation method, an LPC root method, a Hilbert transform method, a formant tracking algorithm, or the like.
The feature extraction method is not limited in this embodiment, as long as the first feature data of the first target audio data can be extracted.
Step 212, determining the target training model. In the process of executing step 212, the language identification device shown in this embodiment first needs to acquire the target training model obtained in step 207.
And step 213, acquiring a target score according to the target training model and the first characteristic data.
The language identification device shown in this embodiment can perform corresponding calculation according to the acquired target training model and the first feature data, so as to acquire the target score.
Specifically, the language identification device shown in this embodiment may calculate each parameter of the target training model and the first feature data to obtain the target score.
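One plausible reading of this scoring step, reusing the LanguageIdNet sketch above and taking the softmax probability of the target language as the target score; this concrete choice is an assumption, since the embodiment only requires a score computed from the model parameters and the first feature data.

```python
import torch

@torch.no_grad()
def target_score(model: torch.nn.Module, features: torch.Tensor,
                 target_index: int = 1) -> float:
    """Compute a target score for one utterance from the target training model.

    Here the score is taken as the softmax probability of the target language
    (index 1, e.g. Tibetan); the patent only requires some score derived from
    the model parameters and the first feature data.
    """
    logits = model(features.unsqueeze(0))   # add a batch dimension
    probs = torch.softmax(logits, dim=-1)
    return float(probs[0, target_index])
```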
And 214, determining language identification result information corresponding to the target score.
Specifically, the language identification result information shown in this embodiment is used to indicate the language to which the first target audio/video data belongs.
More specifically, the language identification device shown in this embodiment is preset with corresponding relationships between different score ranges and different languages, and in the process of executing step 214 shown in this embodiment, the language identification device may first determine a target score range to which the target score belongs, and then the language identification device may determine the language identification result information corresponding to the target score range.
For example, taking the case where the language corresponding to the first feature data described in this embodiment is Tibetan, the language identification device shown in this embodiment may store the score range corresponding to Tibetan in advance, for example 0 to 1; when the target score identified by the language identification device falls within this score range, the device identifies the file corresponding to the first feature data as a Tibetan audio/video file. For example, if the language identification device identifies the target score as 0.999, it determines that 0.999 falls within the score range 0 to 1, and therefore identifies the file corresponding to the first feature data as a Tibetan audio/video file.
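A minimal sketch of the preset correspondence between score ranges and languages and the lookup of the language identification result; the single Tibetan range mirrors the example above and is purely illustrative.

```python
# Preset correspondence between score ranges and languages (illustrative values).
SCORE_RANGES = [
    ((0.0, 1.0), "Tibetan"),   # e.g. a score of 0.999 falls within 0-1 -> Tibetan
]

def language_for_score(score: float, default: str = "other") -> str:
    """Return the language identification result for the range the score falls in."""
    for (low, high), language in SCORE_RANGES:
        if low <= score <= high:
            return language
    return default
```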
An advantage of the method shown in this embodiment is that the language identification device does not need to analyze the content of the audio/video file; it only needs to create a target training model that can be trained to recognize the language to which the audio/video file belongs. Because the target training model is obtained by training second target audio/video data with a training network comprising N hierarchically ordered layers of long short-term memory (LSTM) networks, the language identification process is efficient and fast, its accuracy and coverage are far better than those of traditional shallow-model methods and ordinary DNN networks, and the language to which an audio/video file belongs can be identified quickly and accurately.
To better illustrate the advantages of the method shown in this embodiment, the method was tested as follows:
in the test, the first audio/video file includes 79 Tibetan videos and 9604 non-Tibetan videos, wherein the maximum length of each video is 180 seconds.
In the process of determining the target training model for this test, the training model from the 4600th iteration of the training network was used as the target training model;
when the first audio/video file was recognized with the target training model and language identification result information was output, the test showed a coverage of 67/79 = 84.8%, a false-kill rate of 1/9064 ≈ 0.01%, an average Tibetan-video identification speed of 1.6 s per video, and an average normal-video identification speed of 3.4 s per video.
As another example, in another test the first audio/video file included 100 Uyghur-language videos and 9608 non-Uyghur videos, where each video has a maximum length of 180 seconds.
In the process of determining the target training model for this test, the training model from the 3400th iteration of the training network was used as the target training model;
when the first audio/video file was recognized with the target training model and language identification result information was output, the test showed a coverage of 30/100 = 30.0%, a false-kill rate of 10/9068 ≈ 0.1%, an average Uyghur-video identification speed of 1.66 s per video, and an average normal-video identification speed of 3.51 s per video.
In order to better understand the method shown in the embodiment of the present invention, an application scenario to which the method shown in the embodiment of the present invention can be applied is exemplarily described as follows:
it should be clear that the following description of the scenario to which the method according to the embodiment of the present invention is applied is an optional example and is not limited.
Scene one: field of speech recognition
With the advent of the mobile internet era, speech assistants such as Siri have become popular, and users must download speech assistants in different languages according to their own language. Various speech-to-text tools on the market likewise require choosing the tool corresponding to the spoken language, which is very inconvenient. With the language identification method shown in this embodiment, the voice assistant for the corresponding language can be quickly located according to the language spoken by the user, which is convenient and fast.
Scene two: bank and stock exchange information service
In places such as banks and stock exchanges, when a customer from an ethnic minority who cannot speak Mandarin is encountered, it is difficult to handle the related business, and a staff member who understands that minority language has to be found to receive the customer. Until then, the language spoken by the customer cannot be determined, wasting a great deal of time. With the language identification method shown in this embodiment, Tibetan-language audio, for example, can be quickly identified: the machine can be taught to recognize the speech of minority-nationality customers from what they say, quickly identify the corresponding language category, and find the relevant staff for reception.
Scene three: emergency hotline service
When handling emergency services such as a 120 emergency call or a 110 police call from a member of an ethnic minority, time is short; if the speaker's language cannot be determined, precious emergency-response time is lost and the life of the person in need may be endangered. With the language identification method shown in this embodiment, the corresponding language category is quickly identified from the caller's audio and staff who understand that language are located, saving precious time and potentially saving lives.
Scene four: riot and terrorist video identification
With the development of the mobile internet, many people like to publish videos on social software such as WeChat and QQ Zone, and hundreds of millions of videos are uploaded every day. These can include a large number of malicious videos involving politics, violence, and terrorism, and similar high-risk content such as "Tibetan independence" or "Xinjiang independence" material. Such videos are few in number, and since the daily review capacity of customer-service staff is fixed, they cannot be found effectively and a great deal of time is wasted. With the language identification method shown in this embodiment, suspected political or violent-terrorist videos can be quickly located among large numbers of videos; for example, videos whose language is identified as Tibetan or Uyghur can be handed to customer service for review, improving work efficiency and allowing malicious videos to be found and removed accurately.
Scene five: monitoring criminal suspects
When the military or police monitor suspicious individuals, the speaker's identity, nationality, and the content of the speech need to be identified, which requires a great deal of manpower and material resources and results in low efficiency. With the language identification method of this embodiment, the language of the monitored person can be determined accurately, and information such as identity and nationality can then be inferred.
The language identification device shown in this embodiment may be used to execute the language identification method shown in fig. 2 of this embodiment, and the language identification device shown in this embodiment may also execute the language identification method shown in fig. 6 of this embodiment, in fig. 6, the language identification device only needs to execute the offline training part in the language identification method.
Step 601, acquiring an audio/video file for offline training.
And step 602, decoding the audio and video file through a decoder to generate audio and video data.
And 603, filtering the audio and video data to generate target audio and video data.
And step 604, performing feature extraction on the target audio and video data to obtain feature data.
And step 605, setting a target classification label in the characteristic data.
And 606, inputting the feature data provided with the target classification label into the training network.
And step 607, performing iterative training on the feature data provided with the target classification label through the training network to obtain the target training model.
For a specific description of the audio/video file shown in this embodiment, please refer to a description of a second audio/video file shown in fig. 2 in detail, for a specific description of the target audio/video data shown in this embodiment, please refer to a description of a second target audio/video file shown in fig. 2 in detail, for a specific description of the feature data shown in this embodiment, please refer to a description of a second feature data shown in fig. 2 in detail, and details are not repeated in this embodiment.
The process shown in step 601 to step 607 in this embodiment is shown in step 201 to step 207 in fig. 2, and details thereof are not described in this embodiment.
The following describes a specific structure of the language identification device shown in this embodiment from the perspective of functional modules with reference to fig. 7:
the language identification device includes:
a third obtaining unit 701, configured to obtain the second target audio/video data;
specifically, the third obtaining unit 701 includes:
a second obtaining module 7011, configured to obtain a second audio/video file for offline training;
the second decoding module 7012 is configured to decode the second audio/video file through a decoder to generate second audio/video data;
a second filtering module 7013, configured to filter an invalid silence segment in the second audio/video data through voice activity detection VAD to generate the second target audio/video data.
A second identifying unit 702, configured to perform feature extraction on the second target audio/video data to obtain second feature data corresponding to the second target audio/video data;
a setting unit 703 configured to set a target classification tag in the second feature data, where the target classification tag is a tag used to indicate a language of the target audio data;
a training unit 704, configured to perform iterative training on the second feature data sequentially through the N layers of long-and-short term memory networks LSTM included in the training network to obtain the target training model;
the training unit 704 is further configured to sequentially perform iterative training on the second feature data with the target classification labels through the N layers of long-term and short-term memory networks LSTM included in the training network, so as to obtain the target training model.
A first obtaining unit 705, configured to obtain first target audio/video data for online identification;
specifically, the first obtaining unit 705 includes:
a first obtaining module 7051, configured to obtain a first audio/video file for online identification;
the first decoding module 7052 is configured to decode the first audio/video file through a decoder to generate first audio/video data;
a first filtering module 7053, configured to filter an invalid silence segment in the first audio/video data by voice activity detection VAD to generate the first target audio/video data.
The first identification unit 706 is configured to perform feature extraction on the first target audio/video data to obtain first feature data corresponding to the first target audio/video data;
a first determining unit 707, configured to determine a target training model, where the target training model is obtained by training second target audio/video data using a training network, the training network includes N layers of long-and-short memory networks LSTM sorted according to a hierarchy, and N is a positive integer greater than or equal to 2;
a second obtaining unit 708, configured to obtain a target score according to the target training model and the first feature data;
a second determining unit 708, configured to determine language identification result information corresponding to the target score, where the language identification result information is used to indicate a language to which the first target audio/video data belongs.
For a detailed process of the language identification device to execute the language identification method shown in this embodiment, please refer to fig. 2, which is not described in detail in this embodiment.
For a beneficial effect of the language identification device in the process of executing the language identification method shown in this embodiment, please refer to the embodiment shown in fig. 2 in detail, which is not described in detail in this embodiment.
The following describes a specific structure of the language identification device shown in this embodiment from the perspective of functional modules with reference to fig. 8, where the language identification device shown in fig. 8 can implement an offline training part in the language identification method.
Specifically, the language identification device includes:
a first obtaining unit 801, configured to obtain target audio/video data for offline training;
specifically, the first obtaining unit 801 includes:
an acquiring module 8011, configured to acquire an audio and video file for offline training;
a decoding module 8012, configured to decode the audio and video file through a decoder to generate audio and video data;
a filtering module 8013 configured to filter inactive silence segments in the audio/video data by voice activity detection, VAD, to generate the target audio/video data.
A second obtaining unit 802, configured to perform feature extraction on the target audio/video data to obtain feature data corresponding to the target audio/video data;
a setting unit 803, configured to set a target classification label in the feature data, where the target classification label is a label indicating a language of the target audio data;
a training unit 804, configured to perform iterative training on the feature data sequentially through N layers of long-and-short term memory networks LSTM included in a training network and ordered according to a hierarchy, so as to obtain a target training model, where the target training model is used for language identification;
the training unit 804 is further configured to sequentially perform iterative training on the feature data provided with the target classification label through the N layers of long-term and short-term memory networks LSTM included in the training network, so as to obtain the target training model.
The specific process of the language identification device to execute the language identification method shown in this embodiment is shown in fig. 6 for details, which are not described in detail in this embodiment.
For details, please refer to the embodiment shown in fig. 6, which is not repeated in detail in this embodiment.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a language identification device, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (18)
1. A language identification method, comprising:
acquiring target audio and video data for offline training;
extracting the characteristics of the target audio and video data to obtain characteristic data corresponding to the target audio and video data;
setting a target classification label in the feature data, wherein the target classification label is a label used for indicating the language of the target audio data;
sequentially carrying out iterative training on the characteristic data through an N-layer long-term memory network (LSTM) which is included in a training network and is ordered according to levels to obtain a target training model, wherein the target training model is used for carrying out language identification, and the language identification comprises the following steps: and acquiring a target score according to the target training model and the characteristic data of the audio and video data for online recognition, determining a preset target score range to which the target score belongs, and determining language recognition result information corresponding to the preset target score range.
2. The method of claim 1, wherein the iterative training of the feature data sequentially through N hierarchically ordered long-term memory networks LSTM comprised by a training network comprises:
and sequentially carrying out iterative training on the feature data provided with the target classification labels through the N layers of long-term and short-term memory networks (LSTM) included by the training network so as to obtain the target training model.
3. The method according to claim 1 or 2, wherein said obtaining of said target audio-visual data comprises:
acquiring an audio and video file for offline training;
decoding the audio and video file through a decoder to generate audio and video data;
and filtering invalid silence segments in the audio and video data through voice activation detection VAD to generate the target audio and video data.
4. A language identification method, comprising:
acquiring first target audio and video data for online identification;
extracting the characteristics of the first target audio and video data to obtain first characteristic data corresponding to the first target audio and video data;
determining a target training model, wherein the target training model is obtained by training second feature data by using a training network, the training network comprises N layers of long-term memory networks (LSTM) which are ordered according to a hierarchy, N is a positive integer greater than or equal to 2, the second feature data is obtained by performing feature extraction on the obtained second target audio/video data, a target classification label is set in the second feature data, and the target classification label is a label used for indicating the language of the target audio data;
acquiring a target score according to the target training model and the first characteristic data;
determining a preset target score range to which the target score belongs, and determining language identification result information corresponding to the preset target score range, wherein the language identification result information is used for indicating the language to which the first target audio/video data belongs.
5. The method according to claim 4, wherein the obtaining of the first target audio-video data for online recognition comprises:
acquiring a first audio/video file for online identification;
decoding the first audio and video file through a decoder to generate first audio and video data;
and filtering an invalid silence segment in the first audio and video data through voice activity detection VAD to generate the first target audio and video data.
6. The method according to claim 4, wherein before the acquiring of the first target audio and video data for online identification, the method further comprises:
sequentially performing iterative training on the second feature data through the N layers of long short-term memory (LSTM) networks included in the training network to obtain the target training model.
7. The method according to claim 6, wherein the sequentially performing iterative training on the second feature data through the N layers of long short-term memory (LSTM) networks included in the training network comprises:
sequentially performing iterative training on the second feature data provided with the target classification label through the N layers of long short-term memory (LSTM) networks included in the training network to obtain the target training model.
8. The method according to claim 6 or 7, wherein the acquiring of the second target audio and video data comprises:
acquiring a second audio and video file for offline training;
decoding the second audio and video file through a decoder to generate second audio and video data;
and filtering out invalid silence segments in the second audio and video data through voice activity detection (VAD) to generate the second target audio and video data.
9. A language identification apparatus, comprising:
a first acquisition unit, configured to acquire target audio and video data for offline training;
a second acquisition unit, configured to perform feature extraction on the target audio and video data to obtain feature data corresponding to the target audio and video data;
a setting unit, configured to set a target classification label in the feature data, wherein the target classification label is a label used for indicating the language of the target audio data;
and a training unit, configured to sequentially perform iterative training on the feature data through N hierarchically ordered layers of long short-term memory (LSTM) networks included in a training network to obtain a target training model, wherein the target training model is used for performing language identification, and the process of performing language identification comprises: acquiring a target score according to the target training model and feature data of audio and video data for online identification, determining a preset target score range to which the target score belongs, and determining language identification result information corresponding to the preset target score range.
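Read loosely, the units of claim 9 can be wired together as components of a single offline-training pipeline. The class and parameter names in the sketch below are hypothetical and only illustrate how the units relate; they are not part of the patent.

```python
# Illustrative sketch only: the offline-training apparatus of claim 9 viewed as
# composed components. All names (OfflineTrainer, acquire, ...) are hypothetical.
class OfflineTrainer:
    def __init__(self, acquire, extract, set_label, train):
        self.acquire = acquire      # first acquisition unit
        self.extract = extract      # second acquisition unit (feature extraction)
        self.set_label = set_label  # setting unit
        self.train = train          # training unit (N-layer LSTM training)

    def run(self, source, language_label):
        data = self.acquire(source)                    # target audio/video data
        features = self.extract(data)                  # feature data
        labelled = self.set_label(features, language_label)
        return self.train(labelled)                    # target training model
```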
10. The language identification device according to claim 9, wherein the training unit is further configured to sequentially perform iterative training on the feature data provided with the target classification label through the N layers of long short-term memory (LSTM) networks included in the training network to obtain the target training model.
11. The language identification device according to claim 9 or 10, wherein the first acquisition unit includes:
an acquisition module, configured to acquire an audio and video file for offline training;
a decoding module, configured to decode the audio and video file through a decoder to generate audio and video data;
and a filtering module, configured to filter out invalid silence segments in the audio and video data through voice activity detection (VAD) to generate the target audio and video data.
12. A language identification apparatus, comprising:
a first acquisition unit, configured to acquire first target audio and video data for online identification;
a first identification unit, configured to perform feature extraction on the first target audio and video data to obtain first feature data corresponding to the first target audio and video data;
a first determining unit, configured to determine a target training model, wherein the target training model is obtained by training second feature data using a training network, the training network comprises N hierarchically ordered layers of long short-term memory (LSTM) networks, N is a positive integer greater than or equal to 2, the second feature data is obtained by performing feature extraction on acquired second target audio and video data, a target classification label is set in the second feature data, and the target classification label is a label used for indicating the language of the target audio data;
a second acquisition unit, configured to acquire a target score according to the target training model and the first feature data;
and a second determining unit, configured to determine a preset target score range to which the target score belongs and to determine language identification result information corresponding to the preset target score range, wherein the language identification result information is used for indicating the language to which the first target audio and video data belongs.
13. The language identification device of claim 12, wherein the first obtaining unit includes:
a first acquisition module, configured to acquire a first audio and video file for online identification;
a first decoding module, configured to decode the first audio and video file through a decoder to generate first audio and video data;
and a first filtering module, configured to filter out invalid silence segments in the first audio and video data through voice activity detection (VAD) to generate the first target audio and video data.
14. The language identification device of claim 12, wherein said language identification device further comprises:
a training unit, configured to sequentially perform iterative training on the second feature data through the N layers of long short-term memory (LSTM) networks included in the training network to obtain the target training model.
15. The language identification device according to claim 14, wherein the training unit is further configured to sequentially perform iterative training on the second feature data provided with the target classification label through the N layers of long short-term memory (LSTM) networks included in the training network to obtain the target training model.
16. The language identification device of claim 14 or 15, wherein the third acquisition unit includes:
a second acquisition module, configured to acquire a second audio and video file for offline training;
a second decoding module, configured to decode the second audio and video file through a decoder to generate second audio and video data;
and a second filtering module, configured to filter out invalid silence segments in the second audio and video data through voice activity detection (VAD) to generate the second target audio and video data.
17. A data processing apparatus, characterized in that the apparatus comprises a processor and a memory, wherein:
the memory is used for storing program codes;
the processor is configured to execute the program code to implement the language identification method according to any one of claims 1 to 8.
18. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the language identification method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710035625.5A (CN108335693B) | 2017-01-17 | 2017-01-17 | Language identification method and language identification equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710035625.5A (CN108335693B) | 2017-01-17 | 2017-01-17 | Language identification method and language identification equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108335693A CN108335693A (en) | 2018-07-27 |
CN108335693B (en) | 2022-02-25 |
Family
ID=62921583
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710035625.5A (CN108335693B, Expired - Fee Related) | 2017-01-17 | 2017-01-17 | Language identification method and language identification equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108335693B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109192192A (en) * | 2018-08-10 | 2019-01-11 | 北京猎户星空科技有限公司 | A kind of Language Identification, device, translator, medium and equipment |
WO2020039247A1 (en) * | 2018-08-23 | 2020-02-27 | Google Llc | Automatically determining language for speech recognition of spoken utterance received via an automated assistant interface |
CN109346103B (en) * | 2018-10-30 | 2023-03-28 | 交通运输部公路科学研究所 | Audio detection method for road tunnel traffic incident |
CN111429924A (en) * | 2018-12-24 | 2020-07-17 | 同方威视技术股份有限公司 | Voice interaction method and device, robot and computer readable storage medium |
CN110033756B (en) * | 2019-04-15 | 2021-03-16 | 北京达佳互联信息技术有限公司 | Language identification method and device, electronic equipment and storage medium |
CN110148399A (en) * | 2019-05-06 | 2019-08-20 | 北京猎户星空科技有限公司 | A kind of control method of smart machine, device, equipment and medium |
CN111179910A (en) * | 2019-12-17 | 2020-05-19 | 深圳追一科技有限公司 | Speed of speech recognition method and apparatus, server, computer readable storage medium |
CN111477220B (en) * | 2020-04-15 | 2023-04-25 | 南京邮电大学 | Neural network voice recognition method and system for home spoken language environment |
CN112669816B (en) * | 2020-12-24 | 2023-06-02 | 北京有竹居网络技术有限公司 | Model training method, voice recognition method, device, medium and equipment |
CN113761885A (en) * | 2021-03-17 | 2021-12-07 | 中科天玑数据科技股份有限公司 | Bayesian LSTM-based language identification method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104427292A (en) * | 2013-08-22 | 2015-03-18 | 中兴通讯股份有限公司 | Method and device for extracting a conference summary |
CN105957531A (en) * | 2016-04-25 | 2016-09-21 | 上海交通大学 | Speech content extracting method and speech content extracting device based on cloud platform |
CN205647778U (en) * | 2016-04-01 | 2016-10-12 | 安徽听见科技有限公司 | Intelligent conference system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160035344A1 (en) * | 2014-08-04 | 2016-02-04 | Google Inc. | Identifying the language of a spoken utterance |
- 2017
  - 2017-01-17: CN application CN201710035625.5A, patent CN108335693B (en), status: not active (Expired - Fee Related)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104427292A (en) * | 2013-08-22 | 2015-03-18 | 中兴通讯股份有限公司 | Method and device for extracting a conference summary |
CN205647778U (en) * | 2016-04-01 | 2016-10-12 | 安徽听见科技有限公司 | Intelligent conference system |
CN105957531A (en) * | 2016-04-25 | 2016-09-21 | 上海交通大学 | Speech content extracting method and speech content extracting device based on cloud platform |
Also Published As
Publication number | Publication date |
---|---|
CN108335693A (en) | 2018-07-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108335693B (en) | Language identification method and language identification equipment | |
US11200811B2 (en) | Intelligent recommendation of guidance instructions | |
CN111143576A (en) | Event-oriented dynamic knowledge graph construction method and device | |
US20200057965A1 (en) | System and method for automated detection of situational awareness | |
Demertzis et al. | Extreme deep learning in biosecurity: the case of machine hearing for marine species identification | |
US20160226813A1 (en) | Smartphone indicator for conversation nonproductivity | |
US11677693B2 (en) | Systems and methods for generating dynamic conversational responses using ensemble prediction based on a plurality of machine learning models | |
US11842278B2 (en) | Object detector trained via self-supervised training on raw and unlabeled videos | |
CN110276068A (en) | Law merit analysis method and device | |
CN110705255B (en) | Method and device for detecting association relation between sentences | |
Alabbas et al. | Classification of colloquial Arabic tweets in real-time to detect high-risk floods | |
CN117337467A (en) | End-to-end speaker separation via iterative speaker embedding | |
CN109271624A (en) | A kind of target word determines method, apparatus and storage medium | |
US20200387534A1 (en) | Media selection based on content topic & sentiment | |
CN116935274A (en) | Weak supervision cross-mode video positioning method based on modal feature alignment | |
CA2940380A1 (en) | Determining the severity of a geomagnetic disturbance on an electrical network using measures of similarity | |
Singh et al. | Audio classification using grasshopper‐ride optimization algorithm‐based support vector machine | |
CN116601648A (en) | Alternative soft label generation | |
US12124486B2 (en) | Systems and methods for generating dynamic human-like conversational responses using a modular architecture featuring layered data models in non-serial arrangements with gated neural networks | |
US12088605B2 (en) | Methods and systems for cyber threat detection using artificial intelligence models in data-sparse environments | |
Nandakumar et al. | Scamblk: A voice recognition-based natural language processing approach for the detection of telecommunication fraud | |
US20190164022A1 (en) | Query analysis using deep neural net classification | |
Popovic et al. | Automatic Speech Recognition and Natural Language Understanding for Emotion Detection in Multi-party Conversations | |
CN114363664A (en) | Method and device for generating video collection title | |
CN115730064A (en) | Keyword processing method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20220225 |