
CN115240640A - Dialect voice recognition method, device, equipment and storage medium - Google Patents


Info

Publication number
CN115240640A
CN115240640A (application CN202210852125.1A)
Authority
CN
China
Prior art keywords
training
level
phone
text
dialect
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210852125.1A
Other languages
Chinese (zh)
Inventor
胡莹莹
孔常青
万根顺
潘嘉
刘聪
胡国平
胡郁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202210852125.1A priority Critical patent/CN115240640A/en
Publication of CN115240640A publication Critical patent/CN115240640A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F 40/237 Lexical tools
    • G06F 40/242 Dictionaries
    • G06F 40/263 Language identification
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/005 Language recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L 2015/0631 Creating reference templates; Clustering


Abstract

The application discloses a dialect speech recognition method, apparatus, device, and storage medium. A dialect speech recognition model is configured in advance. The model is trained with speech samples of multiple language types (including Mandarin and various dialects) as training samples, and with phone-level labeled texts carrying syntactic information, obtained by syntactically parsing and phone-level labeling the recognition texts of the training samples, as labels. Because syntactic information is introduced into the labels, the model can learn syntax-level information for each dialect, further improving recognition of each dialect; in addition, the added syntactic information improves the readability of the recognized text.

Description

Dialect voice recognition method, device, equipment and storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a dialect speech recognition method, apparatus, device, and storage medium.
Background
China covers a vast territory, and both Chinese and its minority languages have numerous dialects. Modern Chinese has many dialects distributed over wide areas, and the differences among them appear in pronunciation, vocabulary, and grammar, with pronunciation differences being especially prominent. Given this diversity, dialects place high demands on speech recognition technology.
Existing dialect speech recognition technology mainly trains acoustic and language models on target-dialect training data; the modeling unit is generally the word, with end-to-end modeling and recognition based on word-level features. However, unlike Mandarin, most dialects have only a small amount of labeled training data, while word-level end-to-end modeling relies on large amounts of data. As a result, such models perform poorly on low-resource dialects and recognition accuracy is low.
Disclosure of Invention
In view of the foregoing, the present application provides a dialect speech recognition method, apparatus, device, and storage medium to address the low recognition accuracy of existing dialect speech recognition technology.
The specific scheme is as follows:
a dialect speech recognition method, comprising:
acquiring a voice to be recognized of a target language type;
inputting the voice to be recognized into a preset dialect voice recognition model to obtain a phoneme phone-level labeled text which is output by the model and carries syntactic information;
the dialect voice recognition model is obtained by taking voice samples of various types of languages as training samples and taking phone-level labeled texts carrying syntax information after syntactic analysis and phone-level labeling of recognition texts of the training samples as labels for training; the languages comprise Mandarin and dialects of various types;
decoding the phone-level labeled text carrying the syntactic information by using a preset decoding network corresponding to the target language type to obtain a character-level labeled text carrying the syntactic information;
and carrying out text normalization on the character-level labeled text carrying the syntactic information to remove the syntactic information in the character-level labeled text to obtain a dialect voice recognition text.
Preferably, the dialect speech recognition model training process determines the labels of the training samples, and includes:
acquiring a recognition text corresponding to the training sample;
performing syntactic analysis on the recognition text, and labeling the analyzed syntactic information into the recognition text to obtain a character-level labeled text carrying the syntactic information;
determining a phone-level label corresponding to the recognition text by adopting a phone-level pronunciation dictionary matched with the language type of the training sample;
and replacing the characters in the character-level labeled text carrying the syntactic information by using the corresponding phone-level label to obtain the phone-level labeled text carrying the syntactic information.
Preferably, the process for determining the phone-level pronunciation dictionary matching each language type includes:
for the two language types of Mandarin and Guangdong dialect, the respective corresponding phone-level pronunciation dictionary is directly used;
for each language type except the Mandarin and Guangdong dialects, the phone-level pronunciation dictionary of the Mandarin is multiplexed to form the phone-level pronunciation dictionary matched with each language type.
Preferably, the multiplexing of the phone-level pronunciation dictionary of mandarin for each language type other than mandarin and cantonese to form a phone-level pronunciation dictionary matching each language type includes:
numbering the other language types respectively to obtain a number corresponding to each language type;
for any one of the remaining language types:
and adding the number corresponding to the language type to each phone in the phone-level pronunciation dictionary of the Mandarin Chinese to obtain the phone-level pronunciation dictionary matched with the language type.
Preferably, the dialect speech recognition model includes a coding module and a decoding module, where the coding module is configured to code an input speech to obtain a speech coding feature; and the decoding module is used for predicting the phone-level labeled text which carries syntactic information and corresponds to the input voice based on the voice coding characteristics.
Preferably, the coding module is obtained by training in a pre-training mode; wherein, the pre-training process of the coding module comprises the following steps:
acquiring a training data set, wherein the training data set comprises training voices of various types of languages, and the various types of languages comprise Mandarin and dialects of various types;
training the coding module by adopting a contrastive learning strategy, wherein, in the training process, the similarity between the speech coding features of each positive sample pair is maximized and the similarity between the speech coding features of each negative sample pair is minimized, until a set training end condition is reached, so as to obtain the trained coding module.
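The contrastive strategy just described can be illustrated with a minimal InfoNCE-style sketch. The function names, the cosine-similarity choice, and the temperature value are assumptions for illustration, not the application's specified formulation:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two speech-encoding feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    # InfoNCE-style objective: the loss falls as the positive pair's
    # similarity rises and as the negative pairs' similarities fall.
    pos = np.exp(cosine_sim(anchor, positive) / temperature)
    neg = sum(np.exp(cosine_sim(anchor, n) / temperature) for n in negatives)
    return -np.log(pos / (pos + neg))
```

Minimizing this loss over many sample pairs pulls positive pairs together and pushes negative pairs apart in encoding space, which is the training behavior the claim describes.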
Preferably, the acquiring the training data set comprises:
acquiring an original training data set;
adjusting the proportion of the training voice data volume of each language in the original training data set so that the proportion of the training voice data volume of the mandarin to the training voice data volume of the rest dialects in the adjusted training data set does not exceed a set ratio, and the training voice data volume is kept consistent among the dialects;
and shuffling the adjusted training data set so that the training voice distribution of each type of language in the training data set has randomness.
Preferably, the process of establishing the decoding network corresponding to the target language type includes:
and training by utilizing the phone-level pronunciation dictionary matched with the target language type and the text corpus of the target language type carrying syntactic information to obtain a decoding network corresponding to the target language type.
A dialect speech recognition apparatus comprising:
the voice to be recognized acquiring unit is used for acquiring the voice to be recognized of the target language type;
the model processing unit is used for inputting the speech to be recognized into a preset dialect speech recognition model to obtain a phonemic phone-level labeled text which is output by the model and carries syntactic information;
the dialect voice recognition model is obtained by taking voice samples of various types of languages as training samples and taking phone-level labeled texts carrying syntax information after syntactic analysis and phone-level labeling of recognition texts of the training samples as labels for training; the languages comprise Mandarin and dialects of various types;
the character decoding unit is used for decoding the phone-level labeled text carrying the syntactic information by utilizing a preset decoding network corresponding to the target language type to obtain the character-level labeled text carrying the syntactic information;
and the text normalization unit is used for performing text normalization on the character-level labeled text carrying the syntactic information so as to remove the syntactic information in the character-level labeled text and obtain a dialect voice recognition text.
A dialect speech recognition device comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the dialect speech recognition method.
A storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the dialect speech recognition method as described above.
By means of the above technical scheme, a dialect speech recognition model is configured in advance. The model is trained with speech samples of multiple language types (including Mandarin and various dialects) as training samples, and with phone-level labeled texts carrying syntactic information, obtained by syntactically parsing and phone-level labeling the recognition texts of the training samples, as labels. Compared with existing word-level end-to-end models, modeling at the phone level improves the distinguishability of the modeling units among the dialects and between the dialects and Mandarin at the pronunciation level, and reduces crosstalk among the dialects, thereby improving recognition of each dialect.
Furthermore, on the basis of phone-level modeling, syntactic information is introduced into the label annotation of the training data, so the model can learn syntax-level information for each dialect; the phone-level and syntax-level characteristics of the training data are thus fully exploited, further improving recognition of each dialect. In addition, the added syntactic information improves the readability of the recognized text.
On the basis of the trained dialect speech recognition model, the speech to be recognized of the target language type is recognized to obtain the phone-level labeled text carrying syntactic information output by the model. This text is then decoded with a preset decoding network corresponding to the target language type to obtain a character-level labeled text carrying syntactic information, and text normalization removes the syntactic information to yield the dialect speech recognition text.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a schematic flow chart of a dialect speech recognition method according to an embodiment of the present disclosure;
FIG. 2 illustrates an overall flow diagram of a dialect speech recognition method;
fig. 3 is a schematic structural diagram of a dialect speech recognition apparatus according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a dialect speech recognition device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The dialect speech recognition scheme of the present application is applicable to various scenarios requiring speech recognition, and is particularly suitable for scenarios involving the recognition of multiple dialect speeches.
In an optional embodiment, the present application may classify the dialects into seven categories according to a standard dialect-region division, namely: the northern dialect, the Wu dialect, the Xiang dialect, the Gan (Jiangxi) dialect, the Hakka dialect, the Guangdong (Cantonese) dialect, and the Min dialect. The detailed division is as follows:
class one, "northern dialect". It is a common basic dialect of Chinese nationality, and the using population accounts for 70 percent of the total population of Chinese nationality. The "northern dialects" are divided into four dialects, namely northeast, northwest, southwest and Jianghuai.
Class two, the "Wu dialect". Distributed in Shanghai, the areas of Jiangsu south of the Yangtze River, and most of Zhejiang, represented by the Suzhou dialect.
Class three, the "Xiang dialect". Distributed in most areas of Hunan province, represented by the Changsha dialect.
Class four, the "Gan (Jiangxi) dialect". Distributed in most areas of Jiangxi province, represented by the Nanchang dialect.
Class five, the "Hakka dialect". Mainly distributed in eastern Guangdong, western Fujian, southeastern Jiangxi, and southern Guangxi, represented by the dialect of Mei county, Guangdong province.
Class six, the "Min dialect". Spans several provinces, covering most of Fujian, parts of eastern Guangdong, southern Zhejiang, and most Han residential areas of Taiwan province.
Class seven, the "Guangdong dialect" (Cantonese). Mainly distributed in central Guangdong, southern and eastern Guangxi, and the Hong Kong and Macao regions, represented by the Guangzhou dialect.
The Guangdong dialect has a standard pronunciation system, and its corresponding phone-level pronunciation dictionary comprises 88 phones.
Of course, the above is only an optional dialect division manner, and other dialect division manners may also be adopted in the present application to classify dialects.
The scheme can be realized based on a terminal with data processing capacity, and the terminal can be a mobile phone, a computer, a server, a cloud terminal and the like.
Next, described with reference to fig. 1, the dialect speech recognition method of the present application may include the following steps:
and S100, acquiring the voice to be recognized of the target language type.
The target language type may be Mandarin or dialects of various types. The method and the device can determine the target language type of the obtained speech to be recognized.
And S110, inputting the speech to be recognized into a preset dialect speech recognition model to obtain a phonemic phone-level labeled text which is output by the model and carries syntactic information.
The dialect speech recognition model is obtained by training a speech sample of each type of language (which may include Mandarin and each type of dialect) as a training sample and a phone-level labeled text carrying syntax information after syntactic analysis and phone-level labeling of a recognition text of the training sample as a label.
In the embodiment, the phone-level modeling unit is adopted during model modeling, and compared with the character-level modeling unit, the method and the device have the advantages that the distinguishability of the modeling units among dialects of various types and between the dialects and the mandarin is improved from the pronunciation level, the crosstalk degree among the dialects is reduced, and accordingly the recognition effect of the dialects is improved. Meanwhile, the phone-level modeling unit may have lower requirements on the amount of training data than the character-level modeling unit.
On this basis, the method further introduces syntax information into the label annotation: after the recognition text of the training sample is syntactically parsed, the resulting syntax information is added to the label, so the model can learn syntax-level information for each dialect, further improving recognition of each dialect. In addition, the added syntax information improves the readability of the recognized text.
The process of parsing the recognition text of the training sample may label the part-of-speech and syntactic information of the recognition text; for example, the recognition text may be parsed with a parsing tool such as the Stanford Parser.
Further, the process of phone-level labeling the recognized text may be to perform phone-level labeling by using a corresponding phone-level pronunciation dictionary according to the language type of the training sample.
And step S120, decoding the phone-level labeled text carrying the syntactic information by using a preset decoding network corresponding to the target language type to obtain a character-level labeled text carrying the syntactic information.
The decoding network corresponding to the target language type can be obtained by training with the phone-level pronunciation dictionary matching the target language type and text corpora of the target language type carrying syntactic information. For Cantonese, the corresponding decoding network can be trained with the Cantonese phone-level pronunciation dictionary and Cantonese text corpora carrying syntactic information; for Mandarin and the dialects other than Cantonese, it can be trained with the matching phone-level pronunciation dictionary and the corresponding dialect or Mandarin text corpora carrying syntactic information.
And decoding the result output by the model in the last step by using a decoding network, decoding the phone-level label into a character-level label, and further obtaining a character-level label text carrying syntactic information.
And S130, carrying out text normalization on the character-level labeled text carrying the syntactic information to remove the syntactic information in the character-level labeled text to obtain a dialect voice recognition text.
Specifically, the user generally does not need the syntax information, so the character-level labeled text carrying syntax information obtained in the previous step may undergo text normalization to remove the syntax information it contains, yielding the dialect speech recognition text.
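The three steps S110 to S130 can be sketched as a short pipeline. All names below (recognize_dialect, strip_syntax_tags, the model and decoders callables) are hypothetical stand-ins for illustration, not the application's actual implementation:

```python
import re

def strip_syntax_tags(text):
    # Step S130: remove the <...> syntax markers, then collapse whitespace.
    return re.sub(r"\s+", " ", re.sub(r"</?[^>]+>", " ", text)).strip()

def recognize_dialect(audio, language_type, model, decoders):
    # Step S110: the model emits phone-level labeled text with syntax tags.
    phone_text = model(audio)
    # Step S120: the decoding network for this language maps phones to characters.
    char_text = decoders[language_type](phone_text)
    # Step S130: text normalization strips the syntax information.
    return strip_syntax_tags(char_text)
```

For example, with stub callables standing in for the model and the Wu-dialect decoding network, `recognize_dialect(audio, "wu", model, decoders)` would return the bare character text with all syntax tags removed.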
In the dialect speech recognition method provided by this embodiment of the application, a dialect speech recognition model is configured in advance. The model is trained with speech samples of multiple language types (including Mandarin and various dialects) as training samples and with phone-level labeled texts carrying syntactic information, obtained by syntactically parsing and phone-level labeling the recognition texts of the training samples, as labels. Compared with existing word-level end-to-end models, modeling at the phone level improves the distinguishability of the modeling units among the dialects and between the dialects and Mandarin at the pronunciation level, reduces crosstalk among the dialects, and thereby improves recognition of each dialect.
Furthermore, on the basis of phone-level modeling, syntax information is introduced into the labels of the training data, so the model can learn syntax-level information for each dialect; the phone-level and syntax-level characteristics of the training data are thus fully exploited, further improving recognition of each dialect. In addition, the added syntax information improves the readability of the recognized text.
In some embodiments of the present application, a training process for the dialect speech recognition model described above is described.
The training data of the model comprises voice samples of various languages as training samples, and phone-level labeled texts carrying syntactic information after syntactic analysis and phone-level labeling of the recognition texts of the training samples as sample labels.
First, the process of determining a sample label is introduced:
s1, obtaining a recognition text corresponding to a training sample.
And S2, carrying out syntactic analysis on the recognition text, and marking the analyzed syntactic information into the recognition text to obtain a character-level marked text carrying the syntactic information.
For example, the recognized text "Dajiahao" (大家好, "hello everyone") is parsed, and the resulting syntax tree structure is as follows:
(Syntax tree figure, rendered as an image in the original document.)
The labeling result after annotating the recognized text with the syntax information may be represented as:
{ <IP> <NP> <PN> 大家 <PN> <NP> <VP> <VA> 好 <VA> <VP> <IP> }
For ease of understanding, the present application introduces several syntax information representations, as follows:
IP denotes a simple clause, PP a preposition phrase, P a preposition, LCP a localizer phrase, NP a noun phrase, NN a common noun, LC a localizer, PN a pronoun, AD an adverb, VV a verb, VP a verb phrase, and ADVP an adverb phrase.
Taking the above-described example of the syntax information expression manner as an example, there are 12 syntax information corresponding modeling units in total.
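Assuming these are Penn-Chinese-Treebank-style labels, the 12 syntax-tag modeling units can be collected in a small mapping (the English glosses are illustrative, following the list above):

```python
# The 12 syntax-information modeling units named in the description.
SYNTAX_TAGS = {
    "IP": "simple clause", "PP": "preposition phrase", "P": "preposition",
    "LCP": "localizer phrase", "NP": "noun phrase", "NN": "common noun",
    "LC": "localizer", "PN": "pronoun", "AD": "adverb",
    "VV": "verb", "VP": "verb phrase", "ADVP": "adverb phrase",
}
# Exactly 12 syntax modeling units, as the description states.
assert len(SYNTAX_TAGS) == 12
```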
And S3, determining a phone-level label corresponding to the recognized text by adopting a phone-level pronunciation dictionary matched with the language type of the training sample.
Wherein, for the two language types of Mandarin and Guangdong dialect, the corresponding phone-level pronunciation dictionary is directly used.
For other dialects, the phone-level pronunciation dictionary of Mandarin can be multiplexed to form phone-level pronunciation dictionaries matched with respective language types.
Mandarin has two pronunciation systems: the main-vowel system and the initial-final system. The main-vowel system comprises 81 phones; the initial-final system comprises 177 phones and is the pinyin annotation familiar to most people. Taking "Dajiahao" as an example, the pronunciation is labeled as { d a j ia1 h ao3 } in the initial-final pronunciation dictionary and as { d a j y a1 h aw3 } in the main-vowel pronunciation dictionary. In the present application, the Mandarin pronunciation dictionary may follow either system; the following description takes the initial-final system as an example.
For other dialects except for the mandarin chinese and the cantonese dialects, the process of multiplexing the phone-level pronunciation dictionary of the mandarin chinese and forming the phone-level pronunciation dictionary matched with the respective language types can comprise the following steps:
and S31, numbering the rest language types respectively to obtain the number corresponding to each language type.
S32, for any one of the rest language types:
and adding the number corresponding to the language type to each phone in the phone-level pronunciation dictionary of the Mandarin Chinese to obtain the phone-level pronunciation dictionary matched with the language type.
Taking "Dajiahao" as an example: if the labeling result under the Mandarin phone-level pronunciation dictionary is { d a j ia1 h ao3 }, then, for the Nth dialect class, the labeling result under the matching phone-level pronunciation dictionary is { d_N a_N j_N ia1_N h_N ao3_N }.
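Steps S31 and S32 can be sketched as follows; the dictionary layout (word to phone list) and the sample entries are assumptions for illustration:

```python
def multiplex_dictionary(mandarin_dict, dialect_number):
    # Reuse the Mandarin phone-level pronunciation dictionary for a dialect
    # by suffixing every phone with the dialect's number (step S32).
    return {
        word: [f"{phone}_{dialect_number}" for phone in phones]
        for word, phones in mandarin_dict.items()
    }

# Illustrative Mandarin initial-final entries (not the full 177-phone dictionary).
mandarin = {"大家": ["d", "a4", "j", "ia1"], "好": ["h", "ao3"]}
northern = multiplex_dictionary(mandarin, 1)   # class-1 "northern dialect"
```

After multiplexing, the class-1 entry for 好 becomes ["h_1", "ao3_1"], matching the labeling pattern shown above.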
And S4, replacing the characters in the character-level labeled text carrying the syntactic information by using the corresponding phone-level label to obtain the phone-level labeled text carrying the syntactic information.
Specifically, taking the character-level labeled text carrying syntax information obtained in step S2 above, { <IP> <NP> <PN> 大家 <PN> <NP> <VP> <VA> 好 <VA> <VP> <IP> }, as an example, the results after phone-level replacement when the training sample is Mandarin or one of the 7 dialects are as follows:
class 0 mandarin: { < sos > < IP > < NP > < PN > d a j ia1< PN > < NP > < VP > < VA > h ao3< VA > < VP > < IP > < eos > }
Class 1, northern dialect: { <sos> <IP> <NP> <PN> d_1 a4_1 j_1 ia1_1 <PN> <NP> <VP> <VA> h_1 ao3_1 <VA> <VP> <IP> <eos> }
Class 2, Wu dialect: { <sos> <IP> <NP> <PN> d_2 a4_2 j_2 ia1_2 <PN> <NP> <VP> <VA> h_2 ao3_2 <VA> <VP> <IP> <eos> }
Class 7, Guangdong dialect: { <sos> <IP> <NP> <PN> d_7 aa6_7 g_7 aa1_7 <PN> <NP> <VP> <VA> h_7 ou2_7 <VA> <VP> <IP> <eos> }
The Guangdong dialect adopts the phone-level pronunciation dictionary corresponding to the Guangdong dialect to carry out phone-level labeling, and the phone-level pronunciation dictionary of Mandarin is not required to be multiplexed.
Wherein <sos> and <eos> are the start and stop symbols set for the label.
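Step S4, replacing each character with its phone-level labels while passing the syntax tags through unchanged, can be sketched as follows (the token layout and the sample pronunciation dictionary are illustrative assumptions):

```python
def phone_level_label(char_tokens, pron_dict):
    # Build the phone-level labeled text carrying syntax information:
    # syntax tags (<...>) are kept as-is, characters are replaced by
    # their phone sequence, and <sos>/<eos> bracket the label.
    out = ["<sos>"]
    for tok in char_tokens:
        if tok.startswith("<"):
            out.append(tok)            # syntax tag passes through
        else:
            out.extend(pron_dict[tok])  # character -> phone labels
    out.append("<eos>")
    return out

# Illustrative entries for 大家好 under the Mandarin initial-final dictionary.
pron = {"大家": ["d", "a4", "j", "ia1"], "好": ["h", "ao3"]}
tokens = ["<IP>", "<NP>", "<PN>", "大家", "<PN>", "<NP>",
          "<VP>", "<VA>", "好", "<VA>", "<VP>", "<IP>"]
label = phone_level_label(tokens, pron)
```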
As described above, the Cantonese phone-level pronunciation dictionary comprises 88 phones and the Mandarin initial-final dictionary comprises 177 phones. Of the 7 dialects, the 6 other than Cantonese each multiplex the 177-phone Mandarin dictionary; there are 12 syntax-information modeling units, one silence unit sil, and the two start and stop symbols <sos> and <eos>. The total number of modeling units used in the dialect speech recognition model is therefore: 88 + 177 + 177 × 6 + 12 + 1 + 2 = 1342.
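The modeling-unit tally above can be checked directly:

```python
# Modeling-unit count as tallied in the description.
cantonese_phones = 88            # Cantonese phone-level dictionary
mandarin_phones = 177            # Mandarin initial-final dictionary
other_dialects = 6               # six dialects multiplexing the Mandarin set
syntax_units = 12                # syntax-information modeling units
silence = 1                      # the silence unit sil
boundary = 2                     # <sos> and <eos>
total = (cantonese_phones + mandarin_phones
         + other_dialects * mandarin_phones
         + syntax_units + silence + boundary)
assert total == 1342
```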
In some embodiments of the present application, the dialect speech recognition model may include two parts, an encoding module and a decoding module.
The encoding module is used for encoding input voice to obtain voice encoding characteristics; and the decoding module is used for predicting the phone-level labeled text which carries syntactic information and corresponds to the input voice based on the voice coding characteristics.
It will be appreciated that Mandarin training data is relatively easy to acquire, and very large amounts of supervised training data can be accumulated for it; for dialects, however, the amount of supervised training data is much smaller. This embodiment therefore provides a pre-training scheme for the coding module: the coding module is first pre-trained using unsupervised training data, and can then be fine-tuned using supervised training data.
Next, the coding module pre-training process is explained:
S1, acquiring a training data set.
The training data set comprises training voices of various types of languages, wherein the various types of languages comprise Mandarin and dialects of various types.
In an alternative manner, the process of obtaining the training data set may include:
and S11, acquiring an original training data set.
S12, adjusting the proportions of the training voice data volumes of the languages in the original training data set, so that in the adjusted training data set the ratio of the Mandarin training voice data volume to that of each remaining dialect does not exceed a set ratio, and the training voice data volumes are kept consistent among the dialects.
For example, the ratio of the Mandarin training voice data volume to that of each dialect may be capped at 2:1, which guarantees data balance across the languages in the training data set while still making full use of the Mandarin training voice data.
The specific adjustment may include enhancement processing of each dialect's training voice, such as noise addition, speed perturbation, and time-frequency masking (TF-mask), so as to expand the data volume of the dialect training voice.
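The three enhancement operations could be sketched as follows; the parameter values (SNR, speed factor, mask widths) are illustrative assumptions, not values fixed by the application:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(wave, snr_db=20.0):
    """Additive white noise at a given signal-to-noise ratio."""
    noise = rng.standard_normal(wave.shape)
    scale = np.sqrt((wave**2).mean() / (10**(snr_db / 10) * (noise**2).mean() + 1e-12))
    return wave + scale * noise

def change_speed(wave, factor=1.1):
    """Crude speed perturbation by linear resampling of the waveform."""
    idx = np.arange(0, len(wave), factor)
    return np.interp(idx, np.arange(len(wave)), wave)

def tf_mask(spec, max_t=10, max_f=5):
    """TF-mask: zero one time band and one frequency band of a (T, F) spectrogram."""
    spec = spec.copy()
    t0 = rng.integers(0, spec.shape[0] - max_t)
    spec[t0:t0 + max_t, :] = 0.0
    f0 = rng.integers(0, spec.shape[1] - max_f)
    spec[:, f0:f0 + max_f] = 0.0
    return spec

print(int((tf_mask(np.ones((50, 20))) == 0).sum()))  # 400 masked cells
```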
S13, randomly shuffling the adjusted training data set so that the training voice distribution of the various languages in the training data set has randomness.
Specifically, to ensure that training voice data of every language can be drawn as training data when the coding module is pre-trained based on the training data set, this step randomly shuffles the adjusted training data set to disorder the per-language ordering of the training voices, so that the training voice distribution of each language has randomness.
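Steps S11-S13 amount to capping the Mandarin share, equalizing the dialects, and shuffling. A minimal sketch, under the assumed convention that language id 0 stands for Mandarin:

```python
import random
from collections import Counter

def build_balanced_set(utts_by_lang, mandarin_cap=2, seed=0):
    """Sketch of S11-S13: cap the Mandarin-to-dialect ratio (here 2:1),
    keep the dialects mutually balanced, then shuffle the whole set.
    `utts_by_lang` maps a language id (0 = Mandarin) to its utterance list."""
    dialect_sizes = [len(v) for k, v in utts_by_lang.items() if k != 0]
    per_dialect = min(dialect_sizes)                 # keep dialects consistent
    data = []
    for lang, utts in utts_by_lang.items():
        keep = mandarin_cap * per_dialect if lang == 0 else per_dialect
        data += [(lang, u) for u in utts[:keep]]
    random.Random(seed).shuffle(data)                # S13: randomize ordering
    return data

data = build_balanced_set({0: list(range(100)), 1: list(range(12)), 2: list(range(10))})
print(sorted(Counter(lang for lang, _ in data).items()))  # [(0, 20), (1, 10), (2, 10)]
```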
S2, forming positive sample pairs from training voices of the same language type and negative sample pairs from training voices of different language types, and training the coding module by adopting a contrastive learning strategy.
Specifically, in the training process, the similarity between the speech coding features of the positive example sample pair is maximized, and the similarity between the speech coding features of the negative example sample pair is minimized, until a set training end condition is reached, so as to obtain the trained coding module.
In this embodiment, the coding module is trained with a contrastive learning strategy, so that Mandarin resources can be fully utilized to optimize recognition of low-resource dialects, improving the recognition effect on low-resource dialects while reducing the training data cost.
Meanwhile, the coding module can learn the difference between different language types.
Wherein the training loss function may be as follows:

L = -(1/N) Σ_{i=1}^{N} log [ Σ_{j≠i} 1(lid_j = lid_i) exp(sim(z_i, z_j)/τ) / Σ_{k} 1(lid_k ≠ lid_i) exp(sim(z_i, z_k)/τ) ]

where N represents the number of training voices in the training data set (if training proceeds in units of batches, N may instead be the number of training voices contained in one batch); z_i, z_j and z_k respectively represent the coding features of the i-th, j-th and k-th training voices after encoding by the coding module; sim() represents the similarity calculation; lid_i denotes the language type of the i-th training voice; the indicator 1(lid_j = lid_i) takes the value 1 when the i-th and j-th training voices belong to the same language type and 0 otherwise; the indicator 1(lid_k ≠ lid_i) takes the value 1 when the k-th training voice belongs to a different language type from the i-th and 0 otherwise; and τ is a set parameter.
It can be seen from the above formula that the numerator only accumulates similarities over positive sample pairs and the denominator only over negative sample pairs: the larger the similarity within positive sample pairs and the smaller the similarity within negative sample pairs, the smaller the overall loss.
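A direct NumPy rendering of this loss, under the common assumption that sim() is cosine similarity on L2-normalized coding features, might look like:

```python
import numpy as np

def contrastive_loss(z, lid, tau=0.1):
    """Sketch of the loss above. z is an (N, D) matrix of coding features,
    lid is an (N,) array of language ids; positives share a language id,
    negatives differ, and self-pairs are excluded from the numerator."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = np.exp((z @ z.T) / tau)           # pairwise exp(sim(z_i, z_j) / tau)
    same = lid[:, None] == lid[None, :]
    np.fill_diagonal(same, False)           # exclude i == j self-pairs
    diff = lid[:, None] != lid[None, :]
    pos = (sim * same).sum(axis=1)          # numerator: positive pairs only
    neg = (sim * diff).sum(axis=1)          # denominator: negative pairs only
    return float(-np.mean(np.log(pos / neg)))

z = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
well_separated = contrastive_loss(z, np.array([0, 0, 1, 1]))  # coherent clusters
mixed = contrastive_loss(z, np.array([0, 1, 0, 1]))           # mismatched labels
print(well_separated < mixed)  # True
```

As the last two calls show, features that cluster by language id yield a smaller loss than mismatched ones, which is exactly the behavior the training objective rewards.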
This embodiment provides a pre-training scheme based on contrastive learning over Mandarin and the dialects, which improves the coding module's acoustic-level feature characterization of Mandarin and the different types of dialects, thereby achieving the aim of using Mandarin data to improve the recognition effect on each dialect under low-resource dialect conditions.
Next, an alternative overall flow diagram of the dialect speech recognition method of the present application will be described with reference to fig. 2.
The dialect speech recognition model includes an encoding module (encoder) and a decoding module (decoder).
In the present application, the coding module may be pre-trained with a contrastive learning strategy based on dialect speech training data and Mandarin speech training data, and the pre-trained coding module together with the decoding module forms the dialect speech recognition model.
In the training process of the dialect speech recognition model, speech samples of various languages serve as training samples, and phone-level labeled texts carrying syntactic information, obtained by syntactic analysis and phone-level labeling of each sample's recognition text, serve as sample labels. The process of determining a sample label is shown in fig. 2: syntactic analysis of the recognition text first yields a word-level labeled text with syntactic information.
For each type of dialect except for the cantonese, the phone-level pronunciation dictionary of the mandarin can be multiplexed, and the cantonese and the mandarin can directly use the respective corresponding phone-level pronunciation dictionaries.
And according to the language type to which the voice sample corresponding to the recognized text belongs, performing phone-level labeling on the recognized text by adopting a phone-level pronunciation dictionary of the language type, and replacing characters in the word-level labeled text with the syntactic information by using the phone-level labeling to obtain the phone-level labeled text with the syntactic information.
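The tag-preserving substitution described here can be sketched as below; the tiny lexicon and token list are illustrative, mirroring the "大家好" example above:

```python
def phone_level_label(word_level_tokens, lexicon):
    """Sketch of the label-building step: keep syntax tags (<IP>, <NP>, ...)
    and boundary symbols as-is, and expand each character through the
    phone-level pronunciation dictionary matched to the sample's language."""
    out = []
    for tok in word_level_tokens:
        if tok.startswith("<"):        # syntax tag or <sos>/<eos>: keep as-is
            out.append(tok)
        else:
            out.extend(lexicon[tok])   # character -> its phone sequence
    return out

lexicon = {"大": ["d", "a4"], "家": ["j", "ia1"], "好": ["h", "ao3"]}
tokens = ["<sos>", "<IP>", "<NP>", "<PN>", "大", "家", "</PN>", "</NP>",
          "<VP>", "<VA>", "好", "</VA>", "</VP>", "</IP>", "<eos>"]
print(phone_level_label(tokens, lexicon))
```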
And training a dialect speech recognition model consisting of a coding module and a decoding module by taking the speech samples corresponding to the recognized text and the phone-level labeled text with syntactic information as training data, wherein a cross entropy loss function and the like can be adopted in the training process.
After the dialect speech recognition model is obtained through training, the speech to be recognized of the target language type is input into the model to obtain the phone-level labeled text carrying syntactic information output after decoding. This phone-level labeled text carrying syntactic information is then input into the decoding network corresponding to the target language type for decoding, yielding a decoded character-level labeled text carrying syntactic information, and the character-level recognition text is obtained by text normalization, which removes the syntactic information.
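End to end, the inference flow reduces to three stages. In this sketch the model, the language-specific decoding network, and text normalization are stubbed out with placeholder callables, so only the data flow is real:

```python
def recognize(speech, model, decoders, lang_id):
    """Sketch of the recognition flow above: model -> phone-level text with
    syntax info, language-specific decoding network -> character-level text
    with syntax info, then normalization strips the syntax tags."""
    phone_text = model(speech)                  # phone-level labeled text
    char_text = decoders[lang_id](phone_text)   # character-level labeled text
    return [t for t in char_text if not t.startswith("<")]  # normalization

# Placeholder stand-ins for a trained model and a Wu-dialect (class 2) decoder.
model = lambda speech: ["<sos>", "<PN>", "d_2", "a4_2", "</PN>", "<eos>"]
decoders = {2: lambda phones: ["<sos>", "<PN>", "大", "</PN>", "<eos>"]}
print(recognize(None, model, decoders, 2))  # ['大']
```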
The dialect speech recognition device provided in the embodiment of the present application is described below, and the dialect speech recognition device described below and the dialect speech recognition method described above may be referred to in correspondence with each other.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a dialect speech recognition apparatus disclosed in the embodiment of the present application.
As shown in fig. 3, the apparatus may include:
a to-be-recognized voice acquiring unit 11, configured to acquire a to-be-recognized voice of a target language type;
the model processing unit 12 is configured to input the speech to be recognized to a preset dialect speech recognition model, and obtain a phonemic phone-level labeled text which is output by the model and carries syntactic information;
the dialect voice recognition model is obtained by taking voice samples of various languages as training samples and taking phone-level labeled texts carrying syntactic information, which are obtained by syntactic analysis and phone-level labeling of recognition texts of the training samples, as labels for training; the languages comprise Mandarin and dialects of various types;
the character decoding unit 13 is configured to decode the phone-level labeled text carrying the syntax information by using a preset decoding network corresponding to the target language type, so as to obtain a character-level labeled text carrying the syntax information;
and the text normalization unit 14 is configured to perform text normalization on the character-level labeled text carrying the syntax information to remove the syntax information therein, so as to obtain a dialect speech recognition text.
Optionally, the apparatus of the present application further includes a model training unit, configured to train to obtain a dialect speech recognition model, where the determination process of the label of the training sample in the dialect speech recognition model training process may include:
acquiring a recognition text corresponding to the training sample;
carrying out syntactic analysis on the recognition text, and marking the analyzed syntactic information into the recognition text to obtain a character-level marked text carrying the syntactic information;
determining a phone-level label corresponding to the recognition text by adopting a phone-level pronunciation dictionary matched with the language type of the training sample;
and replacing the characters in the character-level labeled text carrying the syntactic information by using the corresponding phone-level label to obtain the phone-level labeled text carrying the syntactic information.
Optionally, the process of determining the phone-level pronunciation dictionary matched with each language type by the model training unit may include:
for the two language types of Mandarin and Guangdong dialects, the corresponding phone-level pronunciation dictionary is directly used;
for all the language types except the common Chinese and Guangdong dialects, the phone-level pronunciation dictionary of the common Chinese is multiplexed to form the phone-level pronunciation dictionary matched with the respective language types.
Optionally, the process of the model training unit multiplexing the phone-level pronunciation dictionary of mandarin for the remaining language types except mandarin and cantonese, and forming the phone-level pronunciation dictionary matched with the respective language types may include:
numbering the other language types respectively to obtain a number corresponding to each language type;
for any one of the remaining language types:
and adding the number corresponding to the language type to each phone in the phone-level pronunciation dictionary of the Mandarin Chinese to obtain the phone-level pronunciation dictionary matched with the language type.
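The suffixing step can be sketched as a one-line dictionary transform; the sample Mandarin lexicon entries are illustrative:

```python
def build_dialect_lexicon(mandarin_lexicon, lang_id):
    """Sketch of the multiplexing step: reuse the Mandarin phone-level
    pronunciation dictionary and suffix every phone with the dialect's
    number, e.g. 'ao3' -> 'ao3_2' for the Wu dialect (class 2)."""
    return {word: [f"{p}_{lang_id}" for p in phones]
            for word, phones in mandarin_lexicon.items()}

mandarin = {"大家": ["d", "a4", "j", "ia1"], "好": ["h", "ao3"]}
wu = build_dialect_lexicon(mandarin, 2)
print(wu["好"])  # ['h_2', 'ao3_2']
```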
Optionally, the dialect speech recognition model may include an encoding module and a decoding module, where the encoding module is configured to encode an input speech to obtain a speech encoding feature; and the decoding module is used for predicting the phone-level labeled text which carries syntactic information and corresponds to the input voice based on the voice coding characteristics.
Optionally, the apparatus of the present application may further include a pre-training unit, configured to pre-train the coding module, where the pre-training process of the coding module may include:
acquiring a training data set, wherein the training data set comprises training voices of various types of languages, and the various types of languages comprise Mandarin and dialects of various types;
training the coding module by adopting a contrastive learning strategy, wherein in the training process, the similarity between the speech coding features of each positive sample pair is maximized, and the similarity between the speech coding features of each negative sample pair is minimized, until a set training end condition is reached, so as to obtain the trained coding module.
Optionally, the process of acquiring the training data set by the pre-training unit may include:
acquiring an original training data set;
adjusting the proportions of the training voice data volumes of the languages in the original training data set, so that in the adjusted training data set the ratio of the Mandarin training voice data volume to that of each remaining dialect does not exceed a set ratio, and the training voice data volumes are kept consistent among the dialects;
and randomly shuffling the adjusted training data set so that the training voice distribution of each type of language in the training data set has randomness.
Optionally, the apparatus of the present application may further include a decoding network establishing unit, configured to establish a decoding network corresponding to the target language type, where the process may include:
and training by utilizing the phone-level pronunciation dictionary matched with the target language type and the text corpus of the target language type carrying syntactic information to obtain a decoding network corresponding to the target language type.
The dialect speech recognition device provided by the embodiment of the application can be applied to dialect speech recognition equipment, such as a terminal: mobile phones, computers, etc. Alternatively, fig. 4 is a block diagram illustrating a hardware structure of the dialect speech recognition device, and referring to fig. 4, the hardware structure of the dialect speech recognition device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete mutual communication through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement an embodiment of the present invention;
the memory 3 may include a high-speed RAM memory, and may further include a non-volatile memory, for example at least one disk memory;
wherein the memory stores a program and the processor can call the program stored in the memory, the program for:
acquiring a voice to be recognized of a target language type;
inputting the voice to be recognized into a preset dialect voice recognition model to obtain a phoneme phone-level labeled text which is output by the model and carries syntactic information;
the dialect voice recognition model is obtained by taking voice samples of various types of languages as training samples and taking phone-level labeled texts carrying syntax information after syntactic analysis and phone-level labeling of recognition texts of the training samples as labels for training; the languages of various types comprise Mandarin and dialects of various types;
decoding the phone-level labeled text carrying the syntactic information by using a preset decoding network corresponding to the target language type to obtain a character-level labeled text carrying the syntactic information;
and carrying out text normalization on the character-level labeled text carrying the syntactic information to remove the syntactic information in the character-level labeled text to obtain a dialect voice recognition text.
Alternatively, the detailed function and the extended function of the program may be as described above.
Embodiments of the present application further provide a storage medium, where a program suitable for execution by a processor may be stored, where the program is configured to:
acquiring a voice to be recognized of a target language type;
inputting the speech to be recognized into a preset dialect speech recognition model to obtain a phoneme phone-level labeled text which is output by the model and carries syntactic information;
the dialect voice recognition model is obtained by taking voice samples of various types of languages as training samples and taking phone-level labeled texts carrying syntax information after syntactic analysis and phone-level labeling of recognition texts of the training samples as labels for training; the languages comprise Mandarin and dialects of various types;
decoding the phone-level labeled text carrying the syntactic information by using a preset decoding network corresponding to the target language type to obtain a character-level labeled text carrying the syntactic information;
and carrying out text normalization on the character-level labeled text carrying the syntactic information so as to remove the syntactic information in the character-level labeled text and obtain a dialect speech recognition text.
Alternatively, the detailed function and the extended function of the program may be as described above.
Finally, it should also be noted that, in this document, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, the embodiments may be combined as needed, and the same and similar parts may be referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. A dialect speech recognition method, comprising:
acquiring a voice to be recognized of a target language type;
inputting the speech to be recognized into a preset dialect speech recognition model to obtain a phoneme phone-level labeled text which is output by the model and carries syntactic information;
the dialect voice recognition model is obtained by taking voice samples of various types of languages as training samples and taking phone-level labeled texts carrying syntax information after syntactic analysis and phone-level labeling of recognition texts of the training samples as labels for training; the languages comprise Mandarin and dialects of various types;
decoding the phone-level labeled text carrying the syntactic information by using a preset decoding network corresponding to the target language type to obtain a character-level labeled text carrying the syntactic information;
and carrying out text normalization on the character-level labeled text carrying the syntactic information to remove the syntactic information in the character-level labeled text to obtain a dialect voice recognition text.
2. The method of claim 1, wherein the dialect speech recognition model training process trains a determination process of labels of samples, comprising:
acquiring a recognition text corresponding to the training sample;
carrying out syntactic analysis on the recognition text, and marking the analyzed syntactic information into the recognition text to obtain a character-level marked text carrying the syntactic information;
adopting a phone-level pronunciation dictionary matched with the language type of the training sample to determine a phone-level label corresponding to the recognition text;
and replacing the characters in the character-level labeled text carrying the syntactic information by using the corresponding phone-level label to obtain the phone-level labeled text carrying the syntactic information.
3. The method of claim 2, wherein the determining of the phone-level pronunciation dictionary for each language type match comprises:
for the two language types of Mandarin and Guangdong dialects, the corresponding phone-level pronunciation dictionary is directly used;
for each language type except the Mandarin and Guangdong dialects, the phone-level pronunciation dictionary of the Mandarin is multiplexed to form the phone-level pronunciation dictionary matched with each language type.
4. The method as claimed in claim 3, wherein the multiplexing of the phone-level pronunciation dictionary of Mandarin for each language type other than Mandarin and Guangdong dialects to form a phone-level pronunciation dictionary matching respective language types comprises:
numbering the other language types respectively to obtain a number corresponding to each language type;
for any one of the remaining language types:
and adding the number corresponding to the language type to each phone in the phone-level pronunciation dictionary of the Mandarin Chinese to obtain the phone-level pronunciation dictionary matched with the language type.
5. The method of claim 1, wherein the dialect speech recognition model comprises an encoding module and a decoding module, and the encoding module is configured to encode the input speech to obtain the speech coding features; and the decoding module is used for predicting the phone-level labeled text which carries syntactic information and corresponds to the input voice based on the voice coding characteristics.
6. The method of claim 5, wherein the coding module is obtained by training in a pre-training manner; wherein, the pre-training process of the coding module comprises the following steps:
acquiring a training data set, wherein the training data set comprises training voices of various types of languages, and the various types of languages comprise Mandarin and dialects of various types;
training the coding module by adopting a contrastive learning strategy, wherein in the training process, the similarity between the speech coding features of each positive sample pair is maximized, and the similarity between the speech coding features of each negative sample pair is minimized until a set training end condition is reached, so as to obtain the trained coding module.
7. The method of claim 6, wherein the obtaining a training data set comprises:
acquiring an original training data set;
adjusting the proportions of the training voice data volumes of the languages in the original training data set, so that in the adjusted training data set the ratio of the Mandarin training voice data volume to that of each remaining dialect does not exceed a set ratio, and the training voice data volumes are kept consistent among the dialects;
and randomly shuffling the adjusted training data set so that the training voice distribution of each type of language in the training data set has randomness.
8. The method according to any one of claims 1 to 7, wherein the establishing process of the decoding network corresponding to the target language type comprises:
and training by utilizing the phone-level pronunciation dictionary matched with the target language type and the text corpus of the target language type carrying syntactic information to obtain a decoding network corresponding to the target language type.
9. A dialect speech recognition apparatus, comprising:
the voice to be recognized acquiring unit is used for acquiring the voice to be recognized of the target language type;
the model processing unit is used for inputting the speech to be recognized into a preset dialect speech recognition model to obtain a phonemic phone-level labeled text which is output by the model and carries syntactic information;
the dialect voice recognition model is obtained by taking voice samples of various types of languages as training samples and taking phone-level labeled texts carrying syntax information after syntactic analysis and phone-level labeling of recognition texts of the training samples as labels for training; the languages comprise Mandarin and dialects of various types;
the character decoding unit is used for decoding the phone-level labeled text carrying the syntactic information by utilizing a preset decoding network corresponding to the target language type to obtain a character-level labeled text carrying the syntactic information;
and the text normalization unit is used for performing text normalization on the character-level labeled text carrying the syntactic information so as to remove the syntactic information in the character-level labeled text and obtain a dialect voice recognition text.
10. A dialect speech recognition device, comprising: a memory and a processor;
the memory is used for storing programs;
the processor, configured to execute the program, and implement the steps of the dialect speech recognition method according to any one of claims 1 to 8.
11. A storage medium having stored thereon a computer program, characterized in that the computer program, when being executed by a processor, carries out the steps of the dialect speech recognition method according to any one of claims 1 to 8.
CN202210852125.1A 2022-07-20 2022-07-20 Dialect voice recognition method, device, equipment and storage medium Pending CN115240640A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210852125.1A CN115240640A (en) 2022-07-20 2022-07-20 Dialect voice recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115240640A true CN115240640A (en) 2022-10-25

Family

ID=83672783


Country Status (1)

Country Link
CN (1) CN115240640A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024183560A1 (en) * 2023-03-03 2024-09-12 抖音视界有限公司 Speech recognition method and apparatus, and electronic device

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20050101694A (en) * 2004-04-19 2005-10-25 대한민국(전남대학교총장) A system for statistical speech recognition with grammatical constraints, and method thereof
WO2007069512A1 (en) * 2005-12-15 2007-06-21 Sharp Kabushiki Kaisha Information processing device, and program
US8583432B1 (en) * 2012-07-18 2013-11-12 International Business Machines Corporation Dialect-specific acoustic language modeling and speech recognition
CN109949796A (en) * 2019-02-28 2019-06-28 天津大学 A kind of end-to-end framework Lhasa dialect phonetic recognition methods based on Tibetan language component
KR20190080833A (en) * 2019-06-18 2019-07-08 엘지전자 주식회사 Acoustic information based language modeling system and method
US20200327883A1 (en) * 2019-04-15 2020-10-15 Beijing Baidu Netcom Science And Techology Co., Ltd. Modeling method for speech recognition, apparatus and device
CN113077786A (en) * 2021-03-23 2021-07-06 北京儒博科技有限公司 Voice recognition method, device, equipment and storage medium
CN114387950A (en) * 2021-12-14 2022-04-22 北京声智科技有限公司 Speech recognition method, apparatus, device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XU, FAN et al.: "End-to-end dialect speech recognition model based on self-attention", Journal of Signal Processing, vol. 37, no. 10, 22 October 2021 (2021-10-22), pages 1860-1871 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination