Voice recognition method and device
Technical Field
Embodiments of the present invention relate to the technical field of voice information processing, and in particular to a voice recognition method and a voice recognition device.
Background
Speech recognition technology enables a machine to convert a speech signal into a corresponding command or text through a process of recognition and understanding. At present, voice recognition technology is widely applied in voice interaction products such as voice control and voice translation.
At present, various terminals have a voice input function, and application software installed on these terminals needs to execute corresponding operations based on voice recognition results, so that the information required by a user can be generated and presented. The better the terminal's voice recognition, the more accurately the voice information input by the user is recognized, and the more accurate the service provided to the user. For example, suppose the terminal includes map application software through which the user can obtain a route from the current position to a desired place. When a user wants to go to a Beijing "xx restaurant", the terminal receives the voice information input by the user and recognizes it to obtain the text "Beijing xx restaurant"; the map application software searches the map for this text and plans a route from the user's current position to the restaurant. However, when Beijing contains at least two restaurants whose names share the pinyin pronunciation of "xx restaurant", the map application software either presents multiple recognition results, or by default presents the "xx restaurant" closest to the user's current position. The user must then manually screen the presented search results so that the map application software can plan a route from the manually screened result, or else the terminal presents an incorrect route.
Therefore, current voice recognition results suffer from a high error rate.
Disclosure of Invention
The embodiment of the invention provides a voice recognition method and a voice recognition device, which are used for solving the problem of high error rate of the current voice recognition result.
The embodiment of the invention provides the following specific technical scheme:
the embodiment of the invention provides a voice recognition method, which comprises the following steps:
receiving a voice data packet sent by a terminal; wherein, the voice data packet contains voice information;
acquiring acoustic characteristic information of the voice information; wherein the acoustic feature information is information representing sound characteristics of the voice information;
sequentially inputting the acoustic characteristic information into a preset acoustic model and a preset language model, and acquiring initial text information obtained by recognizing the voice information;
according to the pre-stored user information, correcting the initial text information to generate final text information;
and sending the final text information to the terminal.
An embodiment of the present invention provides a speech recognition apparatus, including:
the receiving unit is used for receiving the voice data packet sent by the terminal; wherein, the voice data packet contains voice information;
an acoustic feature information acquisition unit, configured to acquire acoustic feature information of the voice information; wherein the acoustic feature information is information representing sound characteristics of the voice information;
the initial text information acquisition unit is used for sequentially inputting the acoustic characteristic information into a preset acoustic model and a preset language model and acquiring initial text information obtained by identifying the voice information;
the final text information generating unit is used for correcting the initial text information according to pre-stored user information to generate final text information;
and the sending unit is used for sending the final text information to the terminal.
The embodiments of the invention provide a voice recognition method and a voice recognition device. Acoustic feature information of voice information is obtained upon receiving the voice information sent by a terminal; the acoustic feature information is sequentially input into an acoustic model and a language model, and initial text information obtained by recognizing the voice information through the acoustic model and the language model is acquired; the initial text information is then corrected according to pre-stored user information to generate final text information. With this technical solution, errors in the recognized initial text information are corrected, and the final text information generated after correction is sent to the terminal, so that the terminal can provide a more accurate service to the user according to the more accurate final text information.
Drawings
FIG. 1 is a diagram illustrating an exemplary speech recognition system architecture;
FIG. 2 is a flow chart of speech recognition according to an embodiment of the present invention;
FIG. 3 is a flow chart of database establishment according to a second embodiment of the present invention;
fig. 4 is a schematic structural diagram of a speech recognition device according to a third embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of a voice recognition system according to an embodiment of the present invention. The voice recognition system includes a terminal and a server. The terminal has a communication function and a human-computer interaction interface; for example, it may be a personal computer, a tablet computer, or a mobile phone. The terminal may run various operating systems, such as a Microsoft operating system, the Android operating system, or the iOS operating system, and may carry various application software compatible with its operating system, such as map application software or chat tool application software. The server is provided with a voice recognition component and a voice recognition correction component, where the voice recognition component recognizes the voice information sent by the terminal, and the voice recognition correction component corrects the recognition result of the voice recognition component. Further, the server may include a voiceprint service component, a TTS (Text To Speech) component, a data service component, a user database, and the like. The voiceprint service component analyzes the voice information sent by the terminal to obtain initial user information; the TTS component converts final text information into voice information; the data service component analyzes the initial user information obtained by the voiceprint service component to obtain final user information; and the database stores the user information obtained by the data service component, together with the terminal identifier corresponding to that user information.
Example one
Referring to fig. 2, in the embodiment of the present invention, a process of performing voice recognition by a server includes:
step 200: receiving a voice data packet sent by a terminal; wherein, the voice data packet contains voice information.
In the embodiment of the invention, the terminal calls an SDK (Software Development Kit) to obtain, through a voice acquisition component, the voice information input by a user; the terminal then generates a voice data packet from the voice information and sends the voice data packet to the server.
Optionally, the terminal and the server are connected by a wireless communication network, and the terminal sends the voice data packet containing the voice information to the server through the wireless communication network.
Further, after the server receives the voice data packet sent by the terminal, noise removal processing may be performed on the collected voice information to remove interference factors, such as background music or background noise present when the user input the voice information, thereby ensuring the accuracy of the final text information.
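The embodiment does not name a particular denoising algorithm. As one common possibility, the noise removal step might be sketched as a minimal spectral subtraction, where the frame length, the number of noise-estimation frames, and the assumption of stationary background noise are all illustrative choices, not requirements of the embodiment:

```python
import numpy as np

def spectral_subtraction(signal, frame_len=256, noise_frames=5):
    """Suppress stationary background noise by subtracting an average
    noise magnitude spectrum estimated from the first few frames."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    spectra = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spectra), np.angle(spectra)
    noise_mag = mag[:noise_frames].mean(axis=0)    # noise estimate
    clean_mag = np.maximum(mag - noise_mag, 0.0)   # subtract, floor at zero
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    return clean.reshape(-1)

# Toy input: a tone plus random noise, standing in for collected voice data.
rng = np.random.default_rng(0)
noisy = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 2048)) + 0.1 * rng.standard_normal(2048)
denoised = spectral_subtraction(noisy)
print(denoised.shape)  # (2048,)
```

A production system would more likely use an adaptive noise estimate, since background music is not stationary.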
Step 210: acquiring acoustic characteristic information of the voice information; wherein the acoustic feature information is information characterizing sound characteristics of the speech information.
In the embodiment of the invention, a voice recognition component in the server analyzes the voice information to obtain the acoustic feature information it contains. The acoustic feature information is a series of spectral information: acoustically, the pronunciation of each character or word corresponds to a segment of frequency spectrum, and differently pronounced words have different spectra, so the spectral information can represent the sound characteristics of the voice information.
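As an illustration of the "series of spectral information" described above, the following minimal sketch splits a waveform into overlapping frames and computes a log-magnitude spectrum per frame. The frame length and hop size are hypothetical values (typical for 16 kHz audio), not values specified by the embodiment:

```python
import numpy as np

def spectral_features(signal, frame_len=400, hop=160):
    """Split the waveform into overlapping frames and compute a
    log-magnitude spectrum per frame, one spectral vector per frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    feats = []
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        mag = np.abs(np.fft.rfft(frame))
        feats.append(np.log(mag + 1e-10))  # log compression
    return np.array(feats)                 # shape: (frames, frame_len // 2 + 1)

signal = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of 440 Hz
feats = spectral_features(signal)
print(feats.shape)  # (98, 201)
```

Real recognizers typically go one step further (e.g. mel filter banks or MFCCs), but the raw log spectrum already captures the per-frame spectral shape the text describes.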
Step 220: and sequentially inputting the acoustic characteristic information into a preset acoustic model and a preset language model, and acquiring initial text information obtained by identifying the voice information.
In the embodiment of the invention, the voice recognition component in the server sequentially inputs the acoustic feature information into a preset acoustic model and a preset language model, and acquires the initial text information recognized by the language model.
Optionally, the voice recognition component in the server inputs the acoustic feature information into a preset acoustic model and obtains a pronunciation template identifier output by the acoustic model; the pronunciation template identifier is then input into the language model to obtain the initial text information output by the language model. The acoustic model and the language model are obtained by training on a large number of training samples according to a dynamic time warping principle, a hidden Markov principle, or a vector quantization principle.
Specifically, the acoustic model matches the acoustic feature information against each pronunciation template it contains and obtains the distance between the acoustic feature information and each pronunciation template, where a pronunciation template may be a character pronunciation model, a semi-syllable model, or a phoneme model. The acoustic model then selects, from all pronunciation templates, the pronunciation template with the smallest distance to the acoustic feature information. Because a mapping relation exists between a pronunciation template in the acoustic model and a text in the language model, the identifier of the selected pronunciation template is input into the language model, and the language model can acquire the text corresponding to that identifier.
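The template-matching step can be sketched as a nearest-neighbor search. The template identifiers, feature vectors, and Euclidean distance below are illustrative stand-ins for the embodiment's pronunciation templates and its (unspecified) distance measure:

```python
import numpy as np

# Hypothetical pronunciation templates: identifier -> reference feature vector.
templates = {
    "tpl_ma": np.array([1.0, 0.2, 0.1]),
    "tpl_mo": np.array([0.1, 1.0, 0.3]),
    "tpl_mi": np.array([0.2, 0.1, 1.0]),
}

def nearest_template(feature_vec):
    """Return the identifier of the pronunciation template whose distance
    to the acoustic feature vector is smallest."""
    return min(templates, key=lambda tid: np.linalg.norm(templates[tid] - feature_vec))

print(nearest_template(np.array([0.9, 0.25, 0.05])))  # tpl_ma
```

In practice the comparison is between feature sequences rather than single vectors, which is where the dynamic time warping principle mentioned above comes in.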
optionally, the language model includes a plurality of tree structures, each tree structure takes each word or each pronunciation as a root node, and each child node is a phrase that can be formed by each word; since each pronunciation may correspond to multiple texts, the language model performs the following operations for each pronunciation template identification output by the acoustic model: inquiring each tree structure corresponding to the pronunciation template mark, and acquiring a text corresponding to the pronunciation template mark and a mark corresponding to the pronunciation template mark after the pronunciation template mark according to the pronunciation template mark after the pronunciation template mark; and by analogy, all texts corresponding to the voice information are obtained, and initial text information is generated according to all the texts. The language model may output one initial text message or a plurality of initial text messages.
With this technical scheme, because the acoustic model and the language model are trained on a large amount of voice information, inputting the voice information into the acoustic model and the language model yields more accurate initial text information.
Step 230: and correcting the initial text information according to pre-stored user information to generate final text information.
In the embodiment of the invention, a voice recognition correction component in the server extracts pre-stored user information from the user database and corrects the initial text information according to it. The user information is uploaded by a user through a terminal, and/or is obtained by the server through recognition training on the voice information of a large number of users.
Optionally, the pre-stored user information is acquired as follows: the server acquires the identifier of the terminal contained in the voice data packet and searches a user information set for the user information corresponding to that identifier. The user information includes the location of the user at a historical time point, the age of the user, or the gender of the user; the user information set comprises the correspondence between terminal identifiers and user information.
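The user information set can be sketched as a simple mapping from terminal identifier to user record; the identifier and field values below are hypothetical examples, not data from the embodiment:

```python
# Hypothetical user information set: terminal identifier -> user info,
# including locations at historical time points, age, and gender.
user_info_set = {
    "terminal-001": {
        "age": 24,
        "gender": "female",
        "history": [("18:10", "xx cell"), ("08:30", "yy office")],
    },
}

def lookup_user_info(terminal_id):
    """Search the user information set for the record matching the
    terminal identifier carried in the voice data packet."""
    return user_info_set.get(terminal_id)  # None when nothing is stored

info = lookup_user_info("terminal-001")
print(info["age"])  # 24
```

The `None` branch corresponds to the fallback described later, where no local user information matches the terminal identifier and demographics are instead inferred from the acoustic features.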
Optionally, correcting the initial text information according to the pre-stored user information to generate the final text information specifically includes: segmenting the initial text information to obtain individual words. For a location word among the segments, a historical time point matching the current time point is searched from the user information, and the location of the user at the found historical time point is acquired; if the acquired location does not match, or only partially matches, the location word, and the similarity between the pronunciation of the location word and the pronunciation of the acquired location reaches a preset threshold, the location word is replaced with the acquired location. For a special word among the segments, the special word is corrected according to the user age or user gender contained in the user information; a special word is a homophone, that is, a word with the same pronunciation as another word but a different written form.
Optionally, the current time point matching a historical time point means that the time difference between them is smaller than a preset time difference range, which is set according to the specific application scenario.
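The location-correction rule above can be sketched as follows. The similarity measure, the threshold, the time-difference range, and the place names are all illustrative assumptions, since the embodiment leaves these to the specific application scenario:

```python
THRESHOLD = 0.8      # hypothetical pronunciation-similarity threshold
MAX_DIFF_MIN = 30    # hypothetical preset time-difference range, in minutes

def minutes(t):
    h, m = map(int, t.split(":"))
    return 60 * h + m

def pinyin_similarity(a, b):
    """Toy stand-in for a real pronunciation-similarity measure:
    fraction of position-wise matching characters in two pinyin strings."""
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / max(len(a), len(b))

def correct_location(word, word_pinyin, now, history):
    """If a location recorded near the current time point differs from the
    location word but sounds similar enough, replace the word with it."""
    for t, place, place_pinyin in history:
        if abs(minutes(t) - minutes(now)) <= MAX_DIFF_MIN:
            if place != word and pinyin_similarity(word_pinyin, place_pinyin) >= THRESHOLD:
                return place
    return word

# Hypothetical history: at ~18:10 the user is usually at "Hepingmen branch".
history = [("18:10", "Hepingmen branch", "he ping men")]
print(correct_location("Hepingmeng branch", "he ping men", "18:00", history))
# Hepingmen branch
```

At 12:00 the same call would leave the word unchanged, since no historical time point falls within the preset range.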
For example, when the initial text information is "how to go to full focus road conditions", since Beijing contains multiple "full focus" locations, the server first obtains the location word "full focus" contained in the initial text information. The server determines that the current time is 18:00 and detects that the user has been located at the "Henmen Hope Congress" shop three times at about 18:10, so the server assumes by default that the user is searching for the "Henmen Hope Congress" and revises the initial text information to "how to go to the Henmen Hope Congress road condition".
For another example, when the initial text information is "how the traffic condition is", the server assumes by default that the initial text information should contain a location word. The server determines that the current time is 18:00 and detects that the user is usually located in the "xx cell" at about that time point; therefore, the server corrects the initial text information to "how the traffic condition of the xx cell is".
For another example, when the initial text information is "how is yuxi", since "yuxi" has the homophone "feather", the server acquires the age and gender of the user; when the age of the user is 20 to 26 and the gender of the user is female, the server corrects the initial text information to "how is feather".
Further, when a plurality of initial text messages are obtained, the server may select the most accurate initial text message from among them in the above manner and correct the selected initial text message.
Further, the server may also correct the initial text information according to the type of application software that sent the voice data packet. For example, when the voice information input by the user is "how is feather", since the application software being run by the terminal is the map application software and "feather" is not a place name, the server corrects the initial text information to "how is yuxi".
Further, correcting the initial text information according to the pre-stored user information to generate the final text information further includes: when no locally stored user information corresponds to the identifier of the terminal, determining the age and gender of the user providing the voice information according to the acoustic feature information, and correcting the initial text information according to the determined age and gender to generate the final text information.
Optionally, determining the age and gender of the user providing the voice information from the acoustic feature information specifically includes: the voiceprint service component extracts biometric data from the acoustic feature information, the biometric data including timbre, sound quality, tone, speech rate, and the like; the voiceprint service component then acquires the age and gender of the user according to the biometric data and the acoustic model.
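A toy stand-in for this voiceprint analysis: instead of a trained acoustic model, it thresholds two hypothetical biometric features (mean pitch and speech rate). The threshold values are illustrative only and carry no claim about real voiceprint classification:

```python
def estimate_age_gender(features):
    """Toy stand-in for the voiceprint service component: classify a
    speaker from hypothetical biometric features (mean pitch in Hz,
    speech rate in syllables per second) via fixed thresholds."""
    pitch, rate = features["pitch_hz"], features["rate_sps"]
    gender = "female" if pitch >= 165 else "male"  # rough F0 split
    age = "young" if rate >= 4.5 else "adult"      # rough speech-rate split
    return age, gender

print(estimate_age_gender({"pitch_hz": 210.0, "rate_sps": 5.1}))
# ('young', 'female')
```

A real voiceprint component would feed many such features into a trained statistical model rather than hand-set thresholds.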
Step 240: and sending the final text information to the terminal.
In the embodiment of the invention, the server sends the final text information to the terminal through a wireless communication network.
Further, after generating the final text information, the server may convert the final text information into voice information and send the voice information to the terminal, and the terminal plays the final text information.
Further, after generating the final text information, the server may obtain the service requested by the user according to the final text information, generate a data packet corresponding to the service requested by the user, and send the data packet to the terminal. The data packet may be in a text form or a voice form.
With this technical scheme, the recognized initial text information is corrected according to the user's personalized information, so that errors in the initial text information are corrected and the accuracy of voice recognition is improved; the final text information generated after correction is sent to the terminal, so that the terminal can provide a more accurate service to the user according to the more accurate final text information.
Example two
Referring to fig. 3, in the embodiment of the present invention, a process of generating user information included in a database of a server includes:
step 300: receiving a voice data packet sent by a terminal; wherein, the voice data packet contains voice information.
Step 310: and acquiring acoustic feature information contained in the voice information.
Step 320: determining, according to the acoustic feature information, the age and gender of the user providing the voice information, as well as the final text information.
Optionally, the server may further obtain environmental data, such as the time and the user's activity range, according to the acoustic feature information.
Step 330: analyzing the determined age and gender of the user and the final text information, and generating user information according to the analysis result.
Optionally, the server may further generate user information according to the environment data.
Step 340: establishing the correspondence between the identifier of the terminal and the generated user information, and storing the correspondence in the user information set.
EXAMPLE III
Based on the above technical solution, referring to fig. 4, an embodiment of the present invention provides a speech recognition device, including a receiving unit 40, an acoustic feature information obtaining unit 41, an initial text information obtaining unit 42, a final text information generating unit 43, and a sending unit 44, where:
a receiving unit 40, configured to receive a voice data packet sent by a terminal; wherein, the voice data packet contains voice information;
an acoustic feature information obtaining unit 41 configured to obtain acoustic feature information of the voice information; wherein the acoustic feature information is information representing sound characteristics of the voice information;
an initial text information obtaining unit 42, configured to sequentially input the acoustic feature information into a preset acoustic model and a preset language model, and obtain initial text information obtained by recognizing the voice information;
a final text information generating unit 43, configured to modify the initial text information according to pre-stored user information, and generate final text information;
a sending unit 44, configured to send the final text information to the terminal.
Further, the voice data packet also contains a terminal identifier, and the device further comprises a pre-stored information obtaining unit 45 configured to: search a user information set for the user information corresponding to the identifier of the terminal, where the user information includes the location of the user at a historical time point, the age of the user, or the gender of the user, and the user information set comprises the correspondence between terminal identifiers and user information.
Optionally, the initial text information obtaining unit 42 is specifically configured to: input the acoustic feature information into a preset acoustic model and acquire a pronunciation template identifier output by the acoustic model; and input the pronunciation template identifier into the language model to obtain the initial text information output by the language model.
Optionally, the final text information generating unit 43 is specifically configured to: segment the initial text information to obtain individual words; for a location word among the segments, search the user information for a historical time point matching the current time point and acquire the location of the user at the found historical time point, where, if the acquired location does not match, or only partially matches, the location word, and the similarity between the pronunciation of the location word and the pronunciation of the acquired location reaches a preset threshold, the location word is replaced with the acquired location; and for a special word among the segments, correct the special word according to the user age or user gender contained in the user information, a special word being a homophone, that is, a word with the same pronunciation as another word but a different written form.
Further, the final text information generating unit 43 is further configured to: when no locally stored user information corresponds to the identifier of the terminal, determine the age and gender of the user providing the voice information according to the acoustic feature information, and correct the initial text information according to the determined age and gender to generate the final text information.
Further, a processing unit 46 is included, configured to: after the final text information is generated, analyze the determined age and gender of the user and the final text information, and generate user information according to the analysis result; and establish the correspondence between the identifier of the terminal and the generated user information, and store the correspondence in the user information set.
In summary, in the embodiments of the present invention, acoustic feature information of voice information is obtained upon receiving the voice information sent by a terminal; the acoustic feature information is sequentially input into an acoustic model and a language model, and initial text information obtained by recognizing the voice information through the acoustic model and the language model is acquired; and the initial text information is corrected according to pre-stored user information to generate final text information. With this technical solution, errors in the recognized initial text information are corrected, and the final text information generated after correction is sent to the terminal, so that the terminal can provide a more accurate service to the user according to the more accurate final text information.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the embodiments of the present invention, and not to limit the same; although embodiments of the present invention have been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present invention.