WO2020091123A1

WO2020091123A1 - Method and device for providing context-based voice recognition service

Info

Publication number: WO2020091123A1
Application number: PCT/KR2018/013280
Authority: WO
Inventors: 황명진; 강민호; 지창진
Original assignee: 주식회사 시스트란인터내셔널
Priority date: 2018-11-02
Filing date: 2018-11-02
Publication date: 2020-05-07
Also published as: CN113016029A; KR20210052563A

Abstract

The present invention relates to a method for recognizing a voice and a device therefor. More particularly, a voice recognition device according to the present invention can acquire voice information from a user and convert the acquired voice information into voice data. Afterward, a voice recognition model can generate a first voice recognition result by recognizing the converted voice data with a first voice recognition model, generate a second voice recognition result by recognizing the converted voice data with a second voice recognition model, and select a specific voice recognition result from the first voice recognition result and the second voice recognition result through a specific determination procedure.

Description

Method and apparatus for providing context-based speech recognition service

The present invention relates to a method and apparatus for recognizing a user's voice. More specifically, it relates to a method and apparatus for improving speech recognition accuracy based on context in a method for recognizing speech obtained from a user.

Automatic voice recognition (hereinafter referred to as voice recognition) is a technology that converts voice to text using a computer. Speech recognition has recently improved dramatically.

However, although the overall recognition rate is improved, performance differences occur depending on the structure of the data or the structure of the data used when learning the language model or the acoustic model.

An object of the present invention is to provide a method for selecting a speech recognition result with high accuracy among a plurality of speech recognition results when speech is recognized using a plurality of speech recognition models.

In addition, an object of the present invention is to provide a method for selecting a speech recognition model for speech recognition using context information.

The technical problems to be achieved in the present invention are not limited to the technical problems mentioned above, and other technical problems that are not mentioned will be clearly understood by those skilled in the art from the following description. Will be able to.

A method of recognizing a voice according to the present invention includes obtaining voice information from a user; Converting the acquired voice information into voice data; Recognizing the converted speech data with a first speech recognition model and generating a first speech recognition result; Generating a second speech recognition result by recognizing the converted speech data with a second speech recognition model; And selecting a specific speech recognition result from the first speech recognition result and the second speech recognition result through a specific determination procedure.

In addition, in the present invention, the specific determination procedure includes: extracting context information from the first speech recognition result and the second speech recognition result; Comparing the context information with a first characteristic of the first speech recognition model and a second characteristic of the second speech recognition model, respectively; And selecting one of the first speech recognition result and the second speech recognition result based on the comparison result.

In addition, in the present invention, the context information may include at least one of a portion of the speech information or information that can be obtained from the first speech recognition result and the second speech recognition result or information related to a user who has spoken speech. have.

In addition, in the present invention, the first voice recognition model and the second voice recognition model are one of a plurality of voice recognition models for recognizing the voice information obtained from the user.

In addition, the present invention, further comprising the step of recognizing the converted speech data with the plurality of speech recognition models to generate a plurality of speech recognition results, wherein the specific speech recognition result is the first speech recognition result, the The second speech recognition result and the plurality of speech recognition results are selected.

In addition, in the present invention, the specific determination procedure is a procedure for determining a speech recognition result based on the context included in the context information.

In addition, the present invention, obtaining the voice information from the user; Converting the acquired voice information into voice data; Generating a first speech recognition result by recognizing the speech data with the first speech recognition model; Selecting a second speech recognition model for recognizing the speech data among a plurality of speech recognition models based on the first speech recognition result; And recognizing the voice data using the second voice recognition model to generate a second voice recognition result.

In addition, the present invention, the step of extracting context information from the first speech recognition result; And comparing the context information with a predetermined characteristic of the plurality of speech recognition models, wherein the second speech recognition model is selected based on the comparison result.

In addition, in the present invention, the first speech recognition model is a speech recognition model for extracting the context information.

In addition, the present invention, obtaining the voice information from the user; Converting the acquired voice information into voice data; And recognizing the voice data with a specific voice recognition model selected from a plurality of voice recognition models to generate a voice recognition result.

In addition, the present invention, setting the context information for speech recognition; And selecting the specific speech recognition model whose characteristics are most suitable for the context information among the plurality of speech recognition models.

According to an embodiment of the present invention, when a plurality of results are generated using a plurality of speech recognition models when recognizing a speech input, the accuracy of speech recognition is selected by selecting a recognition result of a food recognition model having high accuracy among them. Can increase

In addition, by selecting the speech recognition model according to the context information, each of the plurality of speech recognition models can be used according to the purpose.

In addition, it is possible to select an appropriate voice recognition model in a service for a large user or in an environment in which the physical and situational environment in which the user is located changes frequently.

In addition, by selecting an appropriate speech recognition model, misrecognition due to similar vocabulary that may occur while applying a large language model may be reduced, and misrecognition due to unregistered vocabulary that may occur while applying a small language model may be reduced. .

BRIEF DESCRIPTION OF THE DRAWINGS The accompanying drawings, which are included as part of the detailed description to aid understanding of the present invention, provide embodiments of the present invention, and describe the technical features of the present invention together with the detailed description.

1 is a block diagram of a voice recognition device according to an embodiment of the present invention.

2 and 3 are views showing an example of a speech recognition apparatus according to an embodiment of the present invention.

4 and 5 are views showing another example of a speech recognition apparatus according to an embodiment of the present invention.

6 is a view showing another example of a speech recognition apparatus according to an embodiment of the present invention.

7 is a flowchart illustrating an example of a speech recognition method according to an embodiment of the present invention.

8 is a flowchart illustrating another example of a speech recognition method according to an embodiment of the present invention.

9 is a flowchart illustrating another example of a speech recognition method according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. DETAILED DESCRIPTION The following detailed description, together with the accompanying drawings, is intended to describe exemplary embodiments of the present invention, and is not intended to represent the only embodiments in which the present invention may be practiced. The following detailed description includes specific details to provide a thorough understanding of the present invention. However, one skilled in the art knows that the present invention may be practiced without these specific details.

In some cases, in order to avoid obscuring the concept of the present invention, well-known structures and devices may be omitted, or block diagrams centered on core functions of each structure and device may be illustrated.

Referring to FIG. 1, the voice recognition device 100 for recognizing a user's voice may include an input unit 110, a storage unit 120, a control unit 130, and / or an output unit 140.

The components shown in FIG. 1 are not essential, so an electronic device having more or fewer components may be implemented.

Hereinafter, the components will be described in turn.

The input unit 110 may receive audio information, video signals, or voice information (or voice signals) and data from a user.

The input unit 110 may include a camera and a microphone to receive an audio signal or a video signal. The camera processes a video frame such as a still image or video obtained by an image sensor in a video call mode or a shooting mode.

The image frame processed by the camera may be stored in the storage unit 120.

The microphone receives an external sound signal by a microphone in a call mode, a recording mode, or a voice recognition mode, and processes it as electrical voice data. Various noise reduction algorithms for removing noise generated in the process of receiving an external sound signal may be implemented in the microphone.

When the user's uttered voice is input through a microphone or a microphone, the input unit 110 may convert it into an electrical signal and transmit it to the control unit 130.

The controller 130 may acquire a user's voice data by applying a speech recognition algorithm or a speech recognition engine to the signal received from the input unit 110.

At this time, the signal input to the control unit 130 may be converted into a more useful form for voice recognition, and the control unit 130 converts the input signal from an analog form to a digital form, and detects the start and end points of the voice. By doing so, the actual voice section / data included in the voice data can be detected. This is called EPD (End Point Detection).

Then, the control unit 130 within the detected section Cepstrum (Cepstrum), linear predictive coding (Linear Predictive Coefficient: LPC), Mel frequency Cepstral (Mel Frequency Cepstral Coefficient: MFCC) or filter bank energy (Filter Bank) Energy) to extract feature vector of signal.

The memory 120 may store a program for the operation of the controller 130 and may temporarily store input / output data. A sample file for a symbol-based malicious code detection model can be stored from a user, and analysis results of the malicious code can be stored.

The memory 120 may store various data related to the recognized voice, and in particular, may store information and feature vectors related to the end point of the voice data processed by the controller 130.

The memory 120 includes flash memory, hard disc, memory card, ROM (Read-Only Memory), RAM (Random Access Memory), memory card, EEPROM (Electrically Erasable Programmable Read) It may include at least one storage medium of -Only Memory), PROM (Programmable Read-Only Memory), magnetic memory, magnetic disk, or optical disk.

Then, the control unit 130 may obtain a recognition result through comparison between the extracted feature vector and the trained reference pattern. To this end, a speech recognition model for modeling and comparing signal characteristics of speech and a language model for modeling linguistic order relationships such as words or syllables corresponding to recognized vocabulary may be used.

The speech recognition model can be divided into a direct comparison method that sets the recognition target as a feature vector model and compares it with the feature vector of speech data, and a statistical method that statistically processes the feature vector of the recognition target.

The direct comparison method is a method of setting units of words, phonemes, and the like to be recognized as feature vector models and comparing how similar the input voices are to each other. A representative method is vector quantization. According to the vector quantization method, a feature vector of the input speech data is mapped to a codebook, which is a reference model, and encoded as a representative value, thereby comparing these code values.

The statistical model method is a method of constructing a unit for a recognition object into a state sequence and using the relationship between the state columns. The status column may consist of a plurality of nodes. The methods of using the relationship between the state columns are dynamic time warping (DTW), hidden markov model (HMM), and neural network.

Dynamic time warping is a method of compensating for differences in the time axis when compared with the reference model by considering the dynamic characteristics of the voice whose signal length varies with time even if the same person pronounces the same, and the Hidden Markov model makes the speech state transition probability. And after assuming the Markov process having the observation probability of the node (output symbol) in each state, estimates the state transition probability and the observation probability of the node through the learning data, and calculates the probability that the input voice will occur in the estimated model It is a recognition technology.

On the other hand, a language model that models a linguistic order relationship such as a word or a syllable can reduce acoustic ambiguity and reduce errors in recognition by applying the order relationship between units constituting language to units obtained in speech recognition. The language model includes a statistical language model and a model based on the Finite State Automata (FSA), and the statistical language model uses chain probabilities of words such as Unigram, Bigram, and Trigram.

The controller 130 may use any of the above-described methods in recognizing the voice. For example, a speech recognition model to which the Hidden Markov model is applied may be used, or an N-best search method incorporating a speech recognition model and a language model may be used. The N-best search method can improve recognition performance by selecting up to N recognition candidates using speech recognition model and language model, and re-evaluating the ranking of these candidates.

The controller 130 may calculate a confidence score (or may be abbreviated as 'reliability') to secure the reliability of the recognition result.

The reliability score is a measure of how reliable the result is for speech recognition results. It can be defined as the relative value of the probability that the word is spoken from other phonemes or words for the recognized phoneme or word. have. Therefore, the reliability score may be expressed as a value between 0 and 1, or may be expressed as a value between 0 and 100. When the reliability score is greater than a preset threshold, the recognition result may be recognized, and if the reliability score is small, the recognition result may be rejected.

In addition to this, the reliability score can be obtained according to various conventional reliability score acquisition algorithms.

The control unit 130 may be implemented in a computer-readable recording medium using software, hardware, or a combination thereof. According to the hardware implementation, ASICs (Application Specific Integrated Circuits), DSPs (Digital Signal Processors), DSPDs (Digital Signal Processing Devices), PLDs (Programmable LogicDevices), FPGAs (Field Programmable Gate Arrays), processors (processor), microcontrollers It may be implemented using at least one of electrical units such as (microcontrollers) and micro-processors.

According to the software implementation, it may be implemented together with a separate software module that performs at least one function or operation, and the software code may be implemented by a software application written in an appropriate program language.

The control unit 130 implements the functions, processes, and / or methods proposed in FIGS. 2 to 6, which will be described later, and hereinafter, for convenience of description, the control unit 130 is identified with the voice recognition device 100 and described. do.

The output unit 140 is for generating output related to vision, hearing, and the like, and outputs information processed by the device 100.

For example, the output unit 140 may output the recognition result of the voice signal processed by the controller 130 so that the user can recognize it through visual or hearing.

The voice recognition model described below may recognize voice information input from a user through the same method as the voice recognition model described in FIG. 1.

2 and 3, the speech recognition apparatus recognizes speech data obtained from a user as a plurality of speech recognition models, and selects one of the results recognized from the speech recognition models based on context information to recognize speech. Can provide services.

Specifically, the voice recognition apparatus may convert voice information input from a user into an electrical signal, and convert the analog signal, which is the converted electrical signal, into a digital signal to generate voice data.

Thereafter, the voice recognition model may recognize voice data using the first voice recognition model 2010 and the second voice recognition model 2020, respectively.

The speech recognition device uses two basic voice recognition models and a user voice recognition model, respectively, to obtain two voice recognition results (voice recognition result 1 (2030) and voice recognition result 2 (2040)) from voice data converted from a voice signal input from a user. ).

The speech recognition apparatus applies the first speech recognition result and the second speech recognition result to a first specific determination procedure (for example, a first context-based appropriate speech recognition model determination technique) to perform the first speech recognition result and the second speech recognition result. Among the voice recognition results, a more suitable voice recognition result 2050 may be selected and output.

That is, the speech recognition apparatus may select a speech recognition result that is more suitable for the purpose of speech recognition among the first speech recognition result and the second speech recognition result through the first specific determination procedure, and output the selected speech recognition result. .

For example, when context information extracted from the first speech recognition result and the second speech recognition result is related to address search, a speech recognition model more suitable for address search is selected from the first speech recognition model and the second speech recognition model, The speech recognition result of the selected speech recognition model may be provided as a speech recognition service.

Hereinafter, a specific determination procedure will be described with reference to FIG. 3.

3 is a flowchart illustrating an example of a first specific determination procedure for determining an appropriate speech recognition model based on the context of the first speech recognition result and the second speech recognition result.

As illustrated in FIG. 3, in the first specific determination procedure, when the first speech recognition result 3010 and the second speech recognition result 3020 are generated from the first speech recognition model and the second speech recognition model, respectively, Based on the context 3032 extracted from the first speech recognition result 3010 and the second speech recognition result 3020, the purpose of speech recognition among the first speech recognition result 3010 and the second speech recognition result 3020 is further added. A suitable speech recognition model can be selected (3034).

Thereafter, the speech recognition apparatus may select 3036 a speech recognition result generated from the selected speech recognition model, and output 3040 the result.

For example, in FIG. 3, in the first voice recognition result 'tell me a self-address address' and the second voice recognition result 'tell me an address of Gil-dong', the speech recognition device determined 'tell me some address' as context information.

Specifically, the speech recognition device may extract the context information 'tell me an address' (3032) from 'tell me an address of an e-cylinder' and 'tell me an address of a Gil-dong'.

Thereafter, the speech recognition apparatus compares the extracted contextual information with the characteristics of the first speech recognition model (first characteristic) and the characteristics of the second speech recognition model (second characteristic) as a speech recognition model more suitable for the purpose of speech recognition. A first voice recognition model may be selected (3034).

Thereafter, the speech recognition apparatus may select a first speech recognition result of the selected first speech recognition model (3036), and may output 'tell me a base address', which is the selected first speech recognition result.

At this time, in addition to the part of the recognition sentence recognized from the voice data as the context information, all information that can be inferred through the recognition result or the like may be used as the context information.

For example, at least one of information related to the user, such as the user's location, the user's weather, user habits, the user's previous speech context, the user's career, the user's position, the user's financial status, the current time, and the user's language. Can be used as contextual information.

4 and 5, the speech recognition apparatus recognizes speech data obtained from a user as a plurality of speech recognition models, and selects one of the results recognized from the speech recognition models based on context information to recognize speech. Can provide services.

Specifically, the voice recognition device may generate the first voice recognition result 4020 by recognizing the voice information input from the user as the first voice recognition model 4010.

At this time, the first voice recognition model is a voice information model for extracting context from voice information obtained from a user, and may be configured to use only small resources according to the purpose of the voice information model according to the purpose of voice recognition.

The voice recognition apparatus is a specific voice most suitable for recognizing voice information input from a user among a plurality of preset voice recognition models using a second specific determination procedure (for example, a second context-based appropriate voice recognition model determination technique). A recognition model may be selected (4030).

That is, the speech recognition apparatus may select a specific speech recognition model from among a plurality of speech recognition models based on the first speech recognition result according to the purpose and purpose of speech recognition.

For example, when context information extracted from the first speech recognition result is related to address search, a speech recognition model that is most suitable for address search among a plurality of candidate speech recognition models may be selected as a specific speech recognition model.

At this time, the second specific determination procedure includes a process of extracting context information from the first speech recognition result to select a specific speech recognition model, and selecting a specific speech recognition model using the extracted context information.

Thereafter, the voice recognition procedure may re-recognize the voice data converted from the voice information input from the user using the selected specific voice recognition model to finally generate the voice recognition result 4040.

Hereinafter, the second specific determination procedure will be described with reference to FIG. 5.

5 is a flowchart illustrating an example of a second specific determination procedure for determining an appropriate speech recognition model based on the context of the first speech recognition result and the second speech recognition result.

Specifically, the speech recognition device generates (or receives) the first speech recognition result 5010 that recognizes the user's speech information by using the first speech recognition model described with reference to FIG. 4, and the generated speech recognition result Based on the second specific determination procedure, a specific speech recognition model that is most suitable for the purpose of speech recognition may be selected from a plurality of (eg, N) speech recognition models as shown in FIG. 4 (5020).

For example, from the first speech recognition model recognized through the first speech recognition result, 'tell me the address of Lee Gil-dong', 'tell me the address' can be extracted as context information.

At this time, the first voice recognition model is a voice information model for extracting context from voice information obtained from a user as described above, and may be configured to use only small resources according to the purpose of the voice information model according to the purpose of voice recognition. have.

In the context information, in addition to a part of the sentence recognized through the speech recognition model, all information that can be inferred through the recognition result may be used as the context information.

Thereafter, the speech recognition apparatus may select a specific speech recognition model that is most suitable for the purpose of speech recognition, as shown in FIG. 4, from the plurality of speech recognition models using the extracted context information 'tell me the address'.

Through this method, the speech recognition apparatus may extract context information through a speech recognition model for obtaining context information, and select a specific speech recognition model that is most suitable for the purpose of speech recognition.

Referring to FIG. 6, the speech recognition apparatus may select a specific speech recognition model among a plurality of speech recognition models in advance by setting context information for speech recognition, and use the speech recognition results recognized through the selected speech recognition model To provide a speech recognition service.

Specifically, the speech recognition apparatus may select a specific speech recognition model, which is determined to be most suitable for speech recognition, from among a plurality of speech recognition models according to predetermined context information (6010).

For example, when the purpose and purpose of the speech recognition service is address search, the speech recognition apparatus may select a speech recognition model preset for use in address search among a plurality of speech recognition models as a specific speech recognition model.

Thereafter, the voice recognition model may generate voice recognition results 6020 by recognizing voice data obtained from a user through the selected specific food model.

In this case, the voice data may mean data in which voice information obtained from a user is changed into an electrical signal, and an analog signal, which is a changed electrical signal, is converted into a digital signal.

Referring to FIG. 7, as described in FIGS. 2 and 3, the speech recognition device generates a speech recognition result through a plurality of speech recognition devices, and selects the most suitable speech recognition result among the generated speech recognition results to recognize the speech Can provide services.

Specifically, the speech recognition device may acquire speech information from a user and convert the acquired speech information into speech data (S7010).

For example, the voice recognition device may convert voice information obtained from a user into an electrical signal, and convert an analog signal, which is a changed electrical signal, into voice data, which is a digital signal.

Thereafter, the voice recognition apparatus may recognize the voice data using the first voice recognition model and the second voice recognition model, respectively, and generate first voice recognition results and second voice recognition results (S7020, S7030).

Thereafter, the speech recognition apparatus provides a speech recognition service by selecting a speech recognition result that is more suitable for the purpose of speech recognition from the first speech recognition result and the second speech recognition result through the first specific determination procedure illustrated in FIGS. 2 and 3. It can be done (S7040).

For example, the speech recognition apparatus extracts context information from the first speech recognition result and the second speech recognition result, and extracts the extracted context information from the first characteristics of the first speech recognition model and the second speech recognition model. Each can be compared with 2 characteristics.

Thereafter, the speech recognition apparatus may select a speech recognition model suitable for the purpose and / or purpose of speech recognition among the first speech recognition model and the second speech recognition model based on the comparison result.

Thereafter, the speech recognition apparatus may select a second speech recognition result generated by the selected second speech recognition model as a speech recognition result, and provide a speech recognition service based on the selected second speech recognition result.

Referring to FIG. 8, the speech recognition model may extract context information through speech data and provide a speech recognition service based on the extracted context information.

First, step S8010 is the same as step S7010 in FIG. 7, and thus description thereof will be omitted.

Thereafter, the speech recognition device recognizes the speech data using the first speech recognition model to generate a first speech recognition result (S8020).

At this time, the first voice recognition model is a voice information model for extracting context from voice information obtained from a user as described in FIG. 4, and is configured to use only small resources according to the purpose of the voice information model according to the purpose of voice recognition. Can be.

The speech recognition device may extract context information from the first speech recognition result (S8030).

The context information may mean all information that can be inferred through a recognition result, etc., in addition to a part of a sentence recognized through a voice recognition model.

Thereafter, the voice recognition apparatus may select a specific voice recognition model most suitable for recognizing voice information input from a user from among a plurality of preset voice recognition models using the second specific determination procedure described with reference to FIGS. 4 and 5 (S8040). ).

For example, when context information extracted from the first speech recognition result is related to address search, a speech recognition model most suitable for address search among a plurality of candidate speech recognition models may be selected as a specific food recognition model.

Thereafter, the voice recognition procedure may re-recognize the voice data converted from the voice information input from the user using the selected specific voice recognition model to finally generate a voice recognition result (S8040).

Thereafter, the voice recognition apparatus may provide a voice recognition service based on a voice recognition result of recognizing voice data through a specific voice recognition model.

Referring to FIG. 9, the voice recognition apparatus may select a specific voice recognition model from among a plurality of voice recognition models based on context information before receiving voice information from the user, and voice information input from the user through the selected voice recognition model Can recognize.

Specifically, the speech recognition device may preset context information for speech recognition.

Thereafter, the speech recognition apparatus selects a specific speech recognition model according to the purpose / use of speech recognition among a plurality of speech recognition models based on context information (S9020).

For example, in the case of an address search, the speech recognition apparatus may select a speech recognition model preset for an address search among a plurality of speech recognition models as a specific speech recognition model.

Thereafter, when the voice information is obtained from the user, the voice recognition apparatus may convert the obtained voice information into voice data (S9010).

Thereafter, the speech recognition device may generate speech recognition results by recognizing speech data using the selected specific speech recognition model (S9050).

Embodiments according to the present invention may be implemented by various means, for example, hardware, firmware, software, or a combination thereof. For implementation by hardware, one embodiment of the invention includes one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), FPGAs ( field programmable gate arrays), processors, controllers, microcontrollers, microprocessors, etc.

In the case of implementation by firmware or software, an embodiment of the present invention may be implemented in the form of a module, procedure, function, etc. that performs the functions or operations described above. The software code can be stored in memory and driven by a processor. The memory is located inside or outside the processor, and can exchange data with the processor by various known means.

It is apparent to those skilled in the art that the present invention may be embodied in other specific forms without departing from the essential features of the present invention. Therefore, the above detailed description should not be construed as limiting in all respects, but should be considered illustrative. The scope of the present invention should be determined by rational interpretation of the appended claims, and all changes within the equivalent scope of the present invention are included in the scope of the present invention.

The present invention can be applied to various voice recognition technology fields, and the present invention can provide a method for selecting an optimal voice recognition model based on context.

Due to this feature, it is possible to derive the best voice recognition result when an unspecified voice input is received from a service using multiple voice recognition models having different strengths for each field.

This feature can be applied not only to voice recognition, but also to other artificial intelligence services.

Claims

In the voice recognition method,

Obtaining voice information from a user;

Converting the acquired voice information into voice data;

Recognizing the converted speech data with a first speech recognition model and generating a first speech recognition result;

Generating a second speech recognition result by recognizing the converted speech data with a second speech recognition model; And

And selecting a specific speech recognition result from the first speech recognition result and the second speech recognition result through a specific determination procedure.
The method of claim 1, wherein the specific determination procedure,

Extracting context information from the first speech recognition result and the second speech recognition result;

Comparing the context information with a first characteristic of the first speech recognition model and a second characteristic of the second speech recognition model, respectively; And

And selecting one of the first speech recognition result and the second speech recognition result based on the comparison result.
According to claim 2,

The context information includes at least one of a part of the speech information or information that can be obtained from the first speech recognition result and the second speech recognition result or information related to a user who has spoken.
According to claim 1,

The first voice recognition model and the second voice recognition model are one of a plurality of voice recognition models for recognizing the voice information obtained from the user.
According to claim 1,

Further comprising the step of recognizing the converted speech data with the plurality of speech recognition models to generate a plurality of speech recognition results,

The specific speech recognition result is selected from the first speech recognition result, the second speech recognition result, and the plurality of speech recognition results.
According to claim 1,

The specific determination procedure is a method of determining a speech recognition result based on a context included in context information.
In the voice recognition method,

Obtaining voice information from a user;

Converting the acquired voice information into voice data;

Generating a first speech recognition result by recognizing the speech data with the first speech recognition model;

Selecting a second speech recognition model for recognizing the speech data among a plurality of speech recognition models based on the first speech recognition result; And

And recognizing the voice data with the second voice recognition model to generate a second voice recognition result.
The method of claim 7,

Extracting context information from the first speech recognition result; And

Comprising the step of comparing the context information and a predetermined specific of the plurality of speech recognition models,

The second voice recognition model is selected based on the comparison result.
The method of claim 8,

The first speech recognition model is a speech recognition model for extracting the context information.
In the voice recognition method,

Obtaining voice information from a user;

Converting the acquired voice information into voice data; And

And recognizing the speech data using a specific speech recognition model selected from a plurality of speech recognition models to generate a speech recognition result.
The method of claim 10,

Setting context information for speech recognition; And

And selecting the specific speech recognition model in which the characteristics of the speech recognition model are best suited to the context information among the plurality of speech recognition models.