WO2020091123A1 - Method and device for providing context-based voice recognition service - Google Patents
Method and device for providing context-based voice recognition service
- Publication number
- WO2020091123A1 (PCT/KR2018/013280)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech recognition
- voice
- speech
- model
- recognition result
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/221—Announcement of recognition results
Definitions
- the present invention relates to a method and apparatus for recognizing a user's voice. More specifically, it relates to a method and apparatus for improving speech recognition accuracy based on context in a method for recognizing speech obtained from a user.
- automatic speech recognition (hereinafter, speech recognition) is a technology that converts speech into text using a computer. The recognition rate of speech recognition has improved dramatically in recent years. However, while the overall recognition rate has improved, performance still varies with the composition of the training data and the structure of the language model or acoustic model.
- An object of the present invention is to provide a method for selecting a speech recognition result with high accuracy among a plurality of speech recognition results when speech is recognized using a plurality of speech recognition models.
- another object of the present invention is to provide a method for selecting a speech recognition model for speech recognition using context information.
- a method of recognizing a voice includes obtaining voice information from a user; Converting the acquired voice information into voice data; Recognizing the converted speech data with a first speech recognition model and generating a first speech recognition result; Generating a second speech recognition result by recognizing the converted speech data with a second speech recognition model; And selecting a specific speech recognition result from the first speech recognition result and the second speech recognition result through a specific determination procedure.
- the specific determination procedure includes: extracting context information from the first speech recognition result and the second speech recognition result; Comparing the context information with a first characteristic of the first speech recognition model and a second characteristic of the second speech recognition model, respectively; And selecting one of the first speech recognition result and the second speech recognition result based on the comparison result.
- the context information may include at least one of: a portion of the voice information, information that can be obtained from the first speech recognition result and the second speech recognition result, or information related to the user who spoke.
- the first voice recognition model and the second voice recognition model are among a plurality of voice recognition models for recognizing the voice information obtained from the user.
- the method may further comprise recognizing the converted speech data with the plurality of speech recognition models to generate a plurality of speech recognition results, wherein the specific speech recognition result is selected from among the first speech recognition result, the second speech recognition result, and the plurality of speech recognition results.
- the specific determination procedure is a procedure for determining a speech recognition result based on the context included in the context information.
- the present invention also provides a method comprising: obtaining voice information from a user; converting the acquired voice information into voice data; recognizing the voice data with a first speech recognition model to generate a first speech recognition result; selecting, based on the first speech recognition result, a second speech recognition model for recognizing the voice data from among a plurality of speech recognition models; and recognizing the voice data with the second speech recognition model to generate a second speech recognition result.
- the method may further comprise extracting context information from the first speech recognition result and comparing the context information with preset characteristics of the plurality of speech recognition models, wherein the second speech recognition model is selected based on the comparison result.
- the first speech recognition model is a speech recognition model for extracting the context information.
- the present invention also provides a method comprising: obtaining voice information from a user; converting the acquired voice information into voice data; and recognizing the voice data with a specific voice recognition model selected from among a plurality of voice recognition models to generate a voice recognition result.
- the method may further comprise setting context information for speech recognition, and selecting, from among the plurality of speech recognition models, the specific speech recognition model whose characteristics best fit the context information.
- according to an embodiment of the present invention, when a plurality of results are generated by recognizing a speech input with a plurality of speech recognition models, the accuracy of speech recognition can be increased by selecting the recognition result of the speech recognition model with high accuracy among them.
- in addition, by selecting a speech recognition model according to context information, each of the plurality of speech recognition models can be used for its intended purpose.
- in addition, because an appropriate speech recognition model can be selected, misrecognition due to similar vocabulary that may occur when applying a large language model can be reduced, as can misrecognition due to unregistered vocabulary that may occur when applying a small language model.
- FIG. 1 is a block diagram of a voice recognition device according to an embodiment of the present invention.
- FIGS. 2 and 3 are views showing an example of a speech recognition apparatus according to an embodiment of the present invention.
- FIGS. 4 and 5 are views showing another example of a speech recognition apparatus according to an embodiment of the present invention.
- FIG. 6 is a view showing another example of a speech recognition apparatus according to an embodiment of the present invention.
- FIG. 7 is a flowchart illustrating an example of a speech recognition method according to an embodiment of the present invention.
- FIG. 8 is a flowchart illustrating another example of a speech recognition method according to an embodiment of the present invention.
- FIG. 9 is a flowchart illustrating another example of a speech recognition method according to an embodiment of the present invention.
- FIG. 1 is a block diagram of a voice recognition device according to an embodiment of the present invention.
- the voice recognition device 100 for recognizing a user's voice may include an input unit 110, a storage unit 120, a control unit 130, and / or an output unit 140.
- the components shown in FIG. 1 are not essential, so an electronic device having more or fewer components may be implemented.
- the input unit 110 may receive audio information, video signals, or voice information (or voice signals) and data from a user.
- the input unit 110 may include a camera and a microphone to receive an audio signal or a video signal.
- the camera processes a video frame such as a still image or video obtained by an image sensor in a video call mode or a shooting mode.
- the image frame processed by the camera may be stored in the storage unit 120.
- the microphone receives an external sound signal by a microphone in a call mode, a recording mode, or a voice recognition mode, and processes it as electrical voice data.
- Various noise reduction algorithms for removing noise generated in the process of receiving an external sound signal may be implemented in the microphone.
- when the user's uttered voice is input through a microphone, the input unit 110 may convert it into an electrical signal and transmit it to the control unit 130.
- the controller 130 may acquire a user's voice data by applying a speech recognition algorithm or a speech recognition engine to the signal received from the input unit 110.
- the signal input to the control unit 130 may be converted into a form more useful for voice recognition: the control unit 130 converts the input signal from analog to digital form and detects the start and end points of the voice, thereby detecting the actual voice section/data contained in the voice data. This is called End Point Detection (EPD).
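The patent does not prescribe a particular EPD algorithm. As a minimal sketch, assuming a simple frame-energy criterion (the frame size and threshold are illustrative values, not from the document), end point detection could look like this in Python:

```python
import numpy as np

def detect_endpoints(samples, rate=16000, frame_ms=20, threshold_db=-35.0):
    """Return (start, end) sample indices of the detected voice section.

    Frames whose log energy exceeds a fixed threshold are treated as
    speech; real systems adapt the threshold to the noise floor.
    """
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy_db = 10 * np.log10(np.mean(frames.astype(np.float64) ** 2, axis=1) + 1e-12)
    voiced = np.where(energy_db > threshold_db)[0]
    if voiced.size == 0:
        return None  # no speech detected
    return voiced[0] * frame_len, (voiced[-1] + 1) * frame_len
```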
- within the detected section, the control unit 130 may extract a feature vector of the signal by applying a feature-vector extraction technique such as Cepstrum, Linear Predictive Coefficient (LPC), Mel Frequency Cepstral Coefficient (MFCC), or Filter Bank Energy.
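Any of the named techniques yields one feature vector per frame. A short sketch of the MFCC case using the third-party librosa library (the library choice and parameters are assumptions, not specified by the patent):

```python
import librosa  # third-party: pip install librosa

def extract_mfcc(wav_path, n_mfcc=13):
    """Extract a sequence of MFCC feature vectors from an audio file."""
    y, sr = librosa.load(wav_path, sr=16000)              # resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                                         # shape: (num_frames, n_mfcc)
```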
- the memory 120 may store a program for the operation of the controller 130 and may temporarily store input / output data.
- a sample file for a symbol-based malicious code detection model may be stored by a user, and analysis results of the malicious code may be stored.
- the memory 120 may store various data related to the recognized voice, and in particular, may store information and feature vectors related to the end point of the voice data processed by the controller 130.
- the memory 120 may include at least one storage medium among flash memory, hard disk, memory card, ROM (Read-Only Memory), RAM (Random Access Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), PROM (Programmable Read-Only Memory), magnetic memory, magnetic disk, and optical disk.
- control unit 130 may obtain a recognition result through comparison between the extracted feature vector and the trained reference pattern.
- a speech recognition model for modeling and comparing signal characteristics of speech and a language model for modeling linguistic order relationships such as words or syllables corresponding to recognized vocabulary may be used.
- the speech recognition model can be divided into a direct comparison method that sets the recognition target as a feature vector model and compares it with the feature vector of speech data, and a statistical method that statistically processes the feature vector of the recognition target.
- the direct comparison method sets units such as words or phonemes to be recognized as feature vector models and compares how similar the input voice is to them.
- a representative example is vector quantization. According to the vector quantization method, the feature vectors of the input speech data are mapped to a codebook, which is the reference model, and encoded as representative values, and these code values are compared with each other.
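A minimal sketch of the comparison step, assuming the codebook has already been trained elsewhere (e.g., with k-means):

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to its nearest codeword and total distortion.

    features: (T, D) frames; codebook: (K, D) reference vectors. A lower
    total distortion against a word's codebook means the input is more
    similar to that word.
    """
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)  # (T, K)
    codes = dists.argmin(axis=1)          # nearest codeword index per frame
    distortion = dists.min(axis=1).sum()  # total quantization error
    return codes, distortion
```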
- the statistical model method constructs the units of the recognition target as state sequences and uses the relationships between the state sequences.
- a state sequence may consist of a plurality of nodes.
- methods that use the relationships between state sequences include Dynamic Time Warping (DTW), the Hidden Markov Model (HMM), and neural networks.
- dynamic time warping compensates for differences on the time axis when comparing the input with the reference model, considering the dynamic characteristics of speech, in which the signal length varies over time even when the same person makes the same pronunciation. The Hidden Markov Model assumes that speech is a Markov process with state transition probabilities and observation probabilities of nodes (output symbols) in each state, estimates the state transition probabilities and node observation probabilities from training data, and calculates the probability that the input speech was generated by the estimated model.
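The DTW comparison described above can be written compactly. The following is a textbook sketch (quadratic time, Euclidean local distance), not an implementation claimed by the patent:

```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between feature sequences a (T1, D) and b (T2, D).

    Compensates for timing differences by letting one frame of `a`
    align with one or more frames of `b`, and vice versa.
    """
    t1, t2 = len(a), len(b)
    cost = np.full((t1 + 1, t2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # advance in a only
                                 cost[i, j - 1],      # advance in b only
                                 cost[i - 1, j - 1])  # advance in both
    return cost[t1, t2]
```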
- a language model that models linguistic order relationships, such as those between words or syllables, can reduce acoustic ambiguity and recognition errors by applying the order relationships between the units constituting the language to the units obtained in speech recognition.
- language models include statistical language models and models based on Finite State Automata (FSA); statistical language models use chain probabilities of words, such as Unigram, Bigram, and Trigram models.
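For the statistical case, a chain-probability model is straightforward to sketch. The bigram model below with add-one smoothing is illustrative only; the patent names Unigram/Bigram/Trigram without fixing an estimator:

```python
from collections import Counter

class BigramModel:
    """P(w1..wn) ~ product of P(w_i | w_{i-1}), estimated from word counts."""

    def __init__(self, sentences):
        self.unigrams, self.bigrams = Counter(), Counter()
        for words in sentences:
            padded = ["<s>"] + words
            self.unigrams.update(padded)
            self.bigrams.update(zip(padded, padded[1:]))
        self.vocab_size = len(self.unigrams)

    def prob(self, prev, word):
        # Add-one smoothing keeps unseen pairs at a nonzero probability
        return (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + self.vocab_size)

    def sentence_prob(self, words):
        p = 1.0
        for prev, word in zip(["<s>"] + words, words):
            p *= self.prob(prev, word)
        return p
```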
- the controller 130 may use any of the above-described methods in recognizing the voice.
- a speech recognition model to which the Hidden Markov model is applied may be used, or an N-best search method incorporating a speech recognition model and a language model may be used.
- the N-best search method can improve recognition performance by selecting up to N recognition result candidates using a speech recognition model and a language model, and then re-evaluating the ranking of these candidates.
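A hedged sketch of the re-evaluation step, combining an acoustic score with a language model such as the BigramModel above (the weight is an illustrative tuning parameter, not a value from the patent):

```python
import math

def rerank_nbest(candidates, lm, lm_weight=0.8):
    """Pick the best of N hypotheses by a combined acoustic + LM score.

    candidates: list of (words, acoustic_log_score) pairs from the search;
    lm: any object with a sentence_prob(words) method.
    """
    def combined(item):
        words, acoustic_log_score = item
        return acoustic_log_score + lm_weight * math.log(lm.sentence_prob(words) + 1e-300)
    return max(candidates, key=combined)
```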
- the controller 130 may calculate a confidence score (or may be abbreviated as 'reliability') to secure the reliability of the recognition result.
- the reliability score is a measure of how trustworthy a speech recognition result is. It can be defined as the relative value of the probability that the recognized phoneme or word was uttered, compared with other phonemes or words. The reliability score may be expressed as a value between 0 and 1, or as a value between 0 and 100. When the reliability score is greater than a preset threshold, the recognition result may be accepted; when it is smaller, the recognition result may be rejected.
- the reliability score can be obtained according to various conventional reliability score acquisition algorithms.
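The accept/reject rule reduces to a comparison against the preset threshold. For instance, assuming scores normalized to [0, 1] and an illustrative threshold:

```python
def accept_result(hypothesis, confidence, threshold=0.7):
    """Accept a recognition result if its confidence clears the threshold."""
    if confidence >= threshold:
        return hypothesis  # recognition result accepted
    return None            # rejected; the caller may re-prompt the user
```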
- the control unit 130 may be implemented in a computer-readable recording medium using software, hardware, or a combination thereof. In a hardware implementation, it may be implemented using at least one of electrical units such as Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), processors, microcontrollers, and microprocessors.
- in a software implementation, it may be implemented together with a separate software module that performs at least one function or operation, and the software code may be implemented by a software application written in an appropriate programming language.
- the control unit 130 implements the functions, processes, and/or methods proposed in FIGS. 2 to 6, which are described later; hereinafter, for convenience of description, the control unit 130 is identified with the voice recognition device 100.
- the output unit 140 is for generating output related to vision, hearing, and the like, and outputs information processed by the device 100.
- the output unit 140 may output the recognition result of the voice signal processed by the control unit 130 so that the user can perceive it through sight or hearing.
- the voice recognition models described below may recognize voice information input from a user in the same manner as the voice recognition model described in FIG. 1.
- FIGS. 2 and 3 are views showing an example of a speech recognition apparatus according to an embodiment of the present invention.
- the speech recognition apparatus recognizes speech data obtained from a user with a plurality of speech recognition models, and selects one of the results recognized by the speech recognition models based on context information, thereby providing a speech recognition service.
- the voice recognition apparatus may convert voice information input from a user into an electrical signal, and convert the analog signal, which is the converted electrical signal, into a digital signal to generate voice data.
- the voice recognition apparatus may recognize the voice data using the first voice recognition model 2010 and the second voice recognition model 2020, respectively.
- for example, the speech recognition device may use two speech recognition models, a basic speech recognition model and a user speech recognition model, to obtain two speech recognition results (speech recognition result 1 (2030) and speech recognition result 2 (2040)) from the voice data converted from the voice signal input by the user.
- the speech recognition apparatus applies the first speech recognition result and the second speech recognition result to a first specific determination procedure (for example, a first context-based appropriate speech recognition model determination technique), and may select and output the more suitable speech recognition result 2050.
- the speech recognition apparatus may select a speech recognition result that is more suitable for the purpose of speech recognition among the first speech recognition result and the second speech recognition result through the first specific determination procedure, and output the selected speech recognition result.
- for example, a speech recognition model more suitable for address search is selected from the first speech recognition model and the second speech recognition model, and the speech recognition result of the selected speech recognition model may be provided as the speech recognition service.
- FIG. 3 is a flowchart illustrating an example of a first specific determination procedure for determining an appropriate speech recognition model based on the context of the first speech recognition result and the second speech recognition result.
- based on the first speech recognition result 3010 and the second speech recognition result 3020, a speech recognition model more suitable for the purpose of speech recognition can be selected (3034).
- the speech recognition apparatus may select (3036) the speech recognition result generated from the selected speech recognition model, and output (3040) the result.
- for example, if the first speech recognition result is 'tell me the address of Lee Gil-dong' and the second speech recognition result is 'tell me the address of Gil-dong', the speech recognition device may extract (3032) 'tell me the address' as the context information.
- the speech recognition apparatus compares the extracted context information with the characteristics of the first speech recognition model (the first characteristic) and the characteristics of the second speech recognition model (the second characteristic), and may select (3034) the first speech recognition model as the speech recognition model more suitable for the purpose of speech recognition.
- the speech recognition apparatus may select (3036) the first speech recognition result of the selected first speech recognition model, and output the selected first speech recognition result, 'tell me the address of Lee Gil-dong'.
- at least one piece of information related to the user, such as the user's location, the weather around the user, the user's habits, the user's previous speech context, the user's career, the user's position, the user's financial status, the current time, and the user's language, can be used as context information.
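Putting the first specific determination procedure together, a minimal sketch might compare the extracted context against preset model characteristics by keyword overlap. Everything below (model names, trait sets, the overlap heuristic) is a hypothetical illustration, not the patent's concrete algorithm:

```python
def select_result(results, model_traits, extract_context):
    """Pick the result of the model whose preset traits best match the context.

    results: {model_name: recognized_text}
    model_traits: {model_name: set of context keywords preset per model}
    extract_context: callable mapping the result texts to a keyword set
    """
    context = extract_context(results.values())
    best_model = max(results, key=lambda name: len(context & model_traits[name]))
    return results[best_model]

# Illustrative usage for the address-search example above:
results = {"model_1": "tell me the address of Lee Gil-dong",
           "model_2": "tell me the address of Gil-dong"}
traits = {"model_1": {"address"}, "model_2": {"weather", "music"}}
print(select_result(results, traits,
                    lambda texts: {w for t in texts for w in t.split()}))
# -> "tell me the address of Lee Gil-dong" (model_1's result)
```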
- FIGS. 4 and 5 are views showing another example of a speech recognition apparatus according to an embodiment of the present invention.
- the speech recognition apparatus recognizes speech data obtained from a user with a plurality of speech recognition models, and selects one of the results recognized by the speech recognition models based on context information, thereby providing a speech recognition service.
- the voice recognition device may generate the first voice recognition result 4020 by recognizing the voice information input from the user with the first voice recognition model 4010.
- the first voice recognition model is a voice recognition model for extracting context from the voice information obtained from the user, and may be configured to use only a small amount of resources, in keeping with this purpose.
- the voice recognition apparatus may select (4030) the specific voice recognition model most suitable for recognizing the voice information input from the user from among a plurality of preset voice recognition models, using a second specific determination procedure (for example, a second context-based appropriate voice recognition model determination technique).
- the speech recognition apparatus may select a specific speech recognition model from among the plurality of speech recognition models based on the first speech recognition result, according to the purpose and use of speech recognition.
- a speech recognition model that is most suitable for address search among a plurality of candidate speech recognition models may be selected as a specific speech recognition model.
- the second specific determination procedure includes a process of extracting context information from the first speech recognition result to select a specific speech recognition model, and selecting a specific speech recognition model using the extracted context information.
- the voice recognition apparatus may re-recognize the voice data converted from the voice information input by the user using the selected specific voice recognition model to finally generate the voice recognition result 4040.
- FIG. 5 is a flowchart illustrating an example of a second specific determination procedure for determining an appropriate speech recognition model based on the context of the first speech recognition result.
- the speech recognition device generates (or receives) the first speech recognition result 5010 by recognizing the user's speech information with the first speech recognition model described with reference to FIG. 4, and, based on the generated speech recognition result and the second specific determination procedure, may select (5020) the specific speech recognition model most suitable for the purpose of speech recognition from among a plurality of (e.g., N) speech recognition models, as in FIG. 4.
- the second specific determination procedure includes a process of extracting context information from the first speech recognition result to select a specific speech recognition model, and selecting a specific speech recognition model using the extracted context information.
- for example, from 'tell me the address of Lee Gil-dong', 'tell me the address' can be extracted as the context information.
- as described above, the first voice recognition model is a voice recognition model for extracting context from the voice information obtained from the user, and may be configured to use only a small amount of resources, in keeping with this purpose.
- in addition to a part of the sentence recognized through the speech recognition model, any information that can be inferred from the recognition result may be used as the context information.
- at least one piece of information related to the user, such as the user's location, the weather around the user, the user's habits, the user's previous speech context, the user's career, the user's position, the user's financial status, the current time, and the user's language, can be used as context information.
- the speech recognition apparatus may select a specific speech recognition model that is most suitable for the purpose of speech recognition, as shown in FIG. 4, from the plurality of speech recognition models using the extracted context information 'tell me the address'.
- the speech recognition apparatus may extract context information through a speech recognition model for obtaining context information, and select a specific speech recognition model that is most suitable for the purpose of speech recognition.
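A sketch of this two-pass flow, with a lightweight first-pass recognizer used only to obtain context. The recognizer functions, model names, and keyword heuristic are all assumptions for illustration:

```python
def recognize_with_context(voice_data, context_model, candidate_models):
    """Second specific determination procedure as a two-pass sketch.

    context_model: lightweight first-pass recognizer (the 'first model').
    candidate_models: {name: (context_keyword_set, recognizer_fn)} for the
        N preset models and their characteristics.
    """
    first_pass_text = context_model(voice_data)   # e.g. "tell me the address of ..."
    context = set(first_pass_text.split())        # crude context extraction
    # Choose the model whose preset characteristics best match the context
    name = max(candidate_models,
               key=lambda n: len(context & candidate_models[n][0]))
    _, recognizer = candidate_models[name]
    return recognizer(voice_data)                 # re-recognize with the chosen model
```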
- FIG. 6 is a view showing another example of a speech recognition apparatus according to an embodiment of the present invention.
- the speech recognition apparatus may select a specific speech recognition model from among a plurality of speech recognition models in advance by setting context information for speech recognition, and may provide a speech recognition service using the speech recognition results recognized through the selected speech recognition model.
- the speech recognition apparatus may select a specific speech recognition model, which is determined to be most suitable for speech recognition, from among a plurality of speech recognition models according to predetermined context information (6010).
- the speech recognition apparatus may select a speech recognition model preset for use in address search among a plurality of speech recognition models as a specific speech recognition model.
- the voice recognition apparatus may generate the voice recognition result 6020 by recognizing voice data obtained from the user through the selected specific voice recognition model.
- the voice data may mean data in which voice information obtained from a user is converted into an electrical signal, and the analog signal, which is the converted electrical signal, is then converted into a digital signal.
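Because the context is fixed before any speech arrives, the model can be chosen once up front. A brief sketch under the same illustrative assumptions as the two-pass example above:

```python
def build_recognizer(preset_context, candidate_models):
    """Select a specific model in advance from preset context information.

    preset_context: set of context keywords configured for the service
        (e.g., an address-search screen); candidate_models as above,
        i.e. {name: (context_keyword_set, recognizer_fn)}.
    """
    name = max(candidate_models,
               key=lambda n: len(preset_context & candidate_models[n][0]))
    _, recognizer = candidate_models[name]
    return recognizer  # later: recognizer(voice_data) -> recognition result
```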
- FIG. 7 is a flowchart illustrating an example of a speech recognition method according to an embodiment of the present invention.
- the speech recognition device generates speech recognition results through a plurality of speech recognition models, and may provide a speech recognition service by selecting the most suitable one among the generated speech recognition results.
- the speech recognition device may acquire speech information from a user and convert the acquired speech information into speech data (S7010).
- the voice recognition device may convert voice information obtained from a user into an electrical signal, and convert the analog signal, which is the converted electrical signal, into voice data, which is a digital signal.
- the voice recognition apparatus may recognize the voice data using the first voice recognition model and the second voice recognition model, respectively, and generate first voice recognition results and second voice recognition results (S7020, S7030).
- the speech recognition apparatus may provide a speech recognition service by selecting, through the first specific determination procedure described in FIGS. 2 and 3, the speech recognition result more suitable for the purpose of speech recognition from the first speech recognition result and the second speech recognition result (S7040).
- specifically, the speech recognition apparatus extracts context information from the first speech recognition result and the second speech recognition result, and may compare the extracted context information with the first characteristic of the first speech recognition model and the second characteristic of the second speech recognition model, respectively.
- the speech recognition apparatus may select a speech recognition model suitable for the purpose and/or use of speech recognition from the first speech recognition model and the second speech recognition model based on the comparison result.
- for example, when the second speech recognition model is selected, the speech recognition apparatus may select the second speech recognition result generated by it as the speech recognition result, and provide a speech recognition service based on the selected second speech recognition result.
- FIG. 8 is a flowchart illustrating another example of a speech recognition method according to an embodiment of the present invention.
- the speech recognition device may extract context information from speech data and provide a speech recognition service based on the extracted context information.
- step S8010 is the same as step S7010 in FIG. 7, and thus description thereof will be omitted.
- the speech recognition device recognizes the speech data using the first speech recognition model to generate a first speech recognition result (S8020).
- as described in FIG. 4, the first voice recognition model is a voice recognition model for extracting context from the voice information obtained from the user, and may be configured to use only a small amount of resources, in keeping with this purpose.
- the speech recognition device may extract context information from the first speech recognition result (S8030).
- the context information may mean all information that can be inferred through a recognition result, etc., in addition to a part of a sentence recognized through a voice recognition model.
- at least one piece of information related to the user, such as the user's location, the weather around the user, the user's habits, the user's previous speech context, the user's career, the user's position, the user's financial status, the current time, and the user's language, can be used as context information.
- the voice recognition apparatus may select (S8040) the specific voice recognition model most suitable for recognizing the voice information input from the user from among a plurality of preset voice recognition models, using the second specific determination procedure described with reference to FIGS. 4 and 5.
- the speech recognition apparatus may select a specific speech recognition model from among the plurality of speech recognition models based on the first speech recognition result, according to the purpose and use of speech recognition.
- for example, the speech recognition model most suitable for address search among a plurality of candidate speech recognition models may be selected as the specific speech recognition model.
- the second specific determination procedure includes a process of extracting context information from the first speech recognition result to select a specific speech recognition model, and selecting a specific speech recognition model using the extracted context information.
- the voice recognition apparatus may re-recognize the voice data converted from the voice information input by the user using the selected specific voice recognition model to finally generate a voice recognition result (S8050).
- the voice recognition apparatus may provide a voice recognition service based on a voice recognition result of recognizing voice data through a specific voice recognition model.
- FIG. 9 is a flowchart illustrating another example of a speech recognition method according to an embodiment of the present invention.
- the voice recognition apparatus may select a specific voice recognition model from among a plurality of voice recognition models based on context information before receiving voice information from the user, and may recognize the voice information input from the user through the selected voice recognition model.
- the speech recognition device may preset context information for speech recognition.
- the context information may mean all information that can be inferred through a recognition result, etc., in addition to a part of a sentence recognized through a voice recognition model.
- at least one piece of information related to the user, such as the user's location, the weather around the user, the user's habits, the user's previous speech context, the user's career, the user's position, the user's financial status, the current time, and the user's language, can be used as context information.
- the speech recognition apparatus selects a specific speech recognition model according to the purpose / use of speech recognition among a plurality of speech recognition models based on context information (S9020).
- the speech recognition apparatus may select a speech recognition model preset for an address search among a plurality of speech recognition models as a specific speech recognition model.
- the voice recognition apparatus may then acquire voice information from the user and convert the obtained voice information into voice data (S9010).
- the voice recognition device may convert voice information obtained from a user into an electrical signal, and convert the analog signal, which is the converted electrical signal, into voice data, which is a digital signal.
- the speech recognition device may generate speech recognition results by recognizing speech data using the selected specific speech recognition model (S9050).
- the voice recognition apparatus may provide a voice recognition service based on a voice recognition result of recognizing voice data through a specific voice recognition model.
- Embodiments according to the present invention may be implemented by various means, for example, hardware, firmware, software, or a combination thereof.
- one embodiment of the invention may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, and the like.
- an embodiment of the present invention may be implemented in the form of a module, procedure, function, etc. that performs the functions or operations described above.
- the software code can be stored in memory and driven by a processor.
- the memory is located inside or outside the processor, and can exchange data with the processor by various known means.
- the present invention can be applied to various voice recognition technology fields, and the present invention can provide a method for selecting an optimal voice recognition model based on context.
- This feature can be applied not only to voice recognition, but also to other artificial intelligence services.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a method for recognizing a voice and a device therefor. More particularly, a voice recognition device according to the present invention can acquire voice information from a user and convert the acquired voice information into voice data. Afterward, a voice recognition model can generate a first voice recognition result by recognizing the converted voice data with a first voice recognition model, generate a second voice recognition result by recognizing the converted voice data with a second voice recognition model, and select a specific voice recognition result from the first voice recognition result and the second voice recognition result through a specific determination procedure.
Description
The present invention relates to a method and apparatus for recognizing a user's voice and, more specifically, to a method and apparatus for improving speech recognition accuracy based on context when recognizing speech obtained from a user.
Automatic speech recognition (hereinafter, speech recognition) is a technology that converts speech into text using a computer. The recognition rate of speech recognition has improved dramatically in recent years.
However, although the overall recognition rate has improved, performance differences occur depending on the composition of the data used when training the language model or acoustic model and on the structure of the model.
An object of the present invention is to provide a method for selecting a speech recognition result with high accuracy from among a plurality of speech recognition results when speech is recognized using a plurality of speech recognition models.
Another object of the present invention is to provide a method for selecting a speech recognition model for speech recognition using context information.
The technical problems to be achieved by the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those of ordinary skill in the art from the description below.
A method of recognizing a voice according to the present invention includes: obtaining voice information from a user; converting the acquired voice information into voice data; recognizing the converted voice data with a first speech recognition model to generate a first speech recognition result; recognizing the converted voice data with a second speech recognition model to generate a second speech recognition result; and selecting a specific speech recognition result from the first speech recognition result and the second speech recognition result through a specific determination procedure.
In the present invention, the specific determination procedure includes: extracting context information from the first speech recognition result and the second speech recognition result; comparing the context information with a preset first characteristic of the first speech recognition model and a preset second characteristic of the second speech recognition model, respectively; and selecting one of the first speech recognition result and the second speech recognition result based on the comparison result.
In the present invention, the context information may include at least one of: a portion of the voice information, information that can be obtained from the first and second speech recognition results, or information related to the user who spoke.
In the present invention, the first speech recognition model and the second speech recognition model are among a plurality of speech recognition models for recognizing the voice information obtained from the user.
The method may further include recognizing the converted voice data with the plurality of speech recognition models to generate a plurality of speech recognition results, wherein the specific speech recognition result is selected from among the first speech recognition result, the second speech recognition result, and the plurality of speech recognition results.
In the present invention, the specific determination procedure is a procedure for determining a speech recognition result based on the context included in the context information.
The present invention also provides a method including: obtaining voice information from a user; converting the acquired voice information into voice data; recognizing the voice data with a first speech recognition model to generate a first speech recognition result; selecting, based on the first speech recognition result, a second speech recognition model for recognizing the voice data from among a plurality of speech recognition models; and recognizing the voice data with the second speech recognition model to generate a second speech recognition result.
The method may further include extracting context information from the first speech recognition result, and comparing the context information with preset characteristics of the plurality of speech recognition models, wherein the second speech recognition model is selected based on the comparison result.
In the present invention, the first speech recognition model is a speech recognition model for extracting the context information.
The present invention also provides a method including: obtaining voice information from a user; converting the acquired voice information into voice data; and recognizing the voice data with a specific speech recognition model selected from among a plurality of speech recognition models to generate a speech recognition result.
The method may further include setting context information for speech recognition, and selecting, from among the plurality of speech recognition models, the specific speech recognition model whose characteristics best fit the context information.
According to an embodiment of the present invention, when a plurality of results are generated by recognizing a speech input with a plurality of speech recognition models, the accuracy of speech recognition can be increased by selecting the recognition result of the speech recognition model with high accuracy among them.
In addition, by selecting a speech recognition model according to context information, each of the plurality of speech recognition models can be used for its intended purpose.
In addition, an appropriate speech recognition model can be selected even for a service with a large number of users, or in an environment in which the physical and situational surroundings of the user change frequently.
In addition, because an appropriate speech recognition model can be selected, misrecognition due to similar vocabulary that may occur when applying a large language model can be reduced, as can misrecognition due to unregistered vocabulary that may occur when applying a small language model.
The accompanying drawings, which are included as part of the detailed description to aid understanding of the present invention, provide embodiments of the present invention and, together with the detailed description, explain the technical features of the present invention.
FIG. 1 is a block diagram of a voice recognition device according to an embodiment of the present invention.
FIGS. 2 and 3 are views showing an example of a speech recognition apparatus according to an embodiment of the present invention.
FIGS. 4 and 5 are views showing another example of a speech recognition apparatus according to an embodiment of the present invention.
FIG. 6 is a view showing another example of a speech recognition apparatus according to an embodiment of the present invention.
FIG. 7 is a flowchart illustrating an example of a speech recognition method according to an embodiment of the present invention.
FIG. 8 is a flowchart illustrating another example of a speech recognition method according to an embodiment of the present invention.
FIG. 9 is a flowchart illustrating another example of a speech recognition method according to an embodiment of the present invention.
Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. The detailed description set forth below, together with the accompanying drawings, is intended to describe exemplary embodiments of the present invention and is not intended to represent the only embodiments in which the present invention may be practiced. The following detailed description includes specific details to provide a thorough understanding of the present invention; however, those skilled in the art will appreciate that the present invention may be practiced without these specific details.
In some cases, in order to avoid obscuring the concept of the present invention, well-known structures and devices may be omitted or illustrated in block diagram form centered on the core functions of each structure and device.
FIG. 1 is a block diagram of a voice recognition device according to an embodiment of the present invention.
Referring to FIG. 1, the voice recognition device 100 for recognizing a user's voice may include an input unit 110, a storage unit 120, a control unit 130, and/or an output unit 140.
The components shown in FIG. 1 are not essential, so an electronic device having more or fewer components may be implemented.
Hereinafter, the components are described in turn.
The input unit 110 may receive an audio signal, a video signal, or voice information (or a voice signal) and data from a user.
The input unit 110 may include a camera and a microphone to receive an audio or video signal. The camera processes video frames, such as still images or video, obtained by an image sensor in a video call mode or a shooting mode.
The image frames processed by the camera may be stored in the storage unit 120.
The microphone receives an external sound signal in a call mode, a recording mode, or a voice recognition mode and processes it into electrical voice data. Various noise reduction algorithms for removing the noise generated in the process of receiving the external sound signal may be implemented in the microphone.
When the user's uttered voice is input through a microphone, the input unit 110 may convert it into an electrical signal and transmit it to the control unit 130.
The control unit 130 may acquire the user's voice data by applying a speech recognition algorithm or a speech recognition engine to the signal received from the input unit 110.
At this time, the signal input to the control unit 130 may be converted into a form more useful for voice recognition: the control unit 130 converts the input signal from analog to digital form and detects the start and end points of the voice, thereby detecting the actual voice section/data contained in the voice data. This is called End Point Detection (EPD).
Within the detected section, the control unit 130 may extract a feature vector of the signal by applying a feature-vector extraction technique such as Cepstrum, Linear Predictive Coefficient (LPC), Mel Frequency Cepstral Coefficient (MFCC), or Filter Bank Energy.
The memory 120 may store a program for the operation of the control unit 130 and may temporarily store input/output data. A sample file for a symbol-based malicious code detection model may be stored by a user, and analysis results of the malicious code may be stored.
The memory 120 may store various data related to the recognized voice and, in particular, may store information related to the end point of the voice data processed by the control unit 130 and the feature vectors.
The memory 120 may include at least one storage medium among flash memory, hard disk, memory card, ROM (Read-Only Memory), RAM (Random Access Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), PROM (Programmable Read-Only Memory), magnetic memory, magnetic disk, and optical disk.
그리고, 제어부(130)는 추출된 특징벡터와 훈련된 기준패턴과의 비교를 통하여 인식결과를 얻을 수 있다. 이를 위해, 음성의 신호적인 특성을 모델링하여 비교하는 음성인식모델과 인식어휘에 해당하는 단어나 음절 등의 언어적인 순서관계를 모델링하는 언어모델(Language Model)이 사용될 수 있다.Then, the control unit 130 may obtain a recognition result through comparison between the extracted feature vector and the trained reference pattern. To this end, a speech recognition model for modeling and comparing signal characteristics of speech and a language model for modeling linguistic order relationships such as words or syllables corresponding to recognized vocabulary may be used.
음성인식모델은 다시 인식대상을 특징벡터 모델로 설정하고 이를 음성데이터의 특징벡터와 비교하는 직접비교방법과 인식대상의 특징벡터를 통계적으로 처리하여 이용하는 통계방법으로 나뉠 수 있다.The speech recognition model can be divided into a direct comparison method that sets the recognition target as a feature vector model and compares it with the feature vector of speech data, and a statistical method that statistically processes the feature vector of the recognition target.
직접비교방법은 인식대상이 되는 단어, 음소 등의 단위를 특징벡터모델로 설정하고 입력음성이 이와 얼마나 유사한지를 비교하는 방법으로서, 대표적으로 벡터양자화(Vector Quantization) 방법이 있다. 벡터 양자화 방법에 의하면 입력된 음성데이터의 특징벡터를 기준모델인 코드북(codebook)과 매핑시켜 대표값으로 부호화함으로써 이 부호값들을 서로 비교하는 방법이다.The direct comparison method is a method of setting units of words, phonemes, and the like to be recognized as feature vector models and comparing how similar the input voices are to each other. A representative method is vector quantization. According to the vector quantization method, a feature vector of the input speech data is mapped to a codebook, which is a reference model, and encoded as a representative value, thereby comparing these code values.
통계적모델 방법은 인식대상에 대한 단위를 상태열(State Sequence)로 구성하고 상태열간의 관계를 이용하는 방법이다. 상태열은 복수의 노드(node)로 구성될 수 있다. 상태열 간의 관계를 이용하는 방법은 다시 동적시간 와핑(Dynamic Time Warping: DTW), 히든마르코프모델(Hidden Markov Model: HMM), 신경회로망을 이용한 방식 등이 있다.The statistical model method is a method of constructing a unit for a recognition object into a state sequence and using the relationship between the state columns. The status column may consist of a plurality of nodes. The methods of using the relationship between the state columns are dynamic time warping (DTW), hidden markov model (HMM), and neural network.
동적시간 와핑은 같은 사람이 같은 발음을 해도 신호의 길이가 시간에 따라 달라지는 음성의 동적 특성을 고려하여 기준모델과 비교할 때 시간축에서의 차이를 보상하는 방법이고, 히든마르코프모델은 음성을 상태천이확률 및 각 상태에서의 노드(출력심볼)의 관찰확률을 갖는 마르코프프로세스로 가정한 후에 학습데이터를 통해 상태천이확률 및 노드의 관찰확률을 추정하고, 추정된 모델에서 입력된 음성이 발생할 확률을 계산하는 인식기술이다.Dynamic time warping is a method of compensating for differences in the time axis when compared with the reference model by considering the dynamic characteristics of the voice whose signal length varies with time even if the same person pronounces the same, and the Hidden Markov model makes the speech state transition probability. And after assuming the Markov process having the observation probability of the node (output symbol) in each state, estimates the state transition probability and the observation probability of the node through the learning data, and calculates the probability that the input voice will occur in the estimated model It is a recognition technology.
Meanwhile, a language model, which models the linguistic ordering of words or syllables, can reduce acoustic ambiguity and recognition errors by applying the ordering relations between linguistic units to the units obtained from speech recognition. Language models include statistical language models and models based on finite state automata (FSA); statistical language models use word chain probabilities such as unigram, bigram, and trigram probabilities.
The controller 130 may use any of the above methods when recognizing speech. For example, it may use a speech recognition model based on the hidden Markov model, or an N-best search that integrates a speech recognition model with a language model. The N-best search can improve recognition performance by selecting up to N recognition candidates with the speech recognition model and the language model and then re-ranking those candidates.
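A hedged sketch of N-best re-ranking follows. The weighted sum of acoustic and language-model log scores and the `lm_score` callable are assumptions for illustration only.

```python
def n_best_search(hypotheses, lm_score, n=5, lm_weight=0.7):
    """Keep the N best hypotheses from the speech model, then re-rank them
    with the language model and return the top-scoring sentence."""
    # hypotheses: list of (sentence, acoustic_log_score) pairs
    # lm_score: callable returning a language-model log score for a sentence
    top_n = sorted(hypotheses, key=lambda h: h[1], reverse=True)[:n]
    rescored = [(s, a + lm_weight * lm_score(s)) for s, a in top_n]
    return max(rescored, key=lambda h: h[1])[0]
```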
The controller 130 may compute a confidence score (which may be abbreviated as 'confidence') to ensure the reliability of the recognition result.
The confidence score is a measure of how trustworthy a speech recognition result is; for a recognized phoneme or word, it can be defined as a value relative to the probability that the utterance came from some other phoneme or word. Accordingly, the confidence score may be expressed as a value between 0 and 1 or between 0 and 100. When the confidence score exceeds a preset threshold, the recognition result is accepted; when it falls below the threshold, the result is rejected.
Beyond this, the confidence score may be obtained by any of various conventional confidence-scoring algorithms.
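A minimal sketch of the accept/reject rule might be the following, assuming a confidence already normalized to [0, 1] (a 0-100 score would be divided by 100 first); the threshold value is illustrative.

```python
def filter_by_confidence(result_text, confidence, threshold=0.5):
    """Accept the recognition result only when its confidence clears the threshold."""
    # confidence assumed normalized to [0, 1]; threshold is an illustrative default
    return result_text if confidence > threshold else None  # None = rejected
```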
The controller 130 may be implemented in a computer-readable recording medium using software, hardware, or a combination thereof. In a hardware implementation, it may be realized with at least one of the following electrical units: application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), processors, microcontrollers, and microprocessors.
In a software implementation, it may be implemented together with separate software modules that each perform at least one function or operation, and the software code may be realized by a software application written in an appropriate programming language.
The controller 130 implements the functions, processes, and/or methods proposed in FIGS. 2 to 6 described below; for convenience of description, the controller 130 is hereinafter identified with the speech recognition device 100.
The output unit 140 generates output related to sight, hearing, and the like, and outputs the information processed by the device 100.
For example, the output unit 140 may output the recognition result of the speech signal processed by the controller 130 so that the user can perceive it visually or audibly.
The speech recognition models described below may recognize the speech information input by the user in the same manner as the speech recognition model described with reference to FIG. 1.
FIGS. 2 and 3 are diagrams illustrating an example of a speech recognition device according to an embodiment of the present invention.
Referring to FIGS. 2 and 3, the speech recognition device recognizes the speech data obtained from a user with a plurality of speech recognition models and, based on context information, selects one of the results recognized by those models to provide the speech recognition service.
Specifically, the speech recognition device may convert the speech information input by the user into an electrical signal and convert that analog electrical signal into a digital signal to generate speech data.
The speech recognition device may then recognize the speech data with each of a first speech recognition model 2010 and a second speech recognition model 2020.
Using a base speech recognition model and a user speech recognition model respectively, the speech recognition device may obtain two speech recognition results (speech recognition result 1 (2030) and speech recognition result 2 (2040)) from the speech data converted from the speech signal input by the user.
The speech recognition device may apply the first speech recognition result and the second speech recognition result to a first specific determination procedure (for example, a first context-based technique for determining the appropriate speech recognition model) and select and output the more suitable of the two results (2050).
That is, through the first specific determination procedure the speech recognition device may select, between the first and second speech recognition results, the result better suited to the purpose of the speech recognition, and output it.
For example, when the context information extracted from the first and second speech recognition results relates to an address search, the device may select whichever of the first and second speech recognition models is better suited to address search and provide that model's recognition result as the speech recognition service.
The first specific determination procedure is described below with reference to FIG. 3.
FIG. 3 is a flowchart illustrating an example of the first specific determination procedure for determining, based on context, the appropriate speech recognition model between the first speech recognition result and the second speech recognition result.
As illustrated in FIG. 3, when the first speech recognition result 3010 and the second speech recognition result 3020 have been generated by the first and second speech recognition models respectively, the first specific determination procedure selects the speech recognition model better suited to the purpose of the speech recognition (3034), based on the context 3032 extracted from the two results.
The speech recognition device may then select the recognition result generated by the selected model (3036) and output it (3040).
For example, in FIG. 3, from the first speech recognition result 'tell me Lee Gi-tong's address' ('이기통 주소 좀 알려줘') and the second speech recognition result 'tell me Lee Gil-dong's address' ('이길동 주소 좀 알려줘'), the speech recognition device determined 'tell me the address' ('주소 좀 알려줘') to be the context information.
Specifically, the speech recognition device may extract the context information 'tell me the address' (3032) from 'tell me Lee Gi-tong's address' and 'tell me Lee Gil-dong's address'.
The device may then compare the extracted context information with the characteristic of the first speech recognition model (the first characteristic) and the characteristic of the second speech recognition model (the second characteristic), and select the first speech recognition model as the model better suited to the purpose of the speech recognition (3034).
The device may then select the first speech recognition result of the selected first model (3036) and output it: 'tell me Lee Gi-tong's address'.
Here, besides the recognized portion of the sentence obtained from the speech data, any information that can be inferred from the recognition result and the like may be used as context information.
For example, at least one piece of information related to the user may serve as context information: the user's location, the weather the user is experiencing, the user's habits, the user's previous utterance context, the user's career, the user's job title, the user's financial status, the current time, the user's language, and so on.
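The first specific determination procedure of FIGS. 2 and 3 could be sketched as follows. The `recognize` method, the `traits` keyword sets, and the word-overlap context extraction are all simplifying assumptions for illustration; the disclosure does not prescribe these particulars.

```python
def extract_context(*hypotheses):
    """Naive context extraction: the words shared by every hypothesis."""
    return set.intersection(*(set(h.split()) for h in hypotheses))

def context_match(context, traits):
    """How many of a model's trait keywords appear in the extracted context."""
    return len(context & traits)

def first_determination(audio, model1, model2):
    """Recognize with both models, extract context from the two hypotheses,
    and keep the result from the model whose traits better fit that context."""
    r1, r2 = model1.recognize(audio), model2.recognize(audio)
    context = extract_context(r1, r2)
    if context_match(context, model1.traits) >= context_match(context, model2.traits):
        return r1
    return r2
```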
FIGS. 4 and 5 are diagrams illustrating another example of a speech recognition device according to an embodiment of the present invention.
Referring to FIGS. 4 and 5, the speech recognition device recognizes the speech data obtained from a user with a plurality of speech recognition models and, based on context information, selects one of the results recognized by those models to provide the speech recognition service.
Specifically, the speech recognition device may recognize the speech information input by the user with a first speech recognition model 4010 to generate a first speech recognition result 4020.
Here, the first speech recognition model is a speech model for extracting context from the speech information obtained from the user and, in keeping with this purpose, may be configured to use only a small amount of resources.
Using a second specific determination procedure (for example, a second context-based technique for determining the appropriate speech recognition model), the speech recognition device may select, from among a plurality of preset speech recognition models, the specific model best suited to recognizing the speech information input by the user (4030).
That is, the speech recognition device may select a specific speech recognition model from among the plurality of models based on the first speech recognition result, according to the purpose and use of the speech recognition.
For example, when the context information extracted from the first speech recognition result relates to an address search, the model best suited to address search among the plurality of candidate models may be selected as the specific speech recognition model.
Here, the second specific determination procedure includes extracting context information from the first speech recognition result and selecting the specific speech recognition model using the extracted context information.
The selected specific speech recognition model is then used to re-recognize the speech data converted from the user's input speech information, finally generating the speech recognition result 4040.
The second specific determination procedure is described below with reference to FIG. 5.
FIG. 5 is a flowchart illustrating an example of the second specific determination procedure for determining the appropriate speech recognition model based on context.
Specifically, the speech recognition device generates (or receives) a first speech recognition result 5010 produced by recognizing the user's speech information with the first speech recognition model described in FIG. 4, and, through the second specific determination procedure based on that result, selects from among a plurality of (for example, N) speech recognition models the specific model best suited to the purpose of the speech recognition, as discussed in FIG. 4 (5020).
Here, the second specific determination procedure includes extracting context information from the first speech recognition result and selecting the specific speech recognition model using the extracted context information.
For example, from the first speech recognition result 'tell me Lee Gil-dong's address', recognized by the first speech recognition model, 'tell me the address' may be extracted as context information.
Here, as noted above, the first speech recognition model is a speech model for extracting context from the speech information obtained from the user and, in keeping with this purpose, may be configured to use only a small amount of resources.
As context information, any information that can be inferred from the recognition result and the like may be used, in addition to the recognized portion of the sentence produced by the speech recognition model.
For example, at least one piece of information related to the user may serve as context information: the user's location, the weather the user is experiencing, the user's habits, the user's previous utterance context, the user's career, the user's job title, the user's financial status, the current time, the user's language, and so on.
The speech recognition device may then use the extracted context information, 'tell me the address', to select from the plurality of speech recognition models the specific model best suited to the purpose of the speech recognition, as discussed in FIG. 4.
In this way, the speech recognition device can extract context information through a speech recognition model dedicated to obtaining context and then select the specific speech recognition model best suited to the purpose of the speech recognition.
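Under the same simplifying assumptions as the earlier sketch (models exposing `recognize` and `traits`), the two-pass flow of FIGS. 4 and 5 might be sketched as follows.

```python
def second_determination(audio, context_model, candidate_models):
    """First pass with a lightweight context model, then a final pass with the
    candidate model whose traits best match the extracted context."""
    first_pass = context_model.recognize(audio)  # e.g. "tell me Lee Gil-dong's address"
    context = set(first_pass.split())            # crude word-level context extraction
    chosen = max(candidate_models, key=lambda m: len(context & m.traits))
    return chosen.recognize(audio)               # re-recognize with the chosen model
```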
FIG. 6 is a diagram illustrating another example of a speech recognition device according to an embodiment of the present invention.
Referring to FIG. 6, the speech recognition device may set context information for speech recognition and thereby select a specific speech recognition model from among a plurality of models in advance, then provide the speech recognition service using the result recognized by the selected model.
Specifically, the speech recognition device may select, according to the preset context information, the specific speech recognition model judged most suitable for the speech recognition from among the plurality of models (6010).
For example, when the purpose and use of the speech recognition service is address search, the device may select as the specific model the speech recognition model preset for address search among the plurality of models.
The device may then recognize the speech data obtained from the user with the selected specific speech recognition model to generate the speech recognition result 6020.
Here, the speech data refers to data obtained by converting the speech information acquired from the user into an electrical signal and converting that analog electrical signal into a digital signal.
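The preset-context flow of FIG. 6 differs from the previous sketch only in that the context set is fixed in advance rather than extracted from a first pass; a sketch under the same assumptions:

```python
def recognize_with_preset_context(audio, preset_context, models):
    """Choose the model from a preset context set (e.g. {"address", "search"})
    before any speech arrives, then recognize the speech data with it."""
    chosen = max(models, key=lambda m: len(preset_context & m.traits))
    return chosen.recognize(audio)
```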
FIG. 7 is a flowchart illustrating an example of a speech recognition method according to an embodiment of the present invention.
Referring to FIG. 7, as discussed with reference to FIGS. 2 and 3, the speech recognition device generates speech recognition results with a plurality of speech recognition models and selects the most suitable among them to provide the speech recognition service.
Specifically, the speech recognition device may obtain speech information from a user and convert the obtained speech information into speech data (S7010).
For example, the device may convert the speech information obtained from the user into an electrical signal and convert that analog electrical signal into speech data in the form of a digital signal.
The device may then recognize the speech data with a first speech recognition model and a second speech recognition model respectively, generating a first speech recognition result and a second speech recognition result (S7020, S7030).
Next, through the first specific determination procedure discussed in FIGS. 2 and 3, the device may select whichever of the first and second speech recognition results is better suited to the purpose of the speech recognition and provide the speech recognition service (S7040).
For example, the device may extract context information from the first and second speech recognition results and compare it with the preset first characteristic of the first speech recognition model and the preset second characteristic of the second speech recognition model, respectively.
Based on the comparison, the device may then select whichever of the first and second speech recognition models suits the purpose and/or use of the speech recognition.
For instance, when the second speech recognition model is selected, the device selects the second speech recognition result generated by that model as the speech recognition result and provides the speech recognition service based on it.
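To see the FIG. 7 flow end to end, the earlier `first_determination` sketch can be exercised with stub recognizers; `StubModel` and its canned transcripts are purely hypothetical test scaffolding, not part of the disclosed method.

```python
class StubModel:
    """Minimal stand-in recognizer for exercising the selection sketches above."""
    def __init__(self, transcript, traits):
        self.transcript, self.traits = transcript, set(traits)
    def recognize(self, audio):
        return self.transcript  # a real model would decode the audio here

base = StubModel("tell me Lee Gi-tong's address", {"address", "search"})
user = StubModel("tell me Lee Gil-dong's address", {"dictation"})
print(first_determination(b"", base, user))  # -> "tell me Lee Gi-tong's address"
```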
FIG. 8 is a flowchart illustrating another example of a speech recognition method according to an embodiment of the present invention.
Referring to FIG. 8, the speech recognition device may extract context information from the speech data and provide the speech recognition service based on the extracted context information.
Step S8010 is the same as step S7010 of FIG. 7, so its description is omitted.
The speech recognition device then recognizes the speech data with the first speech recognition model to generate a first speech recognition result (S8020).
Here, as described in FIG. 4, the first speech recognition model is a speech model for extracting context from the speech information obtained from the user and, in keeping with this purpose, may be configured to use only a small amount of resources.
The speech recognition device may extract context information from the first speech recognition result (S8030).
The context information may mean any information that can be inferred from the recognition result and the like, in addition to the recognized portion of the sentence produced by the speech recognition model.
For example, at least one piece of information related to the user may serve as context information: the user's location, the weather the user is experiencing, the user's habits, the user's previous utterance context, the user's career, the user's job title, the user's financial status, the current time, the user's language, and so on.
The device may then use the second specific determination procedure described in FIGS. 4 and 5 to select, from among the plurality of preset speech recognition models, the specific model best suited to recognizing the speech information input by the user (S8040).
That is, the speech recognition device may select a specific speech recognition model from among the plurality of models based on the first speech recognition result, according to the purpose and use of the speech recognition.
For example, when the context information extracted from the first speech recognition result relates to an address search, the speech recognition model best suited to address search among the plurality of candidate models may be selected as the specific speech recognition model.
Here, the second specific determination procedure includes extracting context information from the first speech recognition result and selecting the specific speech recognition model using the extracted context information.
The selected specific speech recognition model is then used to re-recognize the speech data converted from the user's input speech information, finally generating the speech recognition result.
The device may then provide the speech recognition service based on the speech recognition result obtained by recognizing the speech data with the specific speech recognition model.
FIG. 9 is a flowchart illustrating another example of a speech recognition method according to an embodiment of the present invention.
Referring to FIG. 9, the speech recognition device may select a specific speech recognition model from among a plurality of models based on context information before receiving speech information from the user, and may recognize the speech information input by the user with the selected model.
Specifically, the speech recognition device may preset context information for speech recognition.
The context information may mean any information that can be inferred from the recognition result and the like, in addition to the recognized portion of the sentence produced by the speech recognition model.
For example, at least one piece of information related to the user may serve as context information: the user's location, the weather the user is experiencing, the user's habits, the user's previous utterance context, the user's career, the user's job title, the user's financial status, the current time, the user's language, and so on.
The device then selects, based on the context information, a specific speech recognition model from among the plurality of models according to the purpose/use of the speech recognition (S9020).
For example, in the case of address search, the device may select as the specific model the speech recognition model preset for address search among the plurality of models.
Then, when speech information is obtained from the user, the device may convert the obtained speech information into speech data (S9010).
For example, the device may convert the speech information obtained from the user into an electrical signal and convert that analog electrical signal into speech data in the form of a digital signal.
The device may then recognize the speech data with the selected specific speech recognition model to generate a speech recognition result (S9050).
The device may then provide the speech recognition service based on the speech recognition result obtained by recognizing the speech data with the specific speech recognition model.
Embodiments according to the present invention may be implemented by various means, for example, hardware, firmware, software, or a combination thereof. In a hardware implementation, an embodiment of the present invention may be realized by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, and the like.
In a firmware or software implementation, an embodiment of the present invention may be realized in the form of modules, procedures, functions, and the like that perform the functions or operations described above. The software code may be stored in a memory and executed by a processor. The memory may be located inside or outside the processor and may exchange data with the processor by various known means.
It is apparent to those skilled in the art that the present invention may be embodied in other specific forms without departing from its essential features. The foregoing detailed description should therefore not be construed as limiting in any respect but should be considered illustrative. The scope of the invention should be determined by a reasonable interpretation of the appended claims, and all changes within the equivalent scope of the invention fall within that scope.
The present invention can be applied to various fields of speech recognition technology, and it provides a method for selecting the optimal speech recognition model based on context.
Owing to this feature, a service that employs multiple speech recognition models with different strengths in different domains can produce the best recognition result when an unspecified speech input arrives.
This feature can be applied not only to speech recognition but also to other artificial intelligence services.
Claims (11)
- A method of recognizing speech, the method comprising:
  obtaining speech information from a user;
  converting the obtained speech information into speech data;
  recognizing the converted speech data with a first speech recognition model to generate a first speech recognition result;
  recognizing the converted speech data with a second speech recognition model to generate a second speech recognition result; and
  selecting a specific speech recognition result between the first speech recognition result and the second speech recognition result through a specific determination procedure.
- The method of claim 1, wherein the specific determination procedure comprises:
  extracting context information from the first speech recognition result and the second speech recognition result;
  comparing the context information with a preset first characteristic of the first speech recognition model and a preset second characteristic of the second speech recognition model, respectively; and
  selecting one of the first speech recognition result and the second speech recognition result based on the comparison result.
- The method of claim 2, wherein the context information includes at least one of a part of the speech information, information obtainable from the first speech recognition result and the second speech recognition result, and information related to the user who uttered the speech.
- The method of claim 1, wherein the first speech recognition model and the second speech recognition model are each one of a plurality of speech recognition models for recognizing the speech information obtained from the user.
- The method of claim 1, further comprising recognizing the converted speech data with the plurality of speech recognition models to generate a plurality of speech recognition results, wherein the specific speech recognition result is selected from among the first speech recognition result, the second speech recognition result, and the plurality of speech recognition results.
- The method of claim 1, wherein the specific determination procedure is a procedure for determining the speech recognition result based on a context included in context information.
- A method of recognizing speech, the method comprising:
  obtaining speech information from a user;
  converting the obtained speech information into speech data;
  recognizing the speech data with a first speech recognition model to generate a first speech recognition result;
  selecting, based on the first speech recognition result, a second speech recognition model for recognizing the speech data from among a plurality of speech recognition models; and
  recognizing the speech data with the second speech recognition model to generate a second speech recognition result.
- The method of claim 7, further comprising:
  extracting context information from the first speech recognition result; and
  comparing the context information with preset characteristics of the plurality of speech recognition models,
  wherein the second speech recognition model is selected based on the comparison result.
- The method of claim 8, wherein the first speech recognition model is a speech recognition model for extracting the context information.
- A method of recognizing speech, the method comprising:
  obtaining speech information from a user;
  converting the obtained speech information into speech data; and
  recognizing the speech data with a specific speech recognition model selected from among a plurality of speech recognition models to generate a speech recognition result.
- The method of claim 10, further comprising:
  setting context information for speech recognition; and
  selecting, from among the plurality of speech recognition models, the specific speech recognition model whose characteristics best suit the context information.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/KR2018/013280 WO2020091123A1 (en) | 2018-11-02 | 2018-11-02 | Method and device for providing context-based voice recognition service |
CN201880099155.1A CN113016029A (en) | 2018-11-02 | 2018-11-02 | Method and apparatus for providing context-based speech recognition service |
KR1020217011945A KR20210052563A (en) | 2018-11-02 | 2018-11-02 | Method and apparatus for providing context-based voice recognition service |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/KR2018/013280 WO2020091123A1 (en) | 2018-11-02 | 2018-11-02 | Method and device for providing context-based voice recognition service |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020091123A1 true WO2020091123A1 (en) | 2020-05-07 |
Family
ID=70463797
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/KR2018/013280 WO2020091123A1 (en) | 2018-11-02 | 2018-11-02 | Method and device for providing context-based voice recognition service |
Country Status (3)
Country | Link |
---|---|
KR (1) | KR20210052563A (en) |
CN (1) | CN113016029A (en) |
WO (1) | WO2020091123A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11721324B2 (en) | 2021-06-09 | 2023-08-08 | International Business Machines Corporation | Providing high quality speech recognition |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20240098282A (en) | 2022-12-20 | 2024-06-28 | 서강대학교산학협력단 | System for correcting errors of a speech recognition system and method thereof |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100612839B1 (en) * | 2004-02-18 | 2006-08-18 | 삼성전자주식회사 | Domain based dialogue speech recognition method and device |
KR20110070688A (en) * | 2009-12-18 | 2011-06-24 | 한국전자통신연구원 | Speech Recognition Apparatus and Method with Two-Stage Speech Verification Scheme for Envelope Recognition Word Computation |
KR20140005639A (en) * | 2012-07-05 | 2014-01-15 | 삼성전자주식회사 | Electronic apparatus and method for modifying voice recognition errors thereof |
KR101415534B1 (en) * | 2007-02-23 | 2014-07-07 | 삼성전자주식회사 | Multi-stage speech recognition apparatus and method |
KR20150054445A (en) * | 2013-11-12 | 2015-05-20 | 한국전자통신연구원 | Sound recognition device |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101034390A (en) * | 2006-03-10 | 2007-09-12 | 日电(中国)有限公司 | Apparatus and method for verbal model switching and self-adapting |
EP3091535B1 (en) * | 2009-12-23 | 2023-10-11 | Google LLC | Multi-modal input on an electronic device |
US9202465B2 (en) * | 2011-03-25 | 2015-12-01 | General Motors Llc | Speech recognition dependent on text message content |
US9502029B1 (en) * | 2012-06-25 | 2016-11-22 | Amazon Technologies, Inc. | Context-aware speech processing |
US9502032B2 (en) * | 2014-10-08 | 2016-11-22 | Google Inc. | Dynamically biasing language models |
CN105244027B (en) * | 2015-08-31 | 2019-10-15 | 百度在线网络技术(北京)有限公司 | Generate the method and system of homophonic text |
CN105654954A (en) * | 2016-04-06 | 2016-06-08 | 普强信息技术(北京)有限公司 | Cloud voice recognition system and method |
KR20180074210A (en) * | 2016-12-23 | 2018-07-03 | 삼성전자주식회사 | Electronic device and voice recognition method of the electronic device |
JP6532619B2 (en) * | 2017-01-18 | 2019-06-19 | 三菱電機株式会社 | Voice recognition device |
CN107945792B (en) * | 2017-11-06 | 2021-05-28 | 百度在线网络技术(北京)有限公司 | Voice processing method and device |
- 2018-11-02 CN CN201880099155.1A patent/CN113016029A/en active Pending
- 2018-11-02 KR KR1020217011945A patent/KR20210052563A/en not_active Application Discontinuation
- 2018-11-02 WO PCT/KR2018/013280 patent/WO2020091123A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN113016029A (en) | 2021-06-22 |
KR20210052563A (en) | 2021-05-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020027619A1 (en) | Method, device, and computer readable storage medium for text-to-speech synthesis using machine learning on basis of sequential prosody feature | |
Zissman et al. | Automatic language identification | |
WO2019139430A1 (en) | Text-to-speech synthesis method and apparatus using machine learning, and computer-readable storage medium | |
US6208964B1 (en) | Method and apparatus for providing unsupervised adaptation of transcriptions | |
WO2019139431A1 (en) | Speech translation method and system using multilingual text-to-speech synthesis model | |
WO2023163383A1 (en) | Multimodal-based method and apparatus for recognizing emotion in real time | |
KR100742888B1 (en) | Speech recognition method | |
WO2009145508A2 (en) | System for detecting speech interval and recognizing continuous speech in a noisy environment through real-time recognition of call commands | |
WO2019208860A1 (en) | Method for recording and outputting conversation between multiple parties using voice recognition technology, and device therefor | |
EP0549265A2 (en) | Neural network-based speech token recognition system and method | |
JPH0422276B2 (en) | ||
WO2021010617A1 (en) | Method and apparatus for detecting voice end point by using acoustic and language modeling information to accomplish strong voice recognition | |
WO2019172734A2 (en) | Data mining device, and voice recognition method and system using same | |
WO2021071110A1 (en) | Electronic apparatus and method for controlling electronic apparatus | |
WO2020091123A1 (en) | Method and device for providing context-based voice recognition service | |
WO2014200187A1 (en) | Apparatus for learning vowel reduction and method for same | |
WO2022075714A1 (en) | Speaker embedding extraction method and system using speech recognizer-based pooling technique for speaker recognition, and recording medium for same | |
JP2008052178A (en) | Speech recognition apparatus and speech recognition method | |
WO2020096078A1 (en) | Method and device for providing voice recognition service | |
WO2019208858A1 (en) | Voice recognition method and device therefor | |
WO2021071271A1 (en) | Electronic apparatus and controlling method thereof | |
WO2019208859A1 (en) | Method for generating pronunciation dictionary and apparatus therefor | |
WO2020096073A1 (en) | Method and device for generating optimal language model using big data | |
WO2019156427A1 (en) | Method for identifying utterer on basis of uttered word and apparatus therefor, and apparatus for managing voice model on basis of context and method thereof | |
EP3496092B1 (en) | Voice processing apparatus, voice processing method and program |
Legal Events

Date | Code | Title | Description
---|---|---|---
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18938770; Country of ref document: EP; Kind code of ref document: A1
 | ENP | Entry into the national phase | Ref document number: 20217011945; Country of ref document: KR; Kind code of ref document: A
 | NENP | Non-entry into the national phase | Ref country code: DE
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 18938770; Country of ref document: EP; Kind code of ref document: A1