
CN109949815B - Electronic device - Google Patents


Info

Publication number
CN109949815B
CN109949815B (application CN201910261486.7A)
Authority
CN
China
Prior art keywords
speech recognition
voice
server
result
electronic device
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910261486.7A
Other languages
Chinese (zh)
Other versions
CN109949815A (en)
Inventor
郑晳荣
金炅泰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from KR1020150038857A external-priority patent/KR102414173B1/en
Application filed by Samsung Electronics Co Ltd filed Critical Samsung Electronics Co Ltd
Priority to CN201910261486.7A priority Critical patent/CN109949815B/en
Priority claimed from CN201510162292.3A external-priority patent/CN104978965B/en
Publication of CN109949815A publication Critical patent/CN109949815A/en
Application granted granted Critical
Publication of CN109949815B publication Critical patent/CN109949815B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
        • G10 MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L 15/00 Speech recognition
                    • G10L 15/01 Assessment or evaluation of speech recognition systems
                    • G10L 15/08 Speech classification or search
                    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
                        • G10L 2015/223 Execution procedure of a spoken command
                        • G10L 2015/225 Feedback of the input speech
                    • G10L 15/28 Constructional details of speech recognition systems
                        • G10L 15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
                        • G10L 15/32 Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
                • G10L 17/00 Speaker identification or verification techniques
                    • G10L 17/22 Interactive procedures; Man-machine interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses an electronic device. The electronic device includes: a processor configured to perform automatic speech recognition (ASR) on a voice input using a speech recognition model stored in a memory; and a communication module configured to transmit the voice input to a server and receive, from the server, a voice command corresponding to the voice input. The processor is further configured to: perform an operation corresponding to the result of the automatic speech recognition if the confidence score of the result is above a first threshold, and provide feedback to the user if the confidence score of the result is below a second threshold. Various other embodiments identified from the specification are also possible.

Description

Electronic device
This application is a divisional application of Chinese patent application No. 201510162292.3, entitled 'Electronic device and method for performing voice recognition using the electronic device and a server', filed with the national intellectual property office in April 2015.
Technical Field
Various embodiments of the present invention relate to a technique for recognizing a user's voice input and executing a voice command using a voice recognition model loaded in an electronic device and a voice recognition model available in a server.
Background
In addition to conventional input using a keyboard or a mouse, recent electronic devices can also support input using a user's voice (speech). For example, an electronic device such as a smartphone or a tablet computer can analyze a user's voice entered while a specific function (e.g., S-Voice or Siri) is running, convert that voice into text, or perform an operation corresponding to the voice. In addition, in some electronic devices the voice recognition function is always on, so that the device can be woken up (awake) or unlocked, or can perform functions such as Internet search, calling, or SMS/E-mail reading, at any time according to the user's voice.
Disclosure of Invention
Although various studies and techniques associated with speech recognition are known, the ways in which an electronic device can perform speech recognition are still limited. For example, an electronic device may use a speech recognition model embedded in the device itself in order to respond quickly to a voice input. However, the storage space and processing power of the electronic device are limited, so the number or variety of recognizable voice inputs is also limited.
To obtain a more accurate and detailed result for the voice input, the electronic device may transmit the voice input to a server to request voice recognition, and then provide the result returned from the server or perform an operation based on that result. However, this method increases the communication usage of the electronic device and leads to a relatively slow response speed.
Various embodiments disclosed in this specification provide a method of performing voice recognition that uses two or more different voice recognition capabilities or voice recognition models to reduce the inefficiencies that may arise in the cases described above, and that offers the user both a fast response speed and high accuracy.
An electronic device according to various embodiments of the present invention may include: a processor that performs automatic speech recognition (ASR) on a voice input using a speech recognition model stored in a memory; and a communication module that transmits the voice input to a server and receives, from the server, a voice command corresponding to the voice input. The processor may (1) perform an operation corresponding to the result of the automatic speech recognition if the reliability of that result is at or above a first threshold, and (2) provide feedback regarding the reliability if the reliability of that result is below a second threshold.
According to various embodiments of the present invention, voice recognition is performed using a voice recognition model installed in the electronic device itself, and the voice recognition result obtained through a server is additionally used depending on that result, so that a voice recognition function with a fast response speed and high accuracy can be provided.
In addition, the results of speech recognition performed by the electronic device and by the server may be compared, and the comparison may be reflected in the speech recognition model or the speech recognition algorithm. Accordingly, accuracy and response speed can be continuously improved as voice recognition is repeatedly performed.
Drawings
Fig. 1 shows an electronic device and a server connected to the electronic device via a network according to an embodiment of the invention.
Fig. 2 shows an electronic device and a server according to another embodiment of the invention.
Fig. 3 shows a flow chart of a speech recognition performing method according to an embodiment of the invention.
Fig. 4 shows a flowchart of a voice recognition performing method according to another embodiment of the present invention.
FIG. 5 shows a flow chart of a method of updating a threshold value according to one embodiment of the invention.
FIG. 6 shows a flowchart of a method of updating a speech recognition model, according to one embodiment of the invention.
Fig. 7 illustrates an electronic device within a network environment according to one embodiment of the invention.
Fig. 8 shows a block diagram of an electronic device according to an embodiment of the invention.
Detailed Description
Various embodiments of the present invention are described below with reference to the drawings. However, it is not intended to limit the invention to the particular embodiments, but it is to be understood that the invention includes various modifications, equivalents, and/or alternatives to the embodiments. With respect to the description of the drawings, like reference numerals may be used for like components.
In this specification, the expressions "having", "may have", "include" or "may include" are used to denote the presence of relevant features (e.g. numerical values, functions, operations or constituent elements of components, etc.), which do not exclude the presence of additional features.
In this specification, the expressions "a or B", "at least one of a and/or B" or "one or more of a and/or B" etc. may include all possible combinations of items listed together. For example, "a or B", "at least one of a and B" or "at least one of a or B" may refer to: (1) a case comprising at least one a; (2) a case comprising at least one B; or (3) a case where at least one A and at least one B are both included.
The expressions "first", "second", "first" or "second", etc. used in various embodiments may modify various constituent elements irrespective of order and/or importance, and are not limited to the relevant constituent elements. For example, the first user device and the second user device may represent user devices that are different from each other regardless of order or importance. For example, a first component may be named a second component, and similarly, a second component may be named a first component without departing from the scope of the claims of the present invention.
When it is mentioned that a certain component (e.g., a first component) is connected (functionally or communicatively) to ((operatively or communicatively) connected with/to) or is connected (connected to) another component (e.g., a second component), it is to be understood that the certain component is directly connected to the other component or is connected to the other component through another component (e.g., a third component). Conversely, when a component (e.g., a first component) is referred to as being "directly connected to" or "directly connected to" another component (e.g., a second component), it is understood that no other component (e.g., a third component) is present between the component and the other component.
The expression "configured to (or provided as) (configured to)" used in the present specification may be used interchangeably with, for example, "suitable for (usable for)", "… capable (HAVING THE CAPACITY to)", "designed to (designed to) the term", "changed to (adaptedto)", "manufactured to (map to) the term" or "capable of (usable of) the term" or the like, as appropriate. The term "configured to (or arranged to)" is not limited to meaning "specially designed (SPECIFICALLY DESIGNED to)" in hardware. In some cases, the expression "a device configured as …" may mean that the device can be configured with other devices or components. For example, the sentence "a processor configured (or arranged) to perform A, B and C" may represent a dedicated processor (e.g., an embedded processor) or a general-purpose processor (e.g., a CPU or application processor (application processor)) for performing related operations, wherein the general-purpose processor may perform related operations by executing one or more software programs stored in a memory device.
The terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the scope of other embodiments. Singular forms may include plural forms unless the context clearly indicates otherwise. Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meanings as commonly understood by one of ordinary skill in the art to which this invention belongs. Commonly used terms defined in dictionaries should be interpreted as having meanings that are the same as or similar to their contextual meanings in the related art, and should not be interpreted in an idealized or overly formal sense unless expressly so defined herein. In some cases, even terms defined in this specification cannot be construed as excluding embodiments of the present invention.
In particular, in some embodiments, a greater-than relation (">") may be interchanged with a greater-than-or-equal-to relation ("≥").
Hereinafter, an electronic device according to various embodiments will be described with reference to the accompanying drawings. In this specification, a user may refer to a person using an electronic apparatus or an apparatus using an electronic apparatus (e.g., an artificial intelligence electronic apparatus).
Fig. 1 shows an electronic device and a server connected to the electronic device via a network according to an embodiment of the invention.
Referring to fig. 1, an electronic device may include the constituent elements of the user terminal 100. For example, the user terminal 100 may include a microphone 110, a controller 120, an automatic speech recognition (ASR) module 130, an automatic speech recognition model 140, a transceiver 150, a speaker 170, and a display 180. The configuration of the user terminal 100 shown in fig. 1 is exemplary and may be modified into various forms capable of realizing the various embodiments disclosed in this specification. For example, the electronic device may be configured like the user terminal 101 shown in fig. 2, the electronic device 701 shown in fig. 7, or the electronic device 801 shown in fig. 8, or may be appropriately modified using these components. Various embodiments of the present invention are described below with reference to the user terminal 100.
The user terminal 100 may obtain a voice input from a user through the microphone 110. For example, when the user runs an application associated with speech recognition, or when speech recognition is always in an active state, the user's speech may be acquired through the microphone 110. The microphone 110 may include an analog-to-digital converter (ADC) for converting an analog signal into a digital signal. However, in some embodiments, the controller 120 may include an analog-to-digital converter (ADC), a digital-to-analog converter (DAC), and various signal processing or preprocessing circuits.
The controller 120 may provide the automatic speech recognition module 130 and the transceiver 150 with the voice input acquired through the microphone 110, or with an audio signal (or voice signal) generated based on that voice input. The audio signal provided by the controller 120 to the automatic speech recognition module 130 may be a signal that has been preprocessed for speech recognition, for example a noise-filtered signal or a signal to which an equalizer suited to the human voice has been applied. In contrast, the signal provided by the controller 120 to the transceiver 150 may be the voice input itself. By transmitting the acoustic data to the transceiver 150, unlike the signal sent to the automatic speech recognition module 130, the controller 120 allows more appropriate or better-performing audio signal processing to be carried out by the server 200.
The controller 120 may control general operations of the user terminal 100. For example, the controller 120 controls voice input from a user and controls voice recognition operations, and may control the execution of voice recognition-based functions.
The automatic speech recognition module 130 may perform speech recognition on the audio signal provided by the controller 120. The automatic speech recognition module 130 may perform isolated word recognition, connected word recognition, large vocabulary recognition, and the like on the speech input (audio signal). The automatic speech recognition performed by the automatic speech recognition module 130 may be implemented in a speaker-independent manner or in a speaker-dependent manner. The automatic speech recognition module 130 need not consist of a single speech recognition engine; it may consist of two or more speech recognition engines. In addition, when the automatic speech recognition module 130 includes a plurality of speech recognition engines, each engine may serve a different recognition purpose. For example, one speech recognition engine may recognize wake-up speech for activating the automatic speech recognition function, such as "Hi, galaxy", while another speech recognition engine may recognize command speech, such as "Read a recent E-mail". The automatic speech recognition module 130 performs speech recognition based on the automatic speech recognition model 140, so the range (e.g., the category or number) of recognizable speech inputs may be determined by the automatic speech recognition model 140. The above description of the automatic speech recognition module 130 also applies to the automatic speech recognition module 230 of the server described later.
The automatic speech recognition module 130 may convert the speech input into text, and may determine the operation or function that the electronic device is to perform for the speech input. In addition, the automatic speech recognition module 130 may determine a confidence level or confidence score for the result of the automatic speech recognition.
The automatic speech recognition model 140 may include grammars. The grammars may include linguistic grammars, and may also include various forms of grammar generated statistically (through user input or collected from web pages). In various embodiments, the automatic speech recognition model 140 may include an acoustic model, a language model, and the like, or it may be a speech recognition model used for isolated word recognition. In various embodiments, the automatic speech recognition model 140 may include a recognition model that performs an appropriate level of speech recognition given the computing and storage capabilities of the user terminal 100. For example, the grammars may include a grammar for a specified command structure that is independent of the linguistic grammar; for instance, "call [user name]", a grammar for placing a call to the user [user name], may be included in the automatic speech recognition model 140.
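As a purely illustrative sketch (not part of this disclosure), the following Python fragment shows how a command grammar such as "call [user name]" could be matched against recognized text; the function names, the contact list, and the returned operation labels are hypothetical.

    import re

    # Hypothetical on-device grammar: command templates and the operation each maps to.
    GRAMMAR = [
        ("call [user name]", "PLACE_CALL"),
        ("tomorrow weather", "SHOW_WEATHER"),
    ]
    CONTACTS = ["alice", "bob"]  # assumed contact names stored on the device

    def template_to_regex(template):
        """Expand the '[user name]' slot into an alternation over known contacts."""
        slot = "(" + "|".join(re.escape(c) for c in CONTACTS) + ")"
        return slot.join(re.escape(part) for part in template.split("[user name]"))

    def match_command(recognized_text):
        """Return (operation, captured slots) for the first matching template, else None."""
        text = recognized_text.lower().strip()
        for template, operation in GRAMMAR:
            m = re.fullmatch(template_to_regex(template), text)
            if m:
                return operation, m.groups()
        return None

    print(match_command("call alice"))    # ('PLACE_CALL', ('alice',))
    print(match_command("call charlie"))  # None: not covered by the on-device grammar

A grammar of this kind keeps the on-device recognizer small, which is consistent with the limited computing and storage capabilities discussed above.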
The transceiver 150 may transmit the voice signal provided by the controller 120 to the server 200 through the network 10. Further, the result of performing the voice recognition corresponding to the transmitted voice signal may be received from the server 200.
The speaker 170 and the display 180 may be used for interaction related to the user's input. For example, when a voice input is provided by the user through the microphone 110, the result of voice recognition may be displayed on the display 180 and output through the speaker 170. Of course, the speaker 170 and the display 180 may also perform the general sound output and screen output functions of the user terminal 100, respectively.
The server 200 may include components for performing voice recognition on voice input provided by the user terminal 100 through the network 10. Accordingly, a part of the constituent elements of the server 200 may correspond to the user terminal 100. For example, the server 200 may include a transceiver 210, a controller 220, an automatic speech recognition module 230, an automatic speech recognition model 240, and the like. In addition, server 200 may also include components such as an automatic speech recognition model converter 250 or a natural language processor (NLP; natural Language Processor) 260.
The controller 220 may control the functional modules in the server 200 that perform voice recognition. For example, the controller 220 may be coupled to the automatic speech recognition module 230 and/or the natural language processor 260. Further, the controller 220 may perform functions associated with updating the recognition model in conjunction with the user terminal 100. In addition, the controller 220 may preprocess the voice signal transmitted over the network 10 and provide it to the automatic voice recognition module 230; this preprocessing may differ in method or effect from the preprocessing performed in the user terminal 100. In some embodiments, the controller 220 of the server 200 may be referred to as an "orchestrator".
The automatic speech recognition module 230 may perform speech recognition on the speech signal provided by the controller 220. At least part of the description of the automatic speech recognition module 130 is applicable to the automatic speech recognition module 230. However, although the server-side automatic speech recognition module 230 performs functions partly similar to those of the user-terminal-side automatic speech recognition module 130, the scope of its functions or algorithms may differ. The automatic speech recognition module 230 performs speech recognition based on the automatic speech recognition model 240, and may therefore produce a result different from the speech recognition result of the automatic speech recognition module 130 of the user terminal 100. Specifically, in the server 200 the recognition result is generated by the automatic speech recognition module 230 and the natural language processor 260 based on speech recognition, natural language understanding (NLU), dialog management (DM), or a combination thereof, whereas in the user terminal 100 the recognition result may be generated by the automatic speech recognition module 130. For example, as a result of the speech recognition performed by the automatic speech recognition module 130, first operation information and a first confidence level may be determined for the speech input, and as a result of the speech recognition performed by the automatic speech recognition module 230, second operation information and a second confidence level may be determined. In some embodiments, the result produced by the automatic speech recognition module 130 may be consistent with, or at least partially different from, the result produced by the automatic speech recognition module 230. For example, even when the first operation information and the second operation information correspond to each other, the second confidence level may have a higher score than the first confidence level. In various embodiments, the speech recognition (ASR) performed by the automatic speech recognition module 130 of the user terminal 100 may be defined as a first speech recognition, and the speech recognition (ASR) performed by the automatic speech recognition module 230 of the server 200 may be defined as a second speech recognition.
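For illustration only, one possible way to represent the result that each of the two recognitions returns (operation information together with a confidence level) is sketched below in Python; the class and field names are hypothetical and not taken from this disclosure.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class RecognitionResult:
        """Result of one recognition pass (on-device ASR1 or server-side ASR2)."""
        operation: str               # operation information, e.g. "show tomorrow's weather"
        confidence: float            # confidence level in the range [0.0, 1.0]
        text: Optional[str] = None   # recognized text, if available

    def results_correspond(first: RecognitionResult, second: RecognitionResult) -> bool:
        """Two results correspond when they would trigger the same operation,
        even if the recognized text or the confidence differs."""
        return first.operation == second.operation

The later figures (fig. 5 and fig. 6) compare such first and second results in order to update the threshold or the recognition model.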
In various embodiments, if the algorithm of the first speech recognition performed in the automatic speech recognition module 130 differs from the algorithm of the second speech recognition performed in the automatic speech recognition module 230, or if the models used for speech recognition differ, the server 200 may include an automatic speech recognition model converter 250 for converting between the two models.
Further, the server 200 may include a natural language processor 260 for understanding the user's intent and determining a function to be performed based on the result recognized by the automatic voice recognition module 230. The natural language processor 260 may perform natural language analysis, which mechanically analyzes the linguistic phenomena of human speech so as to put them into a form a computer can understand, and natural language processing, which renders the computer-understandable form back into a language a human can understand.
Fig. 2 shows an electronic device and a server according to another embodiment of the invention.
Fig. 2 shows an example of an electronic device implemented differently from fig. 1. The voice recognition method disclosed in this specification may, however, also be performed by variously modified forms of device, in addition to the electronic devices/user terminals of fig. 1, fig. 2, or fig. 7 and fig. 8 described later.
Referring to fig. 2, the user terminal 101 may include a processor 121 and a memory 141. Processor 121 may include an automatic speech recognition engine 131 for performing speech recognition. The memory 141 may store an automatic speech recognition model 143 that is used by the automatic speech recognition engine 131 to perform speech recognition. For example, the processor 121, the automatic speech recognition engine 131, and the automatic speech recognition model 143 (or the memory 141) of fig. 2 may be understood to correspond to the controller 120, the automatic speech recognition module 130, and the automatic speech recognition model 140 of fig. 1, respectively, with respect to the functions performed by the respective constituent elements. Hereinafter, description of corresponding or repeated contents will be omitted.
The user terminal 101 may obtain a voice input from a user using a voice acquisition module 111 (e.g., the microphone 110). The processor 121 may perform automatic speech recognition on the acquired voice input using the automatic speech recognition model 143 stored in the memory 141. Further, the user terminal 101 may provide the voice input to the server 200 through the communication module 151 and receive a voice command (e.g., second operation information) corresponding to the voice input from the server 200. The user terminal 101 may output the voice recognition results obtained by the automatic speech recognition engine 131 and by the server 200 using the display 181 (or a speaker).
Hereinafter, various speech recognition methods will be described with reference to fig. 3 to 6 with reference to the user terminal 100.
Fig. 3 shows a flow chart of a speech recognition performing method according to an embodiment of the invention.
In operation 301, the user terminal 100 may acquire a voice input of a user using, for example, a voice acquisition module such as a microphone. This operation may be performed while a predetermined function or application associated with speech recognition is being run by the user. However, in some embodiments, the voice recognition of the user terminal 100 may always be in an operating state (always-on) (e.g., the microphone is always active), in which case operation 301 may be performed whenever the user speaks. Alternatively, as described above, using different speech recognition engines, automatic speech recognition may be activated by a predetermined speech input (for example, "Hi, galaxy"), and automatic speech recognition may then be performed on the subsequent input.
In operation 303, the user terminal 100 may transmit the voice signal (or at least a portion of it) to the server 200. Within the device, the voice signal (or an audio signal obtained by converting the voice input into a (digital) voice signal and preprocessing it) may be provided to the automatic speech recognition module 130 by a processor (e.g., the controller 120). In other words, in operation 303, the user terminal 100 may provide the voice signal to be recognized both to the automatic voice recognition module inside the device and to the one outside the device that can perform voice recognition. The user terminal 100 can thus use its own voice recognition together with voice recognition through the server 200.
In operation 305, the user terminal 100 may perform its own speech recognition, which may be defined as ASR1. For example, the automatic speech recognition module 130 may use the automatic speech recognition model 140 to perform speech recognition on the speech input; that is, ASR1 may be performed on at least a portion of the speech signal. As a result of performing ASR1, a recognition result for the speech input may be obtained. For example, if the user provides a voice input such as "tomorrow weather", the user terminal 100 may use the voice recognition function to determine operation information for the voice input such as "execute the weather application and output tomorrow's weather". In addition to the operation information, the result of the speech recognition may also include a reliability for that operation information. For example, the automatic speech recognition module 130 may assign a confidence of 95% when analysis of the user's speech clearly determines "tomorrow weather", whereas it may assign a confidence of 60% to the determined operation information when the analysis cannot clearly decide between, for example, "daily weather" and "tomorrow weather".
In operation 307, the processor may determine whether the confidence level is at or above a specified threshold. For example, when the reliability of the operation information determined by the automatic speech recognition module 130 is at or above a specified level (e.g., 80%), the user terminal 100 may, in operation 309, perform the operation corresponding to the voice command recognized through ASR1, that is, through the voice recognition function of the user terminal 100 itself. The operation may include, for example, at least one of: executing at least one function executable by the processor, executing at least one application, or providing an input based on the result of the automatic speech recognition.
Operation 309 may be performed prior to obtaining the results of the speech recognition from server 200 (e.g., operation 315). In other words, if the result of performing the voice recognition by itself in the user terminal 100 is to recognize a voice command having a sufficient degree of reliability, the user terminal directly performs a related operation without waiting for an additional voice recognition result acquired from the server 200, so that a rapid response speed to the user's voice input can be ensured.
If the reliability is below the threshold in operation 307, the user terminal 100 may wait until the voice recognition result is obtained from the server 200 in operation 315. While waiting, the user terminal 100 may display an appropriate message, icon, image, or the like to indicate that voice recognition of the voice input is in progress.
In operation 311, voice recognition by the server may be performed on the voice signal transmitted to the server 200 in operation 303. This speech recognition may be defined as ASR2 (second automatic speech recognition). Further, natural language processing (NLP) may be performed in operation 313, for example on the speech input or on the recognition result of ASR2 by means of the natural language processor 260 of the server 200. In some embodiments, this processing may be performed selectively.
If the result of ASR2, or of ASR2 followed by NLP (e.g., the second operation information and the second confidence level), is obtained from the server 200 in operation 315, an operation corresponding to the speech command (e.g., the second operation information) recognized through ASR2 may be performed in operation 317. Operation 317 requires, in addition to the time taken by the user terminal's own voice recognition, the time taken to transmit the voice signal in operation 303 and to acquire the voice recognition result in operation 315, so its response time may be longer than that of an operation performed through operation 309. However, through operation 317, an operation with relatively high reliability and accuracy may be performed for speech that the user terminal cannot process by itself, or can process only with low reliability.
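The flow of fig. 3 can be summarized by the following Python sketch. It is illustrative only: run_asr1, recognize_async, execute, show_progress, and the 0.8 threshold are placeholder names and values standing in for the device- and server-side behavior described above.

    FIRST_THRESHOLD = 0.8  # assumed confidence level corresponding to operation 307

    def handle_voice_input(voice_signal, device, server):
        """Fig. 3 sketch: act on ASR1 if it is confident enough, otherwise wait for ASR2."""
        server_request = server.recognize_async(voice_signal)    # operation 303 (ASR2, then NLP)
        local = device.run_asr1(voice_signal)                     # operation 305

        if local.confidence >= FIRST_THRESHOLD:                   # operation 307
            device.execute(local.operation)                        # operation 309, no waiting
            return local

        device.show_progress("Recognizing...")                     # indicate recognition in progress
        remote = server_request.result()                           # operation 315 (blocks until reply)
        device.execute(remote.operation)                           # operation 317
        return remote

Because the server request is issued before ASR1 completes, the extra latency of operation 317 is only incurred when the local confidence is insufficient.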
Fig. 4 shows a flowchart of a voice recognition performing method according to another embodiment of the present invention.
Referring to fig. 4, the speech acquisition operation 401, the speech signal transmission operation 403, the ASR1 operation 405, the ASR2 operation 415, and the natural language processing operation 417 correspond to the operations 301, 303, 305, 311, and 313 described with reference to fig. 3, respectively, and thus a description thereof is omitted.
The voice recognition method described with reference to fig. 4 is performed based on two thresholds: a first threshold, and a second threshold corresponding to a reliability lower than the first threshold. Depending on the reliability of the result of ASR1 in operation 405, different operations (e.g., operations 409, 413, and 421, respectively) may be performed in three cases: (1) the reliability is at or above the first threshold; (2) the reliability is below the second threshold; and (3) the reliability lies between the first threshold (exclusive) and the second threshold (inclusive). Whether each boundary is inclusive or exclusive may be combined in various ways.
When the reliability is above the first threshold in operation 407, the user terminal 100 may perform an operation corresponding to the execution result of the ASR1 in operation 409. If the confidence level is less than the first threshold in operation 407, then a determination may be made in operation 411 as to whether the confidence level is less than the second threshold.
If the reliability is determined in operation 411 to be below the second threshold, the user terminal 100 may provide feedback regarding the reliability. The feedback may include outputting a message or audio indicating that the user's voice input was not properly recognized by the electronic device, or that the recognition result cannot be trusted even if something was recognized. For example, the user terminal 100 may display, or output through the speaker, a guide message such as "The voice was not recognized. Please say it again." Alternatively, the user terminal 100 may use feedback such as "Did you say 'XXX'?" to guide the user toward relatively easily recognizable voice inputs (e.g., "yes", "no", "not", "how to", "not at all", etc.), so that the accuracy of a recognition result with low confidence can be confirmed.
If feedback is provided in operation 413, operation 421 may not be performed even if the voice recognition result is acquired in operation 419 after some time has elapsed. This is because the feedback may prompt a new voice input from the user, in which case it is not appropriate to perform an operation for a voice input that occurred earlier. However, in some embodiments, even after the feedback of operation 413, if no additional input occurs from the user for a predetermined time and the speech recognition result (e.g., the second operation information and the second confidence level) received from the server 200 in operation 419 satisfies a specified condition (e.g., the second confidence level is at or above the first threshold, or above some third threshold), operation 421 may be performed after operation 413.
If, in operation 411, the confidence level obtained in operation 405 is at or above the second threshold (in other words, the confidence level lies between the first threshold (exclusive) and the second threshold (inclusive)), the user terminal 100 may obtain the voice recognition result from the server 200 in operation 419. In operation 421, the user terminal 100 may perform an operation corresponding to the voice command (the second operation information) recognized through ASR2.
In the embodiment of fig. 4, the reliability of the voice recognition result produced by the user terminal 100 is divided into a usable level, an unusable level, and a level that can be used by referring to the automatic voice recognition result of the server 200, so that an appropriate operation can be performed according to the reliability. In particular, when the reliability is too low, the user terminal 100 provides feedback to guide the user to re-enter the input, regardless of whether a result is received from the server 200; this prevents a situation in which a message such as "not recognized" is presented to the user only after the response waiting time has elapsed.
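A minimal Python sketch of the fig. 4 flow is given below. The device and server objects, their methods, and the two threshold values are illustrative placeholders, not values taken from this disclosure.

    FIRST_THRESHOLD = 0.8   # assumed value for operation 407
    SECOND_THRESHOLD = 0.4  # assumed value for operation 411

    def handle_voice_input(voice_signal, device, server):
        """Fig. 4 sketch: three reliability bands lead to three different actions."""
        server_request = server.recognize_async(voice_signal)     # operation 403
        local = device.run_asr1(voice_signal)                      # operation 405

        if local.confidence >= FIRST_THRESHOLD:                    # operation 407
            device.execute(local.operation)                         # operation 409
        elif local.confidence < SECOND_THRESHOLD:                   # operation 411
            # Very low reliability: ask the user to speak again instead of waiting;
            # any later server reply for this earlier input is simply ignored.
            device.show_feedback("The voice was not recognized. Please say it again.")  # operation 413
        else:
            # Middle band: defer to the server result (ASR2, optionally followed by NLP).
            remote = server_request.result()                        # operation 419
            device.execute(remote.operation)                         # operation 421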
FIG. 5 shows a flow chart of a method of updating a threshold value according to one embodiment of the invention.
Referring to fig. 5, the speech acquisition operation 501, the speech signal transmission operation 503, the ASR1 operation 505, the ASR2 operation 511, and the natural language processing operation 513 correspond to the operations 301, 303, 305, 311, and 313 described previously in fig. 3, respectively, and thus the description thereof is omitted.
In operation 507, if the reliability of the execution result for the ASR1 is above a threshold (e.g., a first threshold), operation 509 may be performed to perform an operation corresponding to the voice command (e.g., first operation information) recognized by the ASR 1. If the confidence level of the execution result of ASR1 is less than the threshold in operation 507, the process after operation 315 of FIG. 3 or the process after operation 411 of FIG. 4 may be performed.
In the embodiment of fig. 5, the flow does not terminate after operation 509 is performed; operations 515 through 517 may then be performed. In operation 515, the user terminal 100 may acquire the voice recognition result from the server 200. For example, the user terminal 100 may acquire the second operation information and the second confidence level as the result of ASR2 performed on the voice signal transmitted in operation 503.
In operation 517, the user terminal 100 may compare the recognition results of ASR1 and ASR2. For example, the user terminal may determine whether the recognition results of ASR1 and ASR2 are identical (or correspond) to each other, or differ from each other. For example, when the recognition result of ASR1 is speech such as "tomorrow weather" and the recognition result of ASR2 is speech such as "tomorrow weather?", the operation information in both cases may include "execute the weather application and output tomorrow's weather". In this case, the recognition results of ASR1 and ASR2 may be understood to correspond to each other. However, if the speech recognition results would lead to mutually different operations, the two (or more) speech recognition results may be judged not to correspond to each other.
In operation 519, the user terminal 100 may compare its own automatic voice recognition result with the voice command received from the server and change the threshold. For example, the user terminal 100 may lower the first threshold when the first operation information and the second operation information include voice commands that are identical to or correspond to each other. For example, whereas previously the user terminal 100 would adopt its own voice recognition result for a given voice input, without waiting for the response of the server 200, only when a reliability of 80% or more was obtained, after the threshold is updated its own result can be used even when a reliability of only 70% or more is obtained. The threshold may be updated repeatedly each time the user uses the voice recognition function; as a result, a lower threshold is set for voice inputs the user frequently uses, which yields a faster response speed.
However, if the execution results of ASR1 and ASR2 differ from each other, the threshold may be increased. In some embodiments, the threshold update may occur only after a specified condition has accumulated a specified number of times. For example, for a certain speech input, the threshold may be updated (lowered) when the execution result of ASR1 has coincided with the execution result of ASR2 more than 5 times.
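As an illustration of operations 517 and 519, the following Python sketch adjusts the first threshold based on how often the ASR1 and ASR2 results agree; the step size, the match count of 5, and the state object are assumptions made for the example.

    THRESHOLD_STEP = 0.05      # assumed adjustment step
    MATCHES_BEFORE_UPDATE = 5  # matches the "more than 5 times" example above

    def update_threshold(state, local, remote):
        """Fig. 5 sketch: compare ASR1 and ASR2 results and adjust the first threshold."""
        if local.operation == remote.operation:                  # operation 517: results correspond
            state.match_count += 1
            if state.match_count > MATCHES_BEFORE_UPDATE:         # accumulated condition
                # Frequent agreement: trust on-device recognition at a lower confidence.
                state.first_threshold = max(0.0, state.first_threshold - THRESHOLD_STEP)
                state.match_count = 0
        else:
            # Disagreement: require higher confidence before acting on ASR1 alone.
            state.first_threshold = min(1.0, state.first_threshold + THRESHOLD_STEP)
            state.match_count = 0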
FIG. 6 shows a flowchart of a method of updating a speech recognition model, according to one embodiment of the invention.
Referring to fig. 6, the speech acquisition operation 601, the speech signal transmission operation 603, the ASR1 operation 605, the ASR2 operation 611, and the natural language processing operation 613 correspond to the operations 301, 303, 305, 311, and 313 described earlier with reference to fig. 3, respectively, and thus the description thereof is omitted.
In operation 607, when the reliability of the execution result of the ASR1 is above a threshold (e.g., a first threshold), operations following operation 309 of fig. 3, operation 409 of fig. 4, and operation 509 of fig. 5 may be performed.
When the reliability of the execution result of ASR1 is below the threshold in operation 607, the user terminal 100 acquires the voice recognition result from the server 200 in operation 609, and may perform an operation corresponding to the voice command recognized through ASR2 in operation 615. Operations 609 and 615 may correspond to operations 315 and 317 of fig. 3, or to operations 419 and 421 of fig. 4.
In operation 617, the user terminal 100 may compare the speech recognition results of the ASR1 and the ASR 2. Operation 617 may correspond to operation 517 of fig. 5.
In operation 619, the user terminal 100 may update its voice recognition model (e.g., the automatic voice recognition model 140) based on the comparison result of operation 617. For example, the user terminal 100 may add the speech recognition result of ASR2 for the speech input (e.g., the second operation information, or the second operation information together with the second confidence level) to the speech recognition model. For example, when the first confidence level is below the first threshold, the second operation information and the second confidence level for the voice input may be added to the voice recognition model used for the first voice recognition. As another example, when the first operation information and the second operation information do not correspond to each other, the user terminal 100 may add the second operation information (and the second confidence level) to the voice recognition model used for the first voice recognition based on the first and second confidence levels (e.g., when the second confidence level is higher than the first). As in the embodiment of fig. 5, the model update may occur only after a specified condition has accumulated a specified number of times.
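The model update of operation 619 can likewise be sketched in Python as follows; the add_entry method and its arguments are hypothetical stand-ins for however the automatic voice recognition model 140 is actually stored.

    def update_recognition_model(model, voice_input, local, remote, first_threshold):
        """Fig. 6 sketch: fold the server result (ASR2) back into the on-device model."""
        if local.confidence < first_threshold:
            # The device could not handle this input confidently: remember the server's answer.
            model.add_entry(voice_input, remote.operation, remote.confidence)
        elif local.operation != remote.operation and remote.confidence > local.confidence:
            # Results disagree and the server is more confident: prefer its operation information.
            model.add_entry(voice_input, remote.operation, remote.confidence)

Repeated updates of this kind are what allow the accuracy and response speed to improve as voice recognition is performed repeatedly, as noted in the summary above.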
Fig. 7 illustrates an electronic device within a network environment according to one embodiment of the invention.
Referring to fig. 7, an electronic device 701 within a network environment 700 is illustrated in various embodiments. The electronic device 701 may include a bus 710, a processor 720, a memory 730, an input-output interface 750, a display 760, and a communication interface 770. In some embodiments, the electronic device 701 may omit at least one of the above-described components, or may be additionally equipped with other components.
Bus 710 may include, for example, circuitry that interconnects the components 710 to 770 and conveys communications (e.g., control messages and/or data) between them.
Processor 720 may include one or more of a Central Processing Unit (CPU), an application processor (AP; application processor), or a communication processor (CP; communication processor). Processor 720 may perform operations or data processing, such as control and/or communication with respect to at least one other component of electronic device 701.
Memory 730 may include volatile and/or nonvolatile memory. Memory 730 may store, for example, commands or data associated with at least one other component of electronic device 701. According to one embodiment, memory 730 may store software and/or program 740. The programs 740 may include, for example, a kernel (kernel) 741, middleware (middleware) 743, application programming interfaces (APIs; application programming interface) 745, and/or application programs (or "applications") 747, etc. At least a portion of kernel 741, middleware 743, or application programming interface 745 can be referred to as an Operating System (OS).
The kernel 741 may control or manage system resources (e.g., the bus 710, the processor 720, the memory 730, etc.) used to execute operations or functions implemented in other programs (e.g., the middleware 743, the application programming interface 745, or the application 747). Further, the kernel 741 may provide an interface through which the middleware 743, the application programming interface 745, or the application 747 can access the individual components of the electronic device 701 to control or manage system resources.
The middleware 743 may perform, for example, a relay function for causing the application programming interface 745 or the application 747 to communicate with the kernel 741 to transmit and receive data.
Further, middleware 743 may prioritize one or more job requests received from applications 747. For example, the middleware 743 may prioritize at least one of the applications 747 to be able to use system resources (e.g., bus 710, processor 720, memory 730, etc.) of the electronic device 701. For example, the middleware 743 processes the one or more job requests according to the order of priority given to the at least one application program, so that scheduling (scheduling) or load balancing (load balancing) for the one or more job requests may be performed.
The application programming interface 745, for example, as an interface for the application 747 to control the functions provided by the kernel 741 or the middleware 743, may include at least one interface or function (e.g., command) for file control, window control, image processing, or text control, for example.
The input/output interface 750 may function as an interface capable of transferring a command or data input from a user or other external device to other components of the electronic apparatus 701, for example. The input/output interface 750 may output commands and data received from other components of the electronic device 701 to a user or other external apparatuses.
The display 760 may include, for example, a liquid crystal display (LCD), a light emitting diode (LED) display, an organic light emitting diode (OLED) display, a microelectromechanical systems (MEMS) display, or an electronic paper display. The display 760 may, for example, display various content (e.g., text, images, video, icons, symbols, etc.) to the user. The display 760 may include a touch screen and may, for example, receive touch, gesture, proximity, or hovering input from an electronic pen or a part of the user's body.
The communication interface 770 may set up, for example, communication between the electronic device 701 and an external device (e.g., the first external electronic device 702, the second external electronic device 704, or the server 706). For example, the communication interface 770 may be coupled to the network 762 via wireless communication or wired communication, thereby allowing communication with the external device (e.g., the second external electronic device 704 or the server 706).
The wireless communication may use, as a cellular communication protocol, for example, at least one of LTE, LTE-A, CDMA, WCDMA, UMTS, WiBro, or GSM. Further, the wireless communication may include, for example, short-range communication 764. The short-range communication 764 may include, for example, at least one of Wi-Fi, Bluetooth, near field communication (NFC), or global positioning system (GPS). The wired communication may include, for example, at least one of universal serial bus (USB), high definition multimedia interface (HDMI), recommended standard 232 (RS-232), or plain old telephone service (POTS). The network 762 may include a telecommunications network, for example at least one of a computer network (e.g., LAN or WAN), the Internet, or a telephone network.
The first external electronic device 702 and the second external electronic device 704 may each be a device of the same type as, or of a different type from, the electronic device 701. According to one embodiment, the server 706 may include a group of one or more servers. According to various embodiments, all or a part of the operations performed in the electronic device 701 may be performed in one or more other electronic devices (e.g., the electronic devices 702 and 704, or the server 706). According to one embodiment, when the electronic device 701 needs to perform a certain function or service, automatically or upon request, the electronic device 701 may, instead of or in addition to executing the function or service itself, request at least a part of the associated functions from other devices (e.g., the electronic devices 702 and 704, or the server 706). The other electronic devices (e.g., the electronic devices 702 and 704, or the server 706) may perform the requested function or an additional function and transmit the result to the electronic device 701. The electronic device 701 may provide the requested function or service by using the received result as is or after additional processing. To this end, cloud computing, distributed computing, or client-server computing techniques, for example, may be used.
Fig. 8 shows a block diagram of an electronic device according to an embodiment of the invention.
Referring to fig. 8, the electronic device 801 may include, for example, all or a part of the electronic device 701 shown in fig. 7. The electronic device 801 may include one or more processors (e.g., an application processor (AP)) 810, a communication module 820, a subscriber identity module 824, a memory 830, a sensor module 840, an input device 850, a display 860, an interface 870, an audio module 880, a camera module 891, a power management module 895, a battery 896, an indicator 897, and a motor 898.
The processor 810 may control a plurality of hardware or software components connected to the processor 810 by driving, for example, an operating system or an application program, and may perform various kinds of data processing and operations. The processor 810 may be implemented, for example, as a system on chip (SoC). According to one embodiment, the processor 810 may further include a graphics processing unit (GPU) and/or an image signal processor. The processor 810 may also include at least some of the components shown in fig. 8 (e.g., the cellular module 821). The processor 810 may load a command or data received from at least one of the other components (e.g., a nonvolatile memory) into a volatile memory, process it, and store the resulting data in the nonvolatile memory.
The communication module 820 may have the same or similar construction as the communication interface 770 of fig. 7. The communication module 820 may include, for example, a cellular module 821, a Wi-Fi module 823, a bluetooth module 825, a GPS module 827, a near field communication module 828, and a Radio Frequency (RF) module 829.
The cellular module 821 may provide, for example, voice calls, video calls, text services, or Internet services through a communication network. According to one embodiment, the cellular module 821 may use the subscriber identity module (e.g., a SIM card) 824 to identify and authenticate the electronic device 801 within the communication network. According to one embodiment, the cellular module 821 may perform at least some of the functions that the processor 810 can provide. According to one embodiment, the cellular module 821 may include a communication processor (CP).
The Wi-Fi module 823, the Bluetooth module 825, the GPS module 827, and the near field communication module 828 may each include, for example, a processor for processing the data transmitted and received through the corresponding module. According to certain embodiments, at least some (e.g., two or more) of the cellular module 821, the Wi-Fi module 823, the Bluetooth module 825, the GPS module 827, or the near field communication module 828 may be included in one integrated chip (IC) or IC package.
The radio frequency module 829 may transmit and receive, for example, a communication signal (e.g., a radio frequency signal). The radio frequency module 829 may include, for example, a transceiver, a power amplifier module (PAM), a frequency filter, a low noise amplifier (LNA), an antenna, or the like. According to another embodiment, at least one of the cellular module 821, the Wi-Fi module 823, the bluetooth module 825, the GPS module 827, or the near field communication module 828 may transmit and receive radio frequency signals through a separate dedicated radio frequency module.
The subscriber identity module 824 may include, for example, a card containing the subscriber identity module and/or an embedded SIM, and may include unique identification information (e.g., an integrated circuit card identifier (ICCID)) or subscriber information (e.g., an international mobile subscriber identity (IMSI)).
The memory 830 (e.g., the memory 730) may include, for example, an internal memory 832 or an external memory 834. The internal memory 832 may include at least one of, for example, a volatile memory (e.g., a dynamic random access memory (DRAM), a static random access memory (SRAM), or a synchronous dynamic random access memory (SDRAM), etc.), a non-volatile memory (e.g., a one-time programmable read-only memory (OTPROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), etc.), a mask ROM, a flash memory (e.g., a NAND flash memory or a NOR flash memory, etc.), a hard disk drive, or a solid state drive (SSD).
The external memory 834 may include a flash drive, and may further include, for example, a CF (compact flash) card, an SD (secure digital) card, a micro SD card, a mini SD card, an XD (extreme digital) card, a MultiMedia Card (MMC), a memory stick, or the like. The external memory 834 can be functionally and/or physically connected to the electronic device 801 through a variety of interfaces.
The sensor module 840 may, for example, measure a physical quantity or sense an operating state of the electronic device 801, and convert the measured or sensed information into an electrical signal. The sensor module 840 may include, for example, at least one of a gesture sensor 840A, a gyroscope sensor 840B, a barometric pressure sensor 840C, a magnetic sensor 840D, an acceleration sensor 840E, a grip sensor 840F, a proximity sensor 840G, a color sensor 840H (e.g., an RGB sensor), a biometric sensor 840I, a temperature/humidity sensor 840J, an illuminance sensor 840K, or an ultraviolet (UV) sensor 840M. Additionally or alternatively, the sensor module 840 may also include, for example, an olfactory sensor (E-nose), an electromyography (EMG) sensor, an electroencephalogram (EEG) sensor, an electrocardiogram (ECG) sensor, an infrared (IR) sensor, an iris sensor, and/or a fingerprint sensor. The sensor module 840 may also include a control circuit for controlling at least one sensor contained therein. In some embodiments, the electronic device 801 may further include a processor, configured as a part of the processor 810 or separately, to control the sensor module 840, so that the sensor module 840 can be controlled while the processor 810 is in a sleep state.
The input device 850 may include, for example, a touch panel 852, a (digital) pen sensor 854, a key 856, or an ultrasonic input device 858. The touch panel 852 may use, for example, at least one of a capacitive type, a resistive type, an infrared type, or an ultrasonic type. In addition, the touch panel 852 may further include a control circuit. The touch panel 852 may also include a tactile layer so that a tactile response can be provided to the user.
The digital pen sensor 854 may, for example, be a part of the touch panel or may include a separate sheet for recognition. The key 856 may include, for example, a physical button, an optical key, or a keypad. The ultrasonic input device 858 may sense, through a microphone (e.g., the microphone 888), ultrasonic waves generated from an input tool, and confirm data corresponding to the sensed ultrasonic waves.
The display 860 (e.g., the display 760) may include a panel 862, a hologram device 864, or a projector 866. The panel 862 may have the same or similar construction as the display 760 of fig. 7. The panel 862 may be implemented, for example, to be flexible, transparent, or wearable. The panel 862 and the touch panel 852 may be constructed as one module. The hologram device 864 can display a stereoscopic image in the air by using interference of light. The projector 866 may project light onto a screen to display an image. The screen may be located, for example, inside or outside of the electronic device 801. According to one embodiment, the display 860 may further include a control circuit for controlling the panel 862, the hologram device 864, or the projector 866.
The interface 870 may include, for example, an HDMI interface 872, a USB interface 874, an optical interface 876, or a D-sub interface 878. The interface 870 may be included in, for example, the communication interface 770 shown in fig. 7. Additionally or alternatively, the interface 870 may include, for example, a mobile high-definition link (MHL) interface, an SD card/multimedia card (MMC) interface, or an infrared data association (IrDA) specification interface.
The audio module 880 may convert, for example, sound and electrical signals bidirectionally. At least a portion of the constituent elements of the audio module 880 may be included in, for example, the input-output interface 750 shown in fig. 7. The audio module 880 may process sound information input or output through, for example, a speaker 882, a receiver 884, headphones 886, or a microphone 888.
The camera module 891 is, for example, a device capable of capturing still images and video, and according to one embodiment may include one or more image sensors (e.g., a front sensor or a rear sensor), a lens, an image signal processor (ISP), or a flash (e.g., an LED or a xenon lamp).
The power management module 895 may manage, for example, the power of the electronic device 801. According to one embodiment, the power management module 895 may include a power management integrated circuit (PMIC), a charger integrated circuit, and a battery or fuel gauge. The power management integrated circuit may have a wired and/or wireless charging scheme. The wireless charging scheme includes, for example, a magnetic resonance scheme, a magnetic induction scheme, an electromagnetic wave scheme, or the like, and an additional circuit for wireless charging, such as a coil loop, a resonant circuit, or a rectifier, may further be included. The fuel gauge may measure, for example, the remaining charge of the battery 896, and the voltage, current, or temperature during charging. The battery 896 may include, for example, a rechargeable battery and/or a solar battery.
The indicator 897 may display a particular state of the electronic device 801 or a portion thereof (e.g., the processor 810), such as a booting state, a message state, or a charging state. The motor 898 may convert an electrical signal into mechanical vibration, and may generate a vibration effect, a haptic effect, or the like. Although not shown, the electronic device 801 may include a processing device (e.g., a GPU) for supporting mobile television (TV). The processing device for supporting mobile TV may process media data based on specifications such as digital multimedia broadcasting (DMB), digital video broadcasting (DVB), or MediaFLO™.
Each component described in this specification may be formed of one or more parts, and the name of the relevant component may differ depending on the type of electronic device. In various embodiments, the electronic device may include at least one of the components described in the present specification, and a part of the components may be omitted or additional components may be included. In addition, some of the constituent elements of the electronic device according to various embodiments may be combined to constitute a single entity, so that the functions of the relevant constituent elements before the combination can be performed in the same manner.
The term "module" as used in this specification may refer to a unit including, for example, one of hardware, software, or firmware, or a combination of two or more thereof. "Module" may be used interchangeably with terms such as unit, logic, logic block, component, or circuit. A "module" may be a minimum unit of an integrally formed part, or a portion thereof. A "module" may also be a minimum unit for performing one or more functions, or a portion thereof. A "module" may be implemented mechanically or electronically. For example, a "module" may include at least one of an application-specific integrated circuit (ASIC) chip, a field-programmable gate array (FPGA), or a programmable logic device for performing certain operations, known or to be developed in the future.
At least a portion of an apparatus (e.g., a module or a function thereof) or a method (e.g., an operation) according to various embodiments may be implemented by, for example, a command stored in a computer-readable storage medium in the form of a program module.
For example, the storage medium may store instructions that, when executed, cause a processor of the electronic device to perform the following operations: an operation of obtaining a voice input from a user to generate a voice signal; an operation of performing a first speech recognition on at least a portion of the voice signal to obtain first operation information and a first confidence score; an operation of transmitting at least a portion of the voice signal to a server so that a second speech recognition is performed; an operation of receiving, from the server, second operation information for the transmitted signal; (1) an operation of executing a function corresponding to the first operation information when the first confidence score is greater than or equal to a first threshold; (2) an operation of providing feedback when the first confidence score is less than a second threshold; and (3) an operation of executing a function corresponding to the second operation information when the first confidence score is greater than or equal to the second threshold and less than the first threshold.
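The operations listed above amount to a three-way routing rule keyed on the first confidence score. The following minimal, self-contained Python sketch illustrates one way such routing between the on-device result and the server result could look; the threshold values, stub recognizers, and helper names are all assumptions made for illustration and are not part of the disclosed embodiments or any real device API.

"""Illustrative sketch of the confidence-threshold routing (assumptions only)."""
from collections import namedtuple
from concurrent.futures import ThreadPoolExecutor

AsrResult = namedtuple("AsrResult", ["operation", "confidence"])

FIRST_THRESHOLD = 0.8   # at or above: act on the on-device result immediately
SECOND_THRESHOLD = 0.4  # below: recognition failed, ask the user to retry

pool = ThreadPoolExecutor(max_workers=1)


def local_asr(voice_signal):
    """Stub for the first (on-device) speech recognition."""
    return AsrResult(operation="open_camera", confidence=0.65)


def server_asr(voice_signal):
    """Stub for the second (server-side) speech recognition."""
    return AsrResult(operation="open_camera_app", confidence=0.93)


def handle_voice_signal(voice_signal):
    # Kick off the server request in parallel with the local recognition.
    server_future = pool.submit(server_asr, voice_signal)
    first = local_asr(voice_signal)

    if first.confidence >= FIRST_THRESHOLD:
        # (1) High confidence: execute the local result without waiting.
        return ("execute", first.operation)
    if first.confidence < SECOND_THRESHOLD:
        # (2) Low confidence: provide feedback instead of acting.
        return ("feedback", "speech not recognized, please try again")
    # (3) In between: wait for and execute the server's result.
    second = server_future.result()
    return ("execute", second.operation)


if __name__ == "__main__":
    print(handle_voice_signal(b"pcm-bytes"))  # -> ('execute', 'open_camera_app')
    pool.shutdown()

With the stub confidence of 0.65 the sketch falls into case (3), so the server result is used; raising it to, say, 0.85 would trigger case (1) and the server reply would simply be ignored.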
A module or program module according to various embodiments may include at least one of the aforementioned constituent elements, may have a portion thereof omitted, or may further include additional other constituent elements. Operations performed by a module, a program module, or other constituent elements according to various embodiments may be performed by a sequential, parallel, repetitive, or heuristic method. Also, some operations may be performed in a different order or omitted, or other operations may be added.
In addition, the embodiments disclosed in the present specification are presented for the purpose of illustration and understanding of the disclosed technical content, and are not intended to limit the scope of the invention. Accordingly, the scope of the present invention should be construed as including all other embodiments that are modified or varied based on the technical idea of the present invention.

Claims (8)

1. An electronic device, comprising:
a processor configured to perform a first automatic speech recognition for a speech input by utilizing a speech recognition model stored in a memory; and
A communication module configured to: transmit information related to the voice input to a server, and receive, from the server, a voice command corresponding to the voice input, the voice command being obtained by performing a second automatic speech recognition on the voice input in the server,
Wherein the first automatic speech recognition and the second automatic speech recognition are performed in parallel,
Wherein the processor is further configured to:
directly perform an operation corresponding to a result of the first automatic speech recognition, without waiting for the voice command received from the server, if a first confidence score of the result of the first automatic speech recognition is greater than or equal to a first threshold,
execute the voice command received from the server if the first confidence score of the result of the first automatic speech recognition is less than the first threshold and greater than or equal to a second threshold,
provide feedback to the user if the first confidence score of the result of the first automatic speech recognition is below the second threshold, and
compare the result of the first automatic speech recognition with the voice command received from the server, in case the first confidence score is below the first threshold, and update the speech recognition model based on a result of the comparison,
Wherein the process of comparing the result of the first automatic speech recognition with the voice command received from the server comprises:
Determining whether the result of the first automatic speech recognition and the voice command received from the server correspond to each other;
Wherein the process of updating the speech recognition model based on the result of the comparison comprises:
When the result of the first automatic speech recognition and the voice command received from the server do not correspond to each other for an accumulated specified number of times, adding the voice command received from the server, together with a second confidence score of the voice command received from the server, to the speech recognition model, based on the second confidence score being higher than the first confidence score of the result of the first automatic speech recognition.
2. The electronic device of claim 1, wherein the processor is further configured to:
Change the first threshold based on a result of comparing the result of the first automatic speech recognition with the voice command received from the server.
3. The electronic device of claim 1, wherein, in the event that the first confidence score is above the first threshold, the processor is configured to perform the operation regardless of receipt of the voice command from the server.
4. The electronic device of claim 1, wherein performing the operation comprises: performing, based on the result of the first automatic speech recognition, at least one function executable by the processor, at least one application, or at least one input.
5. The electronic device of claim 1, wherein providing the feedback comprises: providing a message or audio output indicating that the voice input is not recognized or indicating a low confidence in the result of the first automatic speech recognition.
6. The electronic device of claim 1, wherein the voice command received from the server corresponds to a result of speech recognition performed at the server on the provided voice input, based on a speech recognition model different from the speech recognition model stored in the memory.
7. The electronic device of claim 1, wherein the processor is further configured to:
Compare the result of the first automatic speech recognition with the voice command received from the server in case the first confidence score is above the first threshold; and
Change the first threshold based on a result of the comparison.
8. The electronic device of claim 1, wherein the processor is further configured to: increase the first threshold in the event that the result of the first automatic speech recognition does not correspond to the voice command.
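As a non-authoritative illustration of the model update recited in claim 1 above, the sketch below shows one way the accumulation of mismatches between the on-device result and the server's voice command could gate an addition to the local speech recognition model. The mismatch limit, the dictionary-based model, and all function names are hypothetical choices made for readability, not the claimed implementation.

"""Sketch of the mismatch-accumulation update of claim 1 (assumptions only)."""

MISMATCH_LIMIT = 3  # stand-in for the "specified number of times"


class LocalSpeechModel:
    """Toy stand-in for the speech recognition model stored in memory."""

    def __init__(self):
        self.entries = {}          # voice command -> confidence score
        self.mismatch_counts = {}  # voice command -> accumulated mismatches

    def update_from_server(self, first_result, first_score,
                           server_command, second_score):
        # Nothing to learn when the two recognitions already agree.
        if first_result == server_command:
            return

        # Accumulate how often this particular mismatch has occurred.
        count = self.mismatch_counts.get(server_command, 0) + 1
        self.mismatch_counts[server_command] = count

        # Once the accumulated count reaches the limit, add the server's
        # command and its confidence score to the local model, but only if
        # the server's score beats the local one.
        if count >= MISMATCH_LIMIT and second_score > first_score:
            self.entries[server_command] = second_score
            self.mismatch_counts[server_command] = 0


model = LocalSpeechModel()
for _ in range(MISMATCH_LIMIT):
    model.update_from_server("call mum", 0.55, "call mom", 0.92)
print(model.entries)  # {'call mom': 0.92}

Claims 2, 7, and 8 additionally describe adjusting the first threshold on the basis of the same comparison; in a sketch of this kind that would simply be another field updated alongside the model entries.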
CN201910261486.7A 2014-04-07 2015-04-07 Electronic device Active CN109949815B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910261486.7A CN109949815B (en) 2014-04-07 2015-04-07 Electronic device

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US201461976142P 2014-04-07 2014-04-07
US61/976,142 2014-04-07
KR1020150038857A KR102414173B1 (en) 2014-04-07 2015-03-20 Speech recognition using Electronic Device and Server
KR10-2015-0038857 2015-03-20
CN201910261486.7A CN109949815B (en) 2014-04-07 2015-04-07 Electronic device
CN201510162292.3A CN104978965B (en) 2014-04-07 2015-04-07 The speech recognition of electronic device and utilization electronic device and server executes method

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201510162292.3A Division CN104978965B (en) 2014-04-07 2015-04-07 The speech recognition of electronic device and utilization electronic device and server executes method

Publications (2)

Publication Number Publication Date
CN109949815A CN109949815A (en) 2019-06-28
CN109949815B true CN109949815B (en) 2024-06-07

Family

ID=91330616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910261486.7A Active CN109949815B (en) 2014-04-07 2015-04-07 Electronic device

Country Status (1)

Country Link
CN (1) CN109949815B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111554285A (en) * 2020-04-26 2020-08-18 三一重机有限公司 Voice control system and control method thereof

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1325527A (en) * 1998-09-09 2001-12-05 单一声音技术公司 Interactive user interface using speech recognition and natural language
CN1783213A (en) * 2004-12-01 2006-06-07 国际商业机器公司 Methods and apparatus for automatic speech recognition
CN101075434A (en) * 2006-05-18 2007-11-21 富士通株式会社 Voice recognition apparatus and recording medium storing voice recognition program
CN102708865A (en) * 2012-04-25 2012-10-03 北京车音网科技有限公司 Method, device and system for voice recognition
CN103079258A (en) * 2013-01-09 2013-05-01 广东欧珀移动通信有限公司 Method for improving speech recognition accuracy and mobile intelligent terminal
CN103106900A (en) * 2013-02-28 2013-05-15 用友软件股份有限公司 Voice recognition device and voice recognition method
CN103456306A (en) * 2012-05-29 2013-12-18 三星电子株式会社 Method and apparatus for executing voice command in electronic device

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1325527A (en) * 1998-09-09 2001-12-05 单一声音技术公司 Interactive user interface using speech recognition and natural language
CN1783213A (en) * 2004-12-01 2006-06-07 国际商业机器公司 Methods and apparatus for automatic speech recognition
CN101075434A (en) * 2006-05-18 2007-11-21 富士通株式会社 Voice recognition apparatus and recording medium storing voice recognition program
CN102708865A (en) * 2012-04-25 2012-10-03 北京车音网科技有限公司 Method, device and system for voice recognition
CN103456306A (en) * 2012-05-29 2013-12-18 三星电子株式会社 Method and apparatus for executing voice command in electronic device
CN103079258A (en) * 2013-01-09 2013-05-01 广东欧珀移动通信有限公司 Method for improving speech recognition accuracy and mobile intelligent terminal
CN103106900A (en) * 2013-02-28 2013-05-15 用友软件股份有限公司 Voice recognition device and voice recognition method

Also Published As

Publication number Publication date
CN109949815A (en) 2019-06-28

Similar Documents

Publication Publication Date Title
US10643621B2 (en) Speech recognition using electronic device and server
CN110199350B (en) Method for sensing end of speech and electronic device implementing the method
EP3382530B1 (en) Operating method of electronic device for function execution based on voice command in locked state and electronic device supporting the same
US11636861B2 (en) Electronic device and method of operation thereof
KR102414173B1 (en) Speech recognition using Electronic Device and Server
EP3593347B1 (en) Method for operating speech recognition service and electronic device supporting the same
US11955124B2 (en) Electronic device for processing user speech and operating method therefor
US10217477B2 (en) Electronic device and speech recognition method thereof
EP3340239A1 (en) Electronic device and speech recognition method therefor
US20170243578A1 (en) Voice processing method and device
EP2816554A2 (en) Method of executing voice recognition of electronic device and electronic device using the same
KR20180060328A (en) Electronic apparatus for processing multi-modal input, method for processing multi-modal input and sever for processing multi-modal input
US11537360B2 (en) System for processing user utterance and control method of same
KR20180116726A (en) Voice data processing method and electronic device supporting the same
KR20180099423A (en) Operating method of electronic device for function execution based on speech command in Locked state and electronic device supporting the same
KR20180108321A (en) Electronic device for performing an operation for an user input after parital landing
CN109949815B (en) Electronic device
US9723402B2 (en) Audio data processing method and electronic device supporting the same
KR101993368B1 (en) Electronic apparatus for processing multi-modal input, method for processing multi-modal input and sever for processing multi-modal input

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant