WO2019127112A1

WO2019127112A1 - Voice interaction method and device and intelligent terminal

Info

Publication number: WO2019127112A1
Application number: PCT/CN2017/119039
Authority: WO
Inventors: 张含波
Original assignee: 深圳前海达闼云端智能科技有限公司
Priority date: 2017-12-27
Filing date: 2017-12-27
Publication date: 2019-07-04
Also published as: CN108369805A; CN108369805B

Abstract

Embodiments of the present invention provide a voice interaction method and device and an intelligent terminal. The method comprises: detecting noise information of a current interaction environment when receiving a voice interaction instruction, the noise information comprising noise volume and noise frequency; determining, according to the noise frequency, a main frequency used for synthesizing a response voice corresponding to the voice interaction instruction; synthesizing the response voice according to the main frequency; determining the volume at which the response voice is played according to the noise volume, the noise frequency, and the main frequency of the response voice; and playing the response voice at the determined volume. By means of the technical solution, the embodiments of the present application can dynamically adjust the main frequency of the response voice and the playback volume according to noise information on the current interaction environment based on the sound masking effect such that a user can obtain better voice interaction experience under any interaction environment.

Description

Voice interaction method, device and intelligent terminal

Technical field

The present application relates to the field of artificial intelligence technologies, and in particular, to a voice interaction method, device, and smart terminal.

Background technique

With the continuous development of artificial intelligence technology, intelligent terminals such as intelligent robots, smart homes, smart phones, smart home appliances, and smart car devices have been favored by more and more users, and people's lives have gradually entered the era of artificial intelligence.

Among these, for the user's convenience, many smart terminals are configured with a voice interaction function that allows them to respond to the user by voice. Generally, on receiving a voice interaction instruction, the smart terminal generates a response text according to the instruction, performs text-to-speech (TTS) conversion on the response text to synthesize a response voice, and finally plays the synthesized response voice to the user.

In the process of implementing the present application, the inventor found that, when producing speech from the response text, current smart terminals basically synthesize the response voice at a preset frequency and play the synthesized response voice at a fixed volume, without considering the noise condition of the interaction environment. As a result, the response voice sometimes sounds too quiet to the user, who cannot hear the content clearly; at other times it sounds too loud, which does not suit the atmosphere and may even startle the user. A response voice that the user hears as too loud or too quiet during voice interaction degrades the user experience.

Therefore, existing voice interaction technologies have yet to be improved and developed.

Summary of the invention

The embodiment of the present invention provides a voice interaction method, device, and intelligent terminal, which can solve the problem that the existing human-computer interaction experience is greatly affected by the noise condition of the interaction environment, which is not conducive to improving the user experience.

To solve the above technical problem, the embodiments of the present application provide the following technical solutions:

In a first aspect, the embodiment of the present application provides a voice interaction method, which is applied to a smart terminal, and the method includes:

When receiving the voice interaction instruction, detecting noise information of the current interaction environment, where the noise information includes a noise volume and a noise frequency;

Determining, according to the noise frequency, a main frequency for synthesizing a response voice corresponding to the voice interaction instruction;

Synthesizing the response voice based on the main frequency;

Determining a volume for playing the response voice according to the noise volume, the noise frequency, and the main frequency of the response voice;

Playing the response voice at the determined volume.

In a second aspect, the embodiment of the present application provides a voice interaction device, which is implemented in an intelligent terminal, and includes:

a noise detecting unit, configured to detect noise information of a current interaction environment when the voice interaction instruction is received, where the noise information includes a noise volume and a noise frequency;

a main frequency determining unit, configured to determine, according to the noise frequency, a main frequency for synthesizing a response voice corresponding to the voice interaction instruction;

a speech synthesis unit, configured to synthesize the response voice based on the main frequency;

a volume determining unit, configured to determine, according to the noise volume, the noise frequency, and a main frequency of the response voice, a volume of playing the response voice;

a playing unit, configured to play the response voice at the determined volume.

In a third aspect, an embodiment of the present application provides an intelligent terminal, including:

At least one processor; and,

a memory communicatively coupled to the at least one processor; wherein

The memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform a voice interaction method as described above.

In a fourth aspect, an embodiment of the present application provides a non-transitory computer readable storage medium storing computer executable instructions for causing a smart terminal to execute the voice interaction method as described above.

In a fifth aspect, the embodiment of the present application further provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by the smart terminal, cause the smart terminal to perform the voice interaction method as described above.

The beneficial effects of the embodiments of the present application are as follows. In the voice interaction method, device, and intelligent terminal provided by the embodiments of the present application, noise information of the current interaction environment, including the noise volume and the noise frequency, is detected when a voice interaction instruction is received; a main frequency for synthesizing a response voice corresponding to the voice interaction instruction is determined according to the noise frequency; the response voice is synthesized based on the main frequency; the volume for playing the response voice is determined according to the noise volume, the noise frequency, and the main frequency of the response voice; and finally the response voice is played at the determined volume. Based on the masking effect of sound, the main frequency and the playback volume of the response voice can thus be adjusted dynamically according to the noise information of the current interaction environment, so that the user obtains a better voice interaction experience in any interaction environment.

DRAWINGS

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art may obtain other drawings from them.

FIG. 1 is a schematic diagram of one application environment of a voice interaction method provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of a voice interaction method according to an embodiment of the present application;

FIG. 3 is a schematic flowchart of another voice interaction method provided by an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a voice interaction apparatus according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of an intelligent terminal according to an embodiment of the present application.

Detailed description

In order to make the objects, technical solutions, and advantages of the present application more comprehensible, the present application will be further described in detail below with reference to the accompanying drawings and embodiments. It is understood that the specific embodiments described herein are merely illustrative of the application and are not intended to be limiting.

It should be noted that, provided there is no conflict, the features in the embodiments of the present application may be combined with each other, and all such combinations fall within the protection scope of the present application. In addition, although functional modules are partitioned in the device schematics and a logical sequence is shown in the flowcharts, in some cases the steps shown or described may be performed with a module partitioning different from that in the device, or in an order different from that in the flowcharts. Moreover, the words "first", "second", "third" and the like used in the present application do not limit the data or the order of execution; they only distinguish identical or similar items whose functions and effects are substantially the same.

At present, most smart terminals synthesize the response voice at a specific frequency and play the synthesized response voice at a fixed volume during voice interaction, so both the main frequency and the volume of the sound emitted by the smart terminal are fixed. However, when the smart terminal is in interaction environments with different noise conditions, the sound that the user hears from the smart terminal is often sometimes too loud and sometimes too quiet. For example, suppose that a smart terminal, such as a robot, is located in a shopping mall. When the mall is crowded, the interaction environment is relatively noisy, and a user interacting with the smart terminal by voice perceives its voice as quiet and often cannot hear the response content. When the mall has few visitors, the interaction environment is relatively quiet, and the user perceives the smart terminal's voice as loud, which may make the user uncomfortable or startled.

The inventor found that the main reason is that human auditory perception is generally affected by the "masking effect" of sound. When a person listens to a sound in a quiet environment, the sound can be heard even if its volume is low. However, if another sound (the masking sound) is present at the same time, it impairs the ear's perception of the first sound, whose volume must then be increased for it to be heard. That is, the ear's hearing threshold for the sound is raised, and the number of decibels by which the hearing threshold is raised is called the "masking amount". A large number of studies have shown that the masking of one sound (the listened-to sound) by another (the masking sound) depends on many factors, mainly the relative intensity and the frequency structure of the two sounds.

Based on this, the embodiment of the present application provides a voice interaction method, a voice interaction device, an intelligent terminal, a non-transitory computer readable storage medium, and a computer program product.

The voice interaction method provided by the embodiment of the present application is a method that, based on the masking effect of sound, dynamically adjusts the main frequency and the playback volume of the response voice emitted by the smart terminal according to the noise information of the current interaction environment. Specifically: when a voice interaction instruction is received, noise information of the current interaction environment is detected, the noise information including a noise volume and a noise frequency; a main frequency for synthesizing the response voice corresponding to the voice interaction instruction is then determined according to the noise frequency, and the response voice is synthesized based on the main frequency; the volume for playing the response voice is determined according to the noise volume, the noise frequency, and the main frequency of the response voice; and finally the response voice is played at the determined volume. In this way, the main frequency and the playback volume of the synthesized response voice can be adjusted dynamically according to the noise conditions of different interaction environments, so that the user can hear the response content of the smart terminal clearly in any interaction environment without being startled by an excessively loud sound, and thus obtains a better voice interaction experience.

The voice interaction device provided by the embodiment of the present application is a virtual device composed of software programs that can implement the voice interaction method provided by the embodiment of the present application. It is based on the same inventive concept as the voice interaction method and has the same technical features and beneficial effects.

The smart terminal provided by the embodiment of the present application may be any type of electronic device, such as a robot, a smart phone, a personal computer, a tablet computer, a wearable smart device, a smart home appliance, and the like. The smart terminal can perform the voice interaction method provided by the embodiment of the present application, or run the voice interaction device provided by the embodiment of the present application.

Specifically, the embodiments of the present application are further described below in conjunction with the accompanying drawings.

FIG. 1 is a schematic diagram of one application environment of a voice interaction method provided by an embodiment of the present application. The location of the application environment may be fixed. For example, the location of the application environment may be a mall or an outdoor location; or the location of the application environment may be variable. The embodiment does not specifically limit this.

Specifically, as shown in FIG. 1, the application environment may include a user 10 and a smart terminal 20.

The user 10 can be any object capable of voice interaction with the smart terminal 20 (that is, an "interaction object" of the smart terminal 20). The user 10 can interact with the smart terminal 20 through one or more user interaction devices of any suitable type (such as a mouse, a keyboard, a remote controller, a touch screen, a somatosensory camera, or an audio collection device) to input instructions or to control the smart terminal 20 to perform one or more operations.

The smart terminal 20 can be any suitable type of electronic device that has certain logical computing capabilities and provides one or more functions capable of satisfying the user's intention, for example a robot, a personal computer, a tablet, a smart phone, or a wearable smart device. The smart terminal 20 can include any suitable type of storage medium for storing data, such as a magnetic disk, a compact disc (CD-ROM), a read-only memory, or a random access memory. The smart terminal 20 may also include one or more logical operation modules that perform any suitable type of function or operation, such as receiving interaction instructions or synthesizing response voices for interaction, in a single thread or in multiple threads in parallel. The logical operation module may be any suitable type of electronic circuit or chip-type electronic device capable of performing logical operations, such as a single-core processor, a multi-core processor, or an audio processor.

In an actual application, the user 10 can perform voice interaction with the smart terminal 20 in any suitable manner. For example, the user 10 can input a voice interaction instruction to the smart terminal 20 through an interaction device such as a mouse, a keyboard, a touch screen, or a somatosensory device; on receiving the voice interaction instruction, the smart terminal 20 can use the voice interaction method provided by the embodiment of the present application to make a voice response to the user 10. As another example, the user 10 can input voice control information through the voice collecting device of the smart terminal 20; after parsing the voice control information, the smart terminal 20 obtains the corresponding voice interaction instruction and, based on it, uses the voice interaction method provided by the embodiment of the present application to make a voice response to the user 10.

Specifically, in the embodiment of the present application, when the smart terminal 20 receives a voice interaction instruction, for example when it receives the voice control information "How long does it take to wait for the 25th?" input by the user 10, or when it receives the voice interaction instruction "ranking query" input by the user 10 on its touch screen, the smart terminal 20 may first detect noise information of the current interaction environment (that is, the environment in which the user 10 is currently interacting with the smart terminal 20), the noise information including a noise volume and a noise frequency. It then determines, according to the noise frequency, a main frequency for synthesizing the response voice corresponding to the voice interaction instruction and synthesizes the response voice based on that main frequency; for example, it synthesizes a response voice with a specific main frequency whose content is "you still have to wait 30 minutes". Next, it determines the volume for playing the response voice according to the noise volume, the noise frequency, and the main frequency of the response voice. Finally, it plays the response voice at the determined volume.

It should be noted that the voice interaction method provided by the embodiment of the present application may be further extended to other suitable application environments and is not limited to the application environment shown in FIG. 1. Although only three users 10 and two smart terminals 20 are shown in FIG. 1, those skilled in the art will understand that, in an actual application, the application environment may include more or fewer users and smart terminals.

FIG. 2 is a schematic flowchart of a voice interaction method according to an embodiment of the present application, and the method may be performed by any type of smart terminal as described above.

Specifically, referring to FIG. 2, the method may include but is not limited to the following steps:

Step 110: When receiving the voice interaction instruction, detecting noise information of the current interaction environment, where the noise information includes a noise volume and a noise frequency.

In the present embodiment, the "voice interaction command" refers to an instruction capable of instructing the smart terminal to make a specific voice response. The intelligent terminal can make different voice responses for different voice interaction instructions.

The voice interaction instruction may be triggered by control information that the user inputs to the smart terminal. Depending on the manner of interaction, the control information may include, but is not limited to, touch control information and voice control information. For example, the user can input the touch control information "query the location of store A" through the touch screen of the smart terminal to instruct the smart terminal to give "the specific location of store A" by voice; as another example, the user can input the voice control information "Where is store A" through the sound collection device (for example, a microphone) of the smart terminal to instruct the smart terminal to give "the specific location of store A" by voice.

Alternatively, the voice interaction instruction may be triggered automatically by the smart terminal itself under a preset condition. For example, when a welcome robot detects that a customer is approaching, it can automatically trigger a voice interaction instruction that instructs it to give the customer a "welcome" voice response. As another example, when the driving wheel of a sweeping robot becomes entangled, the robot can automatically trigger a voice interaction instruction that instructs it to issue the voice prompt "the driving wheel is wound, please check", informing the user that the sweeping robot is currently entangled.

In this embodiment, the "current interaction environment" refers to the environment in which the smart terminal interacts with the user when the voice interaction instruction is received; the "noise information" refers to sound information in the interaction environment that is unrelated to the interaction content, and includes the noise volume and the noise frequency. The "noise volume" is the intensity or loudness of the noise, and the "noise frequency" is the main frequency component of the noise.

Specifically, in this embodiment, when the user inputs the control information to the smart terminal through any interactive manner, or when the smart terminal itself meets the preset condition, the smart terminal may receive the corresponding voice interaction instruction. The intelligent terminal needs to first detect the noise of the current interactive environment, obtain the noise volume and the noise frequency of the current interactive environment according to the acoustic features in the noise, and then perform the following step 120.
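The source does not say how the noise volume and noise frequency are extracted from the acoustic features. As a minimal sketch, assuming the noise is available as a short window of normalized samples, the volume could be taken as the RMS level in decibels and the noise frequency as the strongest bin of a discrete Fourier transform. The function names, the reference level `ref`, and the choice of a naive DFT are all assumptions made for illustration.

```python
import cmath
import math


def noise_volume_db(samples, ref=1.0):
    """Return the RMS level of the window in dB relative to `ref` (assumed scale)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(max(rms, 1e-12) / ref)


def dominant_frequency_hz(samples, sample_rate):
    """Return the frequency of the strongest DFT bin, ignoring the DC bin."""
    n = len(samples)
    best_bin, best_mag = 1, 0.0
    for k in range(1, n // 2):
        # Naive DFT of bin k; adequate for a short analysis window.
        acc = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                  for i, s in enumerate(samples))
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    return best_bin * sample_rate / n


def detect_noise(samples, sample_rate):
    """Return (noise volume in dB, noise frequency in Hz) for one window."""
    return noise_volume_db(samples), dominant_frequency_hz(samples, sample_rate)
```

The level returned here is relative to the sample scale rather than a calibrated sound pressure level; a real device would need a microphone calibration that the source does not describe.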

Step 120: Determine, according to the noise frequency, a main frequency for synthesizing a response voice corresponding to the voice interaction instruction.

In this embodiment, the "response voice" refers to the voice response made by the smart terminal to the user, and its voice content corresponds to the voice interaction instruction received by the smart terminal. For example, if the received voice interaction instruction instructs the smart terminal to issue a prompt that the driving wheel is wound, the content of the corresponding response voice is "the driving wheel is wound, please check". As another example, if the received voice interaction instruction instructs the smart terminal to answer "Where is store A" by voice, the content of the corresponding response voice may be "store A is at the corner 50 meters ahead". The "main frequency" is the main frequency component of the response voice.

In this embodiment, based on the "masking effect" of the sound, when detecting the noise frequency of the current interactive environment, the main frequency for synthesizing the response voice corresponding to the voice interaction instruction may be first determined according to the noise frequency. In general, in the "frequency domain masking effect", the low frequency sound can mask the high frequency sound, and therefore, it can be determined that the main frequency for synthesizing the response voice corresponding to the voice interactive command is lower than the noise frequency.

Because the relationship between sound frequency and the masking curve is not linear in the "masking effect", the concept of the "critical band" is generally introduced in order to measure sound frequency uniformly in perceptual terms. There are 24 critical bands in the range of 20 Hz to 16 kHz. The unit of the critical band is the Bark, where 1 Bark is the width of one critical band: when the frequency f < 500 Hz, the Bark value is approximately f/100; when f > 500 Hz, it is approximately 9 + 4·log2(f/1000). Therefore, in this embodiment, determining, according to the noise frequency, the main frequency for synthesizing the response voice corresponding to the voice interaction instruction may specifically be: determining the critical band in which the noise frequency is located, and then determining, based on that critical band, the main frequency for synthesizing the response voice. The critical band in which the noise frequency is located may be determined by referring to a critical band table.
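The critical-band lookup described above can be sketched as follows. The band edges are the commonly cited critical-band table, which is an assumption here because the source does not reproduce its table, and that table ends at 15.5 kHz rather than 16 kHz. Bands are numbered from zero, which matches the source's examples (400 Hz to 510 Hz as the fourth band, 100 Hz to 200 Hz as the first). The Bark approximation used is the usual one, f/100 below 500 Hz and 9 + 4·log2(f/1000) above, whose two branches meet at 5 Bark.

```python
import bisect
import math

# Assumed critical-band edges in Hz (24 bands, 25 edges).
BAND_EDGES_HZ = [20, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
                 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300,
                 6400, 7700, 9500, 12000, 15500]


def bark_approx(freq_hz):
    """Approximate Bark value for a frequency in Hz."""
    if freq_hz < 500:
        return freq_hz / 100.0
    return 9.0 + 4.0 * math.log2(freq_hz / 1000.0)


def critical_band_index(freq_hz):
    """Return the zero-based index of the critical band containing freq_hz."""
    if not BAND_EDGES_HZ[0] <= freq_hz < BAND_EDGES_HZ[-1]:
        raise ValueError("frequency outside the tabulated range")
    return bisect.bisect_right(BAND_EDGES_HZ, freq_hz) - 1
```

With this table, a 450 Hz noise falls in band 4 and a 250 Hz response voice in band 2.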

Also, in the "masking effect", the closer two frequencies are to each other, the more they mask each other; a high-frequency sound is easily masked by a low-frequency sound (especially when the low-frequency sound is loud), whereas it is difficult for a high-frequency sound to mask a low-frequency sound. Therefore, in this embodiment, determining the main frequency for synthesizing the response voice based on the critical band may specifically be: determining that the main frequency is a frequency value in a critical band lower than the critical band in which the noise frequency is located, so that the main frequency is lower than the noise frequency and the two critical bands are separated by a certain distance. In this way the low-frequency sound (the response voice) masks the high-frequency sound (the noise), while the two sounds are prevented from masking each other because their frequencies are too close. For example, if the noise frequency is located in the fourth critical band (corresponding frequency range: 400 Hz to 510 Hz), the main frequency for synthesizing the response voice may be determined to be 250 Hz (which lies in the second critical band).

In addition, in some embodiments, the critical band in which the noise frequency is located may belong to the low-frequency range, for example the first critical band (corresponding frequency range: 100 Hz to 200 Hz). In this case, continuing to rely on a low-frequency sound masking a high-frequency sound may make it difficult to improve the user's auditory sensitivity to the voice played by the smart terminal (that is, the response voice) and may give the user a poor listening experience. The main frequency for synthesizing the response voice may then be determined to be much higher than the noise frequency, for example 1000 Hz (which lies in the eighth critical band).
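The two selection rules can be combined in a short sketch. The source gives only two data points (noise in the fourth band leads to 250 Hz, noise in the first band leads to 1000 Hz), so everything else is an assumption: the cutoff `LOW_BAND_LIMIT` for the "low frequency range", the two-band gap `BAND_GAP`, the use of the band center as the main frequency, and the band-edge table itself.

```python
import bisect

# Assumed critical-band edges in Hz, with bands numbered from zero.
BAND_EDGES_HZ = [20, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
                 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300,
                 6400, 7700, 9500, 12000, 15500]

LOW_BAND_LIMIT = 3         # assumption: bands 0 to 3 count as "low frequency"
BAND_GAP = 2               # assumption: matches the fourth-to-second band example
HIGH_FALLBACK_HZ = 1000.0  # the source's example for low-frequency noise


def choose_main_frequency(noise_hz):
    """Return a main frequency in Hz for the response voice."""
    band = bisect.bisect_right(BAND_EDGES_HZ, noise_hz) - 1
    if band < 0 or band >= len(BAND_EDGES_HZ) - 1:
        raise ValueError("noise frequency outside the tabulated range")
    if band <= LOW_BAND_LIMIT:
        # Low-frequency noise: go well above it instead of below it.
        return HIGH_FALLBACK_HZ
    # Otherwise take the center of a band BAND_GAP bands below the noise band.
    target = band - BAND_GAP
    return (BAND_EDGES_HZ[target] + BAND_EDGES_HZ[target + 1]) / 2.0
```

Under these assumptions a 450 Hz noise yields 250 Hz and a 150 Hz noise yields 1000 Hz, which reproduces both examples in the text.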

Step 130: Synthesize the response voice based on the primary frequency.

In this embodiment, when the smart terminal receives the voice interaction instruction, it may first generate a response text according to the instruction, the response text containing the voice content with which the smart terminal responds to the instruction. Then, based on the main frequency determined in step 120, it performs TTS (Text To Speech) conversion on the response text and synthesizes a response voice that has the specific main frequency and corresponds to the received voice interaction instruction.

In this embodiment, a mapping between voice interaction instructions and response texts may be established in a database of the smart terminal, so that when the smart terminal receives a voice interaction instruction, it can look up the corresponding response text and then synthesize the response voice corresponding to the instruction based on the determined main frequency (that is, perform TTS conversion on the response text based on the main frequency).
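The instruction-to-text mapping can be illustrated with an in-memory dictionary standing in for the database. The instruction keys and the shape of the request passed to the TTS stage are hypothetical; the source names no particular TTS engine, so the synthesis call itself is left out.

```python
# Hypothetical instruction identifiers mapped to response texts from the examples.
RESPONSE_TEXTS = {
    "drive_wheel_wound": "The driving wheel is wound, please check.",
    "customer_approaching": "Welcome.",
}


def build_tts_request(instruction, main_frequency_hz):
    """Look up the response text and pair it with the chosen main frequency."""
    text = RESPONSE_TEXTS.get(instruction)
    if text is None:
        raise KeyError(f"no response text for instruction {instruction!r}")
    # A TTS engine would consume this request; its interface is not specified.
    return {"text": text, "main_frequency_hz": main_frequency_hz}
```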

Step 140: Determine, according to the noise volume, the noise frequency, and the main frequency of the response voice, the volume of playing the response voice.

According to the "masking effect", the masking of one sound by another is also related to volume: the louder one sound is, the more it masks the other. Therefore, in this embodiment, the volume at which the smart terminal plays the response voice is adjusted dynamically so that the response voice masks the noise of the interaction environment and the user can hear the response voice clearly in any noise environment.

Thus, in this embodiment, after the response voice is synthesized at a specific main frequency, the volume for playing the response voice is also determined according to the noise volume, the noise frequency, and the main frequency of the response voice. Different frequency relationships produce different masking effects: a low-frequency sound masks a high-frequency sound strongly, whereas a high-frequency sound masks a low-frequency sound only weakly. Therefore, in this embodiment, the masking amount may first be determined according to the noise frequency and the main frequency of the response voice, and the volume for playing the response voice may then be determined according to the noise volume and the masking amount.

Specifically, because a low-frequency sound masks a high-frequency sound strongly while a high-frequency sound masks a low-frequency sound only weakly, determining the masking amount according to the noise frequency and the main frequency of the response voice may be implemented as follows: if the noise frequency is lower than the main frequency of the response voice, the masking amount is determined to be a first masking amount; if the noise frequency is higher than the main frequency of the response voice, the masking amount is determined to be a second masking amount; the first masking amount is greater than the second masking amount. Further, determining the volume for playing the response voice according to the noise volume and the masking amount may be implemented by using the sum of the noise volume and the masking amount as the playback volume of the response voice.
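The sum rule can be sketched as below. The two masking amounts are hypothetical placeholder values chosen only so that the first exceeds the second; the source gives no numbers. The source also does not cover the case of equal frequencies, which this sketch treats like the second case.

```python
FIRST_MASKING_DB = 10.0   # hypothetical: noise frequency below the main frequency
SECOND_MASKING_DB = 5.0   # hypothetical: must be smaller than the first amount


def masking_amount_db(noise_hz, main_hz):
    """Return the masking amount for the given frequency relationship."""
    if noise_hz < main_hz:
        # Low-frequency noise masks a higher-pitched response voice strongly.
        return FIRST_MASKING_DB
    return SECOND_MASKING_DB


def playback_volume_db(noise_db, noise_hz, main_hz):
    """Playback volume is the noise volume plus the masking amount."""
    return noise_db + masking_amount_db(noise_hz, main_hz)
```

For a 60 dB noise, these placeholders give 70 dB when the noise is at 150 Hz and the response at 1000 Hz, and 65 dB when the noise is at 450 Hz and the response at 250 Hz.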

In addition, in other embodiments, the volume at which the response voice is played may also be determined from the noise volume, the noise frequency, and the main frequency of the response voice as follows: first, an adjustment coefficient is determined according to the noise frequency and the main frequency of the response voice; then, the volume at which the response voice is played is determined according to the product of the noise volume and the adjustment coefficient, wherein the adjustment coefficient is greater than one.
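
A minimal sketch of this multiplicative variant follows. The text only states that the coefficient depends on the two frequencies and is greater than one; the specific coefficients (1.5 and 1.25) and the choice to use the larger one when the noise is lower in frequency than the response voice are assumptions that mirror the asymmetry of the masking effect.

```python
def adjustment_coefficient(noise_freq_hz: float, main_freq_hz: float,
                           low_noise_coeff: float = 1.5,
                           high_noise_coeff: float = 1.25) -> float:
    """Return a coefficient greater than one (both defaults are hypothetical)."""
    if min(low_noise_coeff, high_noise_coeff) <= 1.0:
        raise ValueError("adjustment coefficient must be greater than one")
    # Assumption: low-frequency noise masks more strongly, so it gets the
    # larger coefficient.
    if noise_freq_hz < main_freq_hz:
        return low_noise_coeff
    return high_noise_coeff


def scaled_playback_volume(noise_volume: float, noise_freq_hz: float,
                           main_freq_hz: float) -> float:
    """Playback volume = noise volume x adjustment coefficient."""
    return noise_volume * adjustment_coefficient(noise_freq_hz, main_freq_hz)
```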

Moreover, in still other embodiments, steps 120 through 140 may also be performed in combination. Specifically, a relationship comparison table as shown in Table 1 is established in advance according to the "masking effect". By searching the relationship comparison table, the masking amount, the main frequency of the synthesized response voice, and the playback volume of the response voice corresponding to each item of noise information (including the critical frequency band in which the noise frequency is located and the noise volume) can be determined. Here, n in Table 1 may be a variable determined according to the actually detected noise volume; the data in Table 1 are merely illustrative and are not intended to limit the embodiments of the present application.

Table 1 relationship comparison table

In this embodiment, when the noise volume and the noise frequency of the current interactive environment are detected, the critical frequency band in which the noise frequency is located may first be determined. Then, by querying Table 1 above, the response voice is synthesized directly at the main frequency corresponding to that critical frequency band, and the volume at which the response voice is played is determined.
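
The table-driven variant can be sketched as a simple lookup. Because the contents of Table 1 are not reproduced here, every row in the table below (band edges, main frequencies, and masking amounts) is a hypothetical placeholder chosen for illustration, as are the function and variable names. The playback volume is computed as n plus the masking amount, where n is the detected noise volume.

```python
# Hypothetical stand-in for Table 1. Each row is:
# (band lower edge Hz, band upper edge Hz, main frequency Hz, masking amount dB)
RELATION_TABLE = [
    (100.0, 200.0, 1000.0, 10.0),   # low-frequency noise, higher main frequency
    (920.0, 1080.0, 300.0, 5.0),    # mid-frequency noise, lower main frequency
    (2000.0, 2320.0, 500.0, 5.0),   # high-frequency noise, lower main frequency
]


def lookup_response_parameters(noise_freq_hz: float, noise_volume_db: float):
    """Return (main frequency, playback volume) for the detected noise.

    The critical band containing the noise frequency selects the row; the
    playback volume is n + masking amount, with n the detected noise volume.
    """
    for low, high, main_freq_hz, masking_db in RELATION_TABLE:
        if low <= noise_freq_hz < high:
            return main_freq_hz, noise_volume_db + masking_db
    raise LookupError(f"no table row covers {noise_freq_hz} Hz")
```

A single lookup replaces the separate main-frequency, masking-amount, and volume computations, which is the point of combining steps 120 through 140.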

Step 150: Play the response voice at the determined volume.

In this embodiment, after determining the playback volume of the response voice, the smart terminal can play the response voice at the determined volume through any sounding device, such as a speaker or the like.

In this embodiment, since the main frequency of the response voice avoids the critical frequency band in which the noise frequency is located, and the playback volume of the response voice is greater than the noise volume, the response voice can mask the noise, so that the user can clearly hear the response voice emitted by the smart terminal in an interactive environment with any noise condition. At the same time, because the main frequency and the playback volume of the response voice are determined based on the noise information of the current interactive environment, the problem of startling the user with an excessively loud sound does not arise.

Further, the "masking effect" of sound includes not only masking between simultaneously emitted sounds but also masking between temporally adjacent sounds, which is called "time domain masking". Time domain masking includes lead masking and lag masking. It arises mainly because the human brain needs a certain amount of time to process information. Generally, lead masking is very short, lasting only 5 to 20 ms, whereas lag masking can last for 50 to 200 ms.

Based on this, in other embodiments, when the voice interaction instruction received by the smart terminal is triggered by voice control information input by the user, the following may be done to prevent the user's speech from causing time domain masking of the response voice played by the smart terminal. Playing the response voice at the determined volume specifically includes: acquiring the time node at which the voice interaction instruction triggered by the voice control information is received (i.e., the time node at which the user finishes the inquiry); and playing the response voice at the determined volume after a preset duration has elapsed from the time node. The preset duration may be 200 ms.
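
The delayed playback can be sketched as follows. The 200 ms preset duration comes from the text; the function names, the use of a monotonic clock, and the injectable `clock` and `sleep` parameters (included so the logic can be exercised without real waiting) are assumptions of this example.

```python
import time

PRESET_DELAY_S = 0.2  # 200 ms, the example preset duration given in the text


def seconds_until_playback(instruction_time_s: float, now_s: float,
                           preset_delay_s: float = PRESET_DELAY_S) -> float:
    """Time still to wait so that playback starts after the preset duration."""
    return max(0.0, instruction_time_s + preset_delay_s - now_s)


def play_after_delay(instruction_time_s: float, play,
                     clock=time.monotonic, sleep=time.sleep) -> None:
    """Wait out the remainder of the preset duration, then call play()."""
    remaining = seconds_until_playback(instruction_time_s, clock())
    if remaining > 0:
        sleep(remaining)
    play()
```

Here `instruction_time_s` would be recorded with the same clock at the moment the user finishes speaking. If synthesis already took longer than the preset duration, the response is played immediately.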

According to the foregoing technical solution, the voice interaction method provided by the embodiment of the present application detects noise information of the current interaction environment when the voice interaction instruction is received, where the noise information includes a noise volume and a noise frequency; determines, according to the noise frequency, a main frequency for synthesizing a response voice corresponding to the voice interaction instruction; synthesizes the response voice based on the main frequency; determines the volume of playing the response voice according to the noise volume, the noise frequency, and the main frequency of the response voice; and finally plays the response voice at the determined volume. Based on the masking effect of sound, the main frequency and the playback volume of the response voice can thus be dynamically adjusted according to the noise information of the current interaction environment, so that the user obtains a better voice interaction experience in any interactive environment.

In addition, considering that each person's hearing sensitivity and personal habits will be different, adjusting the main frequency of the response voice and the volume of playing the response voice based on the same method may have different voice interaction effects for different users. Therefore, further, in the embodiment of the present application, another voice interaction method is also provided.

Specifically, referring to FIG. 3, the method may include but is not limited to the following steps:

Step 210: When receiving the voice interaction instruction, detecting noise information of the current interaction environment, where the noise information includes a noise volume and a noise frequency.

Step 220: Determine, according to the noise frequency, a main frequency for synthesizing a response voice corresponding to the voice interaction instruction.

Step 230: Synthesize the response voice based on the primary frequency.

Step 240: Determine, according to the noise volume, the noise frequency, and the main frequency of the response voice, the volume of playing the response voice.

Step 250: Play the response voice at the determined volume.

Step 260: Acquire interactive experience feedback information.

In this embodiment, the “interactive experience feedback information” refers to the user's evaluation of the voice interaction experience, and is used to evaluate the voice interaction experience between the user and the smart terminal. For example, the interactive experience feedback information may include: the volume of the response voice is too large, the volume of the response voice is appropriate, or the volume of the response voice is too small.

In some embodiments, the interactive experience feedback information may be input by the user to the smart terminal. For example, during the voice interaction or after the voice interaction has ended, the user inputs interactive experience feedback information about the voice interaction experience, so that the smart terminal can adjust the playback volume of the response voice in time, further improving the user experience.

Alternatively, in other embodiments, the smart terminal may itself evaluate the interaction in an appropriate manner and obtain the interactive experience feedback information from the evaluation result. For example, the smart terminal can determine, from the effect of the interaction between the user and the smart terminal, whether the user correctly understands the content of the response voice, or whether the user can clearly hear the response voice of the smart terminal during the interaction.

Step 270: Adjust the volume of playing the response voice according to the interaction experience feedback information.

In this embodiment, when the interactive experience feedback information is acquired, the volume of playing the response voice is adjusted according to the interactive experience feedback information. For example, when the interactive experience feedback information "the volume of the response voice is too large" is acquired, the volume of playing the response voice is reduced; when the interactive experience feedback information "the volume of the response voice is appropriate" is acquired, the volume of playing the response voice is kept unchanged; and when the interactive experience feedback information "the volume of the response voice is too small" is acquired, the volume of playing the response voice is increased.
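
The three feedback cases map naturally onto a small adjustment function. The sketch below is illustrative only: the enumeration names, the 3 dB step, and the 0 to 100 dB clamping range are assumptions, since the text does not specify how much the volume changes.

```python
from enum import Enum


class Feedback(Enum):
    TOO_LOUD = "the volume of the response voice is too large"
    APPROPRIATE = "the volume of the response voice is appropriate"
    TOO_QUIET = "the volume of the response voice is too small"


def adjust_volume(current_db: float, feedback: Feedback,
                  step_db: float = 3.0,
                  min_db: float = 0.0, max_db: float = 100.0) -> float:
    """Lower, keep, or raise the playback volume according to the feedback.

    step_db, min_db and max_db are hypothetical tuning values.
    """
    if feedback is Feedback.TOO_LOUD:
        target = current_db - step_db
    elif feedback is Feedback.TOO_QUIET:
        target = current_db + step_db
    else:
        target = current_db
    # Keep the result inside the device's assumed output range.
    return min(max_db, max(min_db, target))
```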

It can be understood that, in this embodiment, the interaction experience feedback information may be acquired in real time, so that the volume of playing the response voice may be adjusted in real time according to the interaction experience feedback information. Alternatively, the interaction experience feedback information may be obtained when the interaction process is completed, so that the smart terminal may, when performing the next voice interaction with the user, adjust the playback volume of the response voice and/or the main frequency at which the response voice is synthesized according to the interaction experience feedback information.

It should be noted that the foregoing steps 210 to 250 have the same technical features as steps 110 to 150 in the voice interaction method shown in FIG. 2. Therefore, for their specific implementation, reference may be made to the corresponding descriptions of steps 110 to 150 in the foregoing embodiment, which are not repeated in this embodiment.

According to the foregoing technical solution, the voice interaction method provided by the embodiment of the present application obtains the user's interactive experience feedback information after playing the response voice at the determined volume, and adjusts the volume of playing the response voice according to the interactive experience feedback information. The voice interaction effect can thus be continuously improved for the characteristics of the interaction object, further improving the user experience.

FIG. 4 is a schematic structural diagram of a voice interaction apparatus according to an embodiment of the present disclosure. The apparatus 40 can be implemented on a smart terminal configured with a voice interaction function, and can implement the voice interaction method provided by the foregoing embodiment.

Specifically, referring to FIG. 4, the apparatus 40 may include, but is not limited to, a noise detecting unit 41, a main frequency determining unit 42, a speech synthesizing unit 43, a volume determining unit 44, and a playing unit 45.

The noise detecting unit 41 is configured to detect noise information of a current interaction environment when the voice interaction instruction is received, where the noise information includes a noise volume and a noise frequency;

The main frequency determining unit 42 is configured to determine, according to the noise frequency, a main frequency for synthesizing a response voice corresponding to the voice interaction instruction;

The speech synthesis unit 43 is configured to synthesize the response voice based on the main frequency;

The volume determining unit 44 is configured to determine a volume for playing the response voice according to the noise volume, the noise frequency, and a main frequency of the response voice;

The playing unit 45 is configured to play the response voice at the determined volume.

In practical applications, when the voice interaction instruction is received, the noise information of the current interaction environment may first be detected by the noise detecting unit 41, the noise information including the noise volume and the noise frequency. The main frequency determining unit 42 then determines, according to the noise frequency, the main frequency for synthesizing the response voice corresponding to the voice interaction instruction, and the speech synthesis unit 43 synthesizes the response voice based on the main frequency. Next, the volume determining unit 44 determines the volume at which the response voice is played according to the noise volume, the noise frequency, and the main frequency of the response voice. Finally, the playing unit 45 plays the response voice at the determined volume.
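
The flow through units 41 to 45 can be sketched as a small pipeline in which each unit is supplied as a callable. The class name, method name, and callable signatures are assumptions made for illustration; the embodiment does not prescribe any particular software structure for the units.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class VoiceInteractionDevice:
    # Each field stands in for one unit of apparatus 40 (signatures are hypothetical).
    detect_noise: Callable[[], Tuple[float, float]]            # unit 41: (volume, frequency)
    determine_main_frequency: Callable[[float], float]         # unit 42
    synthesize: Callable[[str, float], bytes]                  # unit 43
    determine_volume: Callable[[float, float, float], float]   # unit 44
    play: Callable[[bytes, float], None]                       # unit 45

    def respond(self, response_text: str) -> float:
        """Run the units in order and return the playback volume that was used."""
        noise_volume, noise_freq = self.detect_noise()
        main_freq = self.determine_main_frequency(noise_freq)
        audio = self.synthesize(response_text, main_freq)
        volume = self.determine_volume(noise_volume, noise_freq, main_freq)
        self.play(audio, volume)
        return volume
```

Passing the units in as callables keeps the ordering logic separate from the signal processing, so any one unit can be replaced or stubbed out independently.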

In some embodiments, the main frequency determining unit 42 is specifically configured to: determine a critical frequency band in which the noise frequency is located; and determine, according to the critical frequency band, a main frequency used to synthesize a response voice corresponding to the voice interaction instruction.

In some embodiments, the volume determining unit 44 includes a masking amount determining module 441 and a volume determining module 442.

The masking amount determining module 441 is configured to determine a masking amount according to the noise frequency and the main frequency of the response voice; the volume determining module 442 is configured to determine, according to the noise volume and the masking amount, the volume of playing the response voice. Specifically, in some embodiments, the masking amount determining module 441 is specifically configured to: if the noise frequency is lower than the main frequency of the response voice, determine the masking amount to be a first masking amount; if the noise frequency is higher than the main frequency of the response voice, determine the masking amount to be a second masking amount; the first masking amount is greater than the second masking amount.

In some embodiments, when the voice interaction instruction is triggered by the voice control information, the playing unit 45 is specifically configured to: acquire the time node at which the voice interaction instruction triggered by the voice control information is received; and play the response voice at the determined volume after a preset duration has elapsed from the time node.

In some embodiments, the device 40 further includes: a feedback unit 46 and a volume adjustment unit 47.

The feedback unit 46 is configured to obtain interaction experience feedback information.

The volume adjustment unit 47 is configured to adjust the volume of playing the response voice according to the interaction experience feedback information.

It should be noted that, since the voice interaction device and the voice interaction method in the foregoing method embodiments are based on the same inventive concept, the corresponding content and beneficial effects of the foregoing method embodiments also apply to this device embodiment and are not described in detail here.

According to the foregoing technical solution, in the voice interaction device provided by the embodiment of the present application, the noise detecting unit 41 detects the noise information of the current interaction environment when the voice interaction instruction is received, the noise information including a noise volume and a noise frequency; the main frequency determining unit 42 then determines, according to the noise frequency, a main frequency for synthesizing a response voice corresponding to the voice interaction instruction; the speech synthesis unit 43 synthesizes the response voice based on the main frequency; the volume determining unit 44 determines the volume of playing the response voice according to the noise volume, the noise frequency, and the main frequency of the response voice; and finally the playing unit 45 plays the response voice at the determined volume. Based on the masking effect of sound, the main frequency and the playback volume of the response voice can thus be dynamically adjusted according to the noise information of the current interactive environment, so that the user obtains a better voice interaction experience in any interactive environment.

FIG. 5 is a schematic structural diagram of an intelligent terminal according to an embodiment of the present disclosure. The smart terminal 500 can be any type of electronic device, such as a smart phone, a robot, a personal computer, a wearable smart device, or a smart home appliance, that is capable of executing the voice interaction method provided by the foregoing method embodiment or running the voice interaction device provided by the foregoing device embodiment.

Specifically, referring to FIG. 5, the smart terminal 500 includes:

One or more processors 501 and a memory 502; one processor 501 is taken as an example in FIG. 5.

The processor 501 and the memory 502 may be connected by a bus or by other means; a bus connection is taken as an example in FIG. 5.

The memory 502, as a non-transitory computer readable storage medium, can be used for storing non-transitory software programs, non-transitory computer executable programs, and modules, such as the program instructions/modules corresponding to the voice interaction method in the embodiment of the present application (for example, the noise detecting unit 41, the main frequency determining unit 42, the speech synthesizing unit 43, the volume determining unit 44, the playing unit 45, the feedback unit 46, and the volume adjusting unit 47 shown in FIG. 4). The processor 501 executes the various functional applications and data processing of the voice interaction device 40 by running the non-transitory software programs, instructions, and modules stored in the memory 502, i.e., implements the voice interaction method of any of the above method embodiments.

The memory 502 can include a storage program area and a storage data area, wherein the storage program area can store an operating system and an application required for at least one function, and the storage data area can store data created according to the use of the voice interaction device 40, and the like. Moreover, memory 502 can include high speed random access memory, and can also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 502 can optionally include memory remotely located relative to processor 501, which can be connected to smart terminal 500 over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The one or more modules are stored in the memory 502 and, when executed by the one or more processors 501, perform the voice interaction method in any of the above method embodiments, for example, performing method steps 110 through 150 in FIG. 2 and method steps 210 through 270 in FIG. 3 described above, and implementing the functions of units 41-47 in FIG. 4.

The embodiment of the present application further provides a non-transitory computer readable storage medium storing computer executable instructions. When executed by one or more processors, for example, by a processor 501 in FIG. 5, the computer executable instructions cause the one or more processors to perform the voice interaction method in any of the foregoing method embodiments, for example, to perform method steps 110 to 150 in FIG. 2 and method steps 210 through 270 in FIG. 3 described above, and to implement the functions of units 41-47 in FIG. 4.

The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

Through the description of the above embodiments, those skilled in the art can clearly understand that the various embodiments can be implemented by means of software plus a general hardware platform, and of course also by hardware. One of ordinary skill in the art can understand that all or part of the process of implementing the above embodiments can be completed by a computer program in a computer program product. The computer program can be stored in a non-transitory computer readable storage medium and includes program instructions; when the program instructions are executed by the smart terminal, the smart terminal can be caused to execute the flow of the embodiments of the foregoing methods. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).

The foregoing products (including the smart terminal, the non-transitory computer readable storage medium, and the computer program product) can perform the voice interaction method provided by the embodiment of the present application, and have the corresponding functional modules and beneficial effects of executing the voice interaction method. For details of the technical details that are not described in detail in this embodiment, refer to the voice interaction method provided by the embodiment of the present application.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application and are not intended to limit them. Within the idea of the present application, the technical features in the above embodiments or in different embodiments may also be combined, the steps may be carried out in any order, and there are many other variations of the various aspects of the present application as described above, which are not provided in detail for the sake of brevity. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims

1. A voice interaction method, applied to an intelligent terminal, characterized by comprising:

detecting noise information of a current interaction environment when a voice interaction instruction is received, where the noise information includes a noise volume and a noise frequency;

determining, according to the noise frequency, a main frequency for synthesizing a response voice corresponding to the voice interaction instruction;

synthesizing the response voice based on the main frequency;

determining a volume of playing the response voice according to the noise volume, the noise frequency, and the main frequency of the response voice; and

playing the response voice at the determined volume.

2. The voice interaction method according to claim 1, wherein the determining, according to the noise frequency, a main frequency for synthesizing a response voice corresponding to the voice interaction instruction comprises:

determining a critical frequency band in which the noise frequency is located; and

determining, according to the critical frequency band, a main frequency for synthesizing a response voice corresponding to the voice interaction instruction.

3. The voice interaction method according to claim 1, wherein the determining the volume of playing the response voice according to the noise volume, the noise frequency, and the main frequency of the response voice comprises:

determining a masking amount according to the noise frequency and the main frequency of the response voice; and

determining the volume of playing the response voice based on the noise volume and the masking amount.

4. The voice interaction method according to claim 3, wherein the determining the masking amount according to the noise frequency and the main frequency of the response voice comprises:

if the noise frequency is lower than the main frequency of the response voice, determining that the masking amount is a first masking amount;

if the noise frequency is higher than the main frequency of the response voice, determining that the masking amount is a second masking amount;

wherein the first masking amount is greater than the second masking amount.

5. The voice interaction method according to any one of claims 1 to 4, wherein after the step of playing the response voice at the determined volume, the method further comprises:

obtaining interactive experience feedback information; and

adjusting the volume of playing the response voice according to the interactive experience feedback information.

6. The voice interaction method according to any one of claims 1 to 4, wherein when the voice interaction instruction is triggered by voice control information, the playing the response voice at the determined volume includes:

obtaining a time node at which the voice interaction instruction triggered by the voice control information is received; and

playing the response voice at the determined volume after a preset duration has elapsed from the time node.

7. A voice interaction device, running on a smart terminal, comprising:

a noise detecting unit, configured to detect noise information of a current interaction environment when a voice interaction instruction is received, where the noise information includes a noise volume and a noise frequency;

a main frequency determining unit, configured to determine, according to the noise frequency, a main frequency for synthesizing a response voice corresponding to the voice interaction instruction;

a speech synthesis unit, configured to synthesize the response voice based on the main frequency;

a volume determining unit, configured to determine, according to the noise volume, the noise frequency, and the main frequency of the response voice, a volume of playing the response voice; and

a playing unit, configured to play the response voice at the determined volume.

8. The voice interaction device according to claim 7, wherein the main frequency determining unit is specifically configured to:

determine a critical frequency band in which the noise frequency is located; and

determine, according to the critical frequency band, a main frequency for synthesizing a response voice corresponding to the voice interaction instruction.

9. The voice interaction device according to claim 7, wherein the volume determining unit comprises:

a masking amount determining module, configured to determine a masking amount according to the noise frequency and the main frequency of the response voice; and

a volume determining module, configured to determine the volume of playing the response voice according to the noise volume and the masking amount.

10. The voice interaction device according to claim 9, wherein the masking amount determining module is specifically configured to:

if the noise frequency is lower than the main frequency of the response voice, determine that the masking amount is a first masking amount;

if the noise frequency is higher than the main frequency of the response voice, determine that the masking amount is a second masking amount;

wherein the first masking amount is greater than the second masking amount.

11. The voice interaction device according to any one of claims 7 to 10, wherein the voice interaction device further comprises:

a feedback unit, configured to obtain interaction experience feedback information; and

a volume adjustment unit, configured to adjust the volume of playing the response voice according to the interaction experience feedback information.

12. The voice interaction device according to any one of claims 7 to 10, wherein when the voice interaction instruction is triggered by voice control information, the playing unit is specifically configured to:

obtain a time node at which the voice interaction instruction triggered by the voice control information is received; and

play the response voice at the determined volume after a preset duration has elapsed from the time node.

13. An intelligent terminal, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.

14. A non-transitory computer readable storage medium, characterized in that the non-transitory computer readable storage medium stores computer executable instructions for causing a smart terminal to perform the method of any one of claims 1-6.

15. A computer program product, comprising: a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a smart terminal, cause the smart terminal to perform the method of any one of claims 1-6.