
WO2019127112A1 - Voice interaction method and device and intelligent terminal - Google Patents

Info

Publication number
WO2019127112A1
WO2019127112A1 PCT/CN2017/119039 CN2017119039W WO2019127112A1 WO 2019127112 A1 WO2019127112 A1 WO 2019127112A1 CN 2017119039 W CN2017119039 W CN 2017119039W WO 2019127112 A1 WO2019127112 A1 WO 2019127112A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
noise
frequency
volume
response
Prior art date
Application number
PCT/CN2017/119039
Other languages
French (fr)
Chinese (zh)
Inventor
张含波
Original Assignee
深圳前海达闼云端智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳前海达闼云端智能科技有限公司 filed Critical 深圳前海达闼云端智能科技有限公司
Priority to PCT/CN2017/119039 priority Critical patent/WO2019127112A1/en
Priority to CN201780003279.0A priority patent/CN108369805B/en
Publication of WO2019127112A1 publication Critical patent/WO2019127112A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/0335 Pitch control
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0324 Details of processing therefor
    • G10L21/034 Automatic adjustment
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225 Feedback of the input speech

Definitions

  • the present application relates to the field of artificial intelligence technologies, and in particular, to a voice interaction method, device, and smart terminal.
  • intelligent terminals such as intelligent robots, smart homes, smart phones, smart home appliances, and smart car devices have been favored by more and more users, and people's lives have gradually entered the era of artificial intelligence.
  • many intelligent terminals are configured with a voice interaction function, which can make a voice response to the user.
  • the smart terminal may generate a response text according to the voice interaction instruction, perform text-to-speech (TTS) conversion on the response text to synthesize a response voice, and finally play the synthesized response voice to the user.
  • the inventor has found that, when producing sound from the response text, current smart terminals basically synthesize the response voice at a preset frequency and play it at a fixed volume, without considering the noise condition of the interaction environment. As a result, the response voice sometimes sounds too quiet for the user to make out its content, and sometimes sounds so loud that it does not suit the atmosphere and may even startle the user.
  • a response voice that the user perceives as too loud or too quiet detracts from a friendly user experience.
  • the embodiment of the present invention provides a voice interaction method, device, and intelligent terminal, which can solve the problem that the existing human-computer interaction experience is greatly affected by the noise condition of the interaction environment, which is not conducive to improving the user experience.
  • the embodiment of the present application provides a voice interaction method, which is applied to a smart terminal, and the method includes:
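  • when a voice interaction instruction is received, detecting noise information of the current interaction environment, where the noise information includes a noise volume and a noise frequency;
  • determining, according to the noise frequency, a main frequency for synthesizing a response voice corresponding to the voice interaction instruction;
  • synthesizing the response voice based on the main frequency;
  • determining, according to the noise volume, the noise frequency, and the main frequency of the response voice, a volume for playing the response voice; and
  • playing the response voice at the determined volume.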
  • the embodiment of the present application provides a voice interaction device, which is implemented in an intelligent terminal, and includes:
  • a noise detecting unit configured to detect noise information of a current interaction environment when the voice interaction instruction is received, where the noise information includes a noise volume and a noise frequency;
  • a main frequency determining unit configured to determine, according to the noise frequency, a main frequency for synthesizing a response voice corresponding to the voice interaction instruction;
  • a speech synthesis unit configured to synthesize the response voice based on the main frequency;
  • a volume determining unit configured to determine, according to the noise volume, the noise frequency, and the main frequency of the response voice, a volume for playing the response voice; and
  • a playing unit configured to play the answering voice at the determined volume.
  • an intelligent terminal including:
  • at least one processor; and
  • a memory, where the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the voice interaction method described above.
  • an embodiment of the present application provides a non-transitory computer readable storage medium, where the non-transitory computer readable storage medium stores computer executable instructions for causing a smart terminal to execute the voice interaction method described above.
  • the embodiment of the present application further provides a computer program product, the computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by the smart terminal, cause the smart terminal to perform the voice interaction method described above.
  • with the voice interaction method, device, and intelligent terminal provided by the embodiments of the present application, noise information of the current interaction environment, including a noise volume and a noise frequency, is detected when a voice interaction instruction is received; a main frequency for synthesizing the response voice corresponding to the voice interaction instruction is determined according to the noise frequency; the response voice is synthesized based on the main frequency; the volume for playing the response voice is determined according to the noise volume, the noise frequency, and the main frequency of the response voice; and the response voice is finally played at the determined volume. Based on the masking effect of sound, the main frequency and playback volume of the response voice can thus be dynamically adjusted according to the noise information of the current interaction environment, so that the user gets a better voice interaction experience in any interaction environment.
  • FIG. 1 is a schematic diagram of one application environment of a voice interaction method provided by an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a voice interaction method according to an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of another voice interaction method provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a voice interaction apparatus according to an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of an intelligent terminal according to an embodiment of the present application.
  • consider a smart terminal such as a robot placed in a shopping mall: when the mall is busy, the interaction environment of the smart terminal is relatively noisy, and a user interacting with it by voice perceives its voice as quiet and often cannot hear its response; when the mall has few visitors, the interaction environment is relatively quiet, and a user interacting with it by voice perceives its voice as loud, which can make the user uncomfortable or startled.
  • the inventor found that this is mainly because human hearing is generally affected by the "masking effect" of sound: when a person listens to a sound in a quiet environment, it can be heard even if its volume is small; however, if another sound (the masking sound) is present at the same time, it impairs the ear's perception of the first sound.
  • in that case, the volume of the first sound must be raised before the ear can hear it; that is, the ear's hearing threshold for this sound is raised, and the number of decibels by which the threshold is raised is called the "masking amount".
  • the masking effect of one sound (the masking sound) on another (the sound being listened to) depends on many factors, chiefly the relative intensity and frequency structure of the two sounds.
  • the embodiment of the present application provides a voice interaction method, a voice interaction device, an intelligent terminal, a non-transitory computer readable storage medium, and a computer program product.
  • the voice interaction method provided by the embodiment of the present application is based on the masking effect of sound and dynamically adjusts the main frequency and playback volume of the response voice emitted by the smart terminal according to the noise information of the current interaction environment. Specifically: when a voice interaction instruction is received, noise information of the current interaction environment is detected, where the noise information includes a noise volume and a noise frequency; a main frequency for synthesizing the response voice corresponding to the voice interaction instruction is then determined according to the noise frequency, and the response voice is synthesized based on that main frequency; the volume for playing the response voice is determined according to the noise volume, the noise frequency, and the main frequency of the response voice; and finally the response voice is played at the determined volume.
  • in this way, the main frequency of the synthesized response voice and its playback volume can be dynamically adjusted to the noise conditions of different interaction environments, so that the user can hear the response of the smart terminal in any interaction environment without being startled by an excessively loud sound, and thus gets a better voice interaction experience in any interaction environment.
  • the voice interaction device provided by the embodiment of the present application is a virtual device composed of software programs that implements the voice interaction method provided by the embodiment of the present application; it is based on the same inventive concept as the method and has the same technical features and beneficial effects.
  • the smart terminal provided by the embodiment of the present application may be any type of electronic device, such as a robot, a smart phone, a personal computer, a tablet computer, a wearable smart device, a smart home appliance, and the like.
  • the smart terminal can perform the voice interaction method provided by the embodiment of the present application, or run the voice interaction device provided by the embodiment of the present application.
  • FIG. 1 is a schematic diagram of one application environment of a voice interaction method provided by an embodiment of the present application.
  • the location of the application environment may be fixed.
  • the location of the application environment may be a mall or an outdoor location; or the location of the application environment may be variable.
  • the embodiment does not specifically limit this.
  • the application environment may include the user 10 and the smart terminal 20.
  • the user 10 can be any object capable of voice interaction with the smart terminal 20 (i.e., an "interactive object" of the smart terminal 20), and can interact with the smart terminal 20 through one or more user interaction devices of any suitable type (such as a mouse, keyboard, remote controller, touch screen, somatosensory camera, or audio collection device) to input commands or control the smart terminal 20 to perform one or more operations.
  • the smart terminal 20 can be any suitable type of electronic device having certain logical computing capabilities and providing one or more functions capable of satisfying the user's intention. For example, robots, personal computers, tablets, smart phones, wearable smart devices, and the like.
  • the smart terminal 20 can include any suitable type of storage medium for storing data, such as a magnetic disk, a compact disc (CD-ROM), a read only memory or a random access memory.
  • the smart terminal 20 may also include one or more logical computing modules that perform any suitable type of function or operation in parallel, such as receiving interactive instructions, synthesizing responsive speech for interaction, and the like, in a single thread or multiple threads.
  • the logical computing module may be any suitable type of electronic circuit or chip-type electronic device capable of performing logical operations, such as a single-core processor, a multi-core processor, or an audio processor.
  • the user 10 can perform voice interaction with the smart terminal 20 in any suitable manner.
  • for example, the user 10 can input a voice interaction instruction to the smart terminal 20 through an interactive device or mode such as a mouse, keyboard, touch screen, or somatosensory operation; when the smart terminal 20 receives the voice interaction instruction, it can use the voice interaction method provided by the embodiment of the present application to make a voice response to the user 10.
  • as another example, the user 10 can input voice control information to the smart terminal 20 through the voice collection device of the smart terminal 20; after parsing the voice control information, the smart terminal 20 obtains the corresponding voice interaction instruction and, based on that instruction, uses the voice interaction method provided by the embodiment of the present application to make a voice response to the user 10.
  • when the smart terminal 20 receives a voice interaction instruction, for example, when it receives the voice control information "How long do I have to wait for number 25?" input by the user 10, or when it receives the voice interaction instruction "ranking query" input by the user 10 on its touch screen, the smart terminal 20 may first detect noise information of the current interaction environment (i.e., the environment in which the user 10 is currently interacting with the smart terminal 20), where the noise information includes a noise volume and a noise frequency. It then determines, according to the noise frequency, a main frequency for synthesizing the response voice corresponding to the voice interaction instruction, and synthesizes the response voice based on that main frequency; for example, it synthesizes a response voice with a specific main frequency whose content is "You still have to wait 30 minutes". Next, it determines the volume for playing the response voice according to the noise volume, the noise frequency, and the main frequency of the response voice. Finally, it plays the response voice at the determined volume.
  • the voice interaction method provided by the embodiment of the present application may be further extended to other suitable application environments, and is not limited to the application environment shown in FIG. 1.
  • the application environment may further include more or fewer users and smart terminals.
  • FIG. 2 is a schematic flowchart of a voice interaction method according to an embodiment of the present application, and the method may be performed by any type of smart terminal as described above.
  • the method may include but is not limited to the following steps:
  • Step 110 When receiving the voice interaction instruction, detecting noise information of the current interaction environment, where the noise information includes a noise volume and a noise frequency.
  • the "voice interaction command” refers to an instruction capable of instructing the smart terminal to make a specific voice response.
  • the intelligent terminal can make different voice responses for different voice interaction instructions.
  • the voice interaction instruction may be triggered by the control information input by the user to the smart terminal.
  • the control information may include, but is not limited to, touch control information and voice control information, depending on the manner of interaction.
  • for example, the user can input the touch control information "query the location of store A" through the touch screen of the smart terminal to instruct the smart terminal to give "the specific location of store A" by voice; as another example, the user can input the voice control information "Where is store A?" through the sound collection device (for example, a microphone) of the smart terminal to instruct the smart terminal to give "the specific location of store A" by voice.
  • the voice interaction command may also be automatically triggered by the smart terminal itself under the preset condition.
  • for example, if the smart terminal is a welcome robot, when it detects that a customer is approaching, it can automatically trigger a voice interaction instruction that makes the welcome robot greet the customer with a "welcome" voice response.
  • as another example, if the smart terminal is a sweeping robot, when its driving wheel becomes entangled, it can automatically trigger a voice interaction instruction that makes the sweeping robot issue the voice prompt "the driving wheel is entangled, please check" to alert the user to the robot's current entangled state.
  • the “current interaction environment” refers to an environment in which the smart terminal interacts with the user when receiving the voice interaction instruction;
  • the "noise information" refers to sound information in the interaction environment that is unrelated to the interaction content; the noise information includes a noise volume and a noise frequency.
  • the “noise volume” is the intensity/loudness of the noise
  • the “noise frequency” is the main frequency component in the noise.
  • the smart terminal may receive the corresponding voice interaction instruction.
  • the intelligent terminal needs to first detect the noise of the current interactive environment, obtain the noise volume and the noise frequency of the current interactive environment according to the acoustic features in the noise, and then perform the following step 120.
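The text does not say how the noise volume and noise frequency are extracted from the captured sound. As a minimal sketch, assuming the noise is available as a short frame of samples in the range -1.0 to 1.0, the volume can be taken as the RMS level in dB relative to full scale and the noise frequency as the strongest bin of a discrete Fourier transform. The function name `estimate_noise` and the synthetic test frame are illustrative assumptions, not part of the patent.

```python
import cmath
import math


def estimate_noise(samples, sample_rate):
    """Return (volume_db, dominant_frequency_hz) for a short noise frame."""
    n = len(samples)
    # Noise volume: RMS level in dB relative to full scale (amplitude 1.0).
    rms = math.sqrt(sum(s * s for s in samples) / n)
    volume_db = 20 * math.log10(max(rms, 1e-12))

    # Noise frequency: the bin with the largest magnitude in a naive DFT,
    # skipping the DC bin and scanning only below the Nyquist frequency.
    best_bin, best_mag = 1, 0.0
    for k in range(1, n // 2):
        acc = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                  for i, s in enumerate(samples))
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    return volume_db, best_bin * sample_rate / n


# Synthetic "noise": a 440 Hz tone of amplitude 0.5 sampled at 8 kHz.
frame = [0.5 * math.sin(2 * math.pi * 440 * i / 8000) for i in range(400)]
volume_db, noise_freq = estimate_noise(frame, 8000)
```

A real terminal would more likely use an FFT over microphone frames and a calibrated sound-pressure level; the naive DFT here only keeps the sketch dependency-free.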
  • Step 120 Determine, according to the noise frequency, a main frequency for synthesizing a response voice corresponding to the voice interaction instruction.
  • the “answering voice” refers to a voice response made by the smart terminal to the user, and the voice content in the response voice corresponds to the voice interaction instruction received by the smart terminal.
  • for example, if the voice interaction instruction received by the smart terminal instructs it to issue a prompt that "the drive wheel is entangled", the content of the corresponding response voice is "The drive wheel is entangled, please check".
  • as another example, if the voice interaction instruction received by the smart terminal instructs it to answer "Where is store A?" by voice, the content of the corresponding response voice may be "Store A is at the corner 50 meters ahead."
  • the "main frequency" is the main frequency component of the response speech.
  • the main frequency for synthesizing the response voice corresponding to the voice interaction instruction may be first determined according to the noise frequency.
  • in general, under the "frequency-domain masking effect", a low-frequency sound can mask a high-frequency sound; therefore, the main frequency for synthesizing the response voice corresponding to the voice interaction instruction can be set lower than the noise frequency.
  • the unit of the critical band is the Bark: when f ≤ 500 Hz, the critical-band value is approximately f/100 Bark; when f > 500 Hz, it is approximately 9 + 4·log2(f/1000) Bark, where f is the frequency in Hz.
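The Bark relation above can be sketched as a small function. This assumes the commonly quoted piecewise approximation, with a base-2 logarithm and f/1000 in the upper branch so that the two branches agree at 500 Hz; the function name `hz_to_bark` is illustrative.

```python
import math


def hz_to_bark(f_hz):
    """Approximate critical-band rate in Bark (assumed piecewise formula)."""
    if f_hz <= 500:
        return f_hz / 100
    # Above 500 Hz the scale becomes logarithmic in frequency.
    return 9 + 4 * math.log2(f_hz / 1000)
```

Under this approximation the Bark value rises by about 4 per octave above 500 Hz, which is why two sounds far apart in Hz can still be close on the critical-band scale.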
  • determining, according to the noise frequency, the main frequency for synthesizing the response voice corresponding to the voice interaction instruction may specifically be: first determining the critical band in which the noise frequency lies, and then determining, based on that critical band, the main frequency for synthesizing the response voice.
  • the critical frequency band at which the noise frequency is located may be determined by referring to a critical band table.
  • determining the main frequency based on the critical band may specifically be: choosing, as the main frequency for synthesizing the response voice, a frequency value in a critical band below the one containing the noise frequency, so that the main frequency is lower than the noise frequency and the critical band containing the main frequency is separated by a certain distance from the critical band containing the noise frequency. In this way, the low-frequency sound (the response voice) masks the high-frequency sound (the noise), and the two sounds are prevented from masking each other because their frequencies are too close.
  • for example, if the critical band in which the noise frequency lies is the fourth-level critical band (corresponding frequency range: 400 Hz to 510 Hz), the main frequency for synthesizing the response voice can be set to 250 Hz (which lies in the second-level critical band).
  • however, if the critical band in which the noise frequency lies belongs to the low-frequency range, for example the first-level critical band (corresponding frequency range: 100 Hz to 200 Hz), continuing to rely on a low-frequency sound masking a high-frequency sound may make it difficult to improve the user's auditory sensitivity to the voice played by the smart terminal (i.e., the response voice) and may give the user a poor listening experience. In that case, the main frequency for synthesizing the response voice may be set much higher than the noise frequency, for example 1000 Hz (which lies in the eighth-level critical band).
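The band-based selection described above can be sketched as follows. The band edges are the usual critical-band table, indexed so that level 1 is 100 Hz to 200 Hz and level 4 is 400 Hz to 510 Hz, matching the examples in the text. The two-level gap, the cut-off for "low-frequency" noise, and the use of the band centre are assumptions chosen only to reproduce those examples; the patent does not fix them.

```python
from bisect import bisect_right

# Lower edges (Hz) of the critical bands; list index i is the band "level".
BAND_EDGES = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
              1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700,
              9500, 12000, 15500]

LOW_BAND_LIMIT = 2       # hypothetical: levels <= 2 count as low-frequency noise
BAND_GAP = 2             # hypothetical: keep the voice two levels below the noise
HIGH_FALLBACK_HZ = 1000  # value used in the text's low-frequency example


def band_level(freq_hz):
    """Return the level of the critical band that contains freq_hz."""
    return bisect_right(BAND_EDGES, freq_hz) - 1


def choose_main_frequency(noise_freq_hz):
    """Pick a main frequency for the response voice from the noise band."""
    level = band_level(noise_freq_hz)
    if level <= LOW_BAND_LIMIT:
        # Noise is already low: go well above it instead of below it.
        return HIGH_FALLBACK_HZ
    target = level - BAND_GAP
    # Use the centre of the target band as the main frequency.
    return (BAND_EDGES[target] + BAND_EDGES[target + 1]) / 2
```

With these assumed constants, noise at 450 Hz (level 4) yields 250 Hz (level 2), and noise at 150 Hz (level 1) yields 1000 Hz (level 8), as in the two examples above.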
  • Step 130 Synthesize the response voice based on the primary frequency.
  • when the smart terminal receives the voice interaction instruction, it may first generate a response text according to the instruction, where the response text contains the voice content with which the smart terminal responds to the instruction; it then performs TTS (Text To Speech) conversion on the response text based on the main frequency determined in step 120, synthesizing a response voice that has the specific main frequency and corresponds to the received voice interaction instruction.
  • the mapping relationship between the voice interaction instruction and the response text may be established in the database of the smart terminal, so that when the smart terminal receives a voice interaction instruction, the corresponding response text may be queried. And further synthesizing the response voice corresponding to the voice interaction instruction based on the determined primary frequency (ie, performing TTS conversion on the response text corresponding to the voice interaction instruction based on the primary frequency).
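The mapping-then-synthesis flow can be sketched as below. The instruction keys, the `RESPONSE_TEXTS` table, and the `tts` callable are all hypothetical: the patent names no particular database schema or TTS engine, so the engine is passed in as a function that accepts a text and a target main frequency.

```python
# Hypothetical instruction-to-text table; a real terminal would query its database.
RESPONSE_TEXTS = {
    "query_store_a_location": "Store A is at the corner 50 meters ahead.",
    "drive_wheel_entangled": "The drive wheel is entangled, please check.",
}


def synthesize_response(instruction, main_frequency_hz, tts):
    """Look up the response text and pass it to a TTS callable with a pitch target."""
    text = RESPONSE_TEXTS[instruction]
    return tts(text, main_frequency_hz)


# Stand-in for a real TTS engine: it only records what it was asked to produce.
def fake_tts(text, pitch_hz):
    return {"text": text, "pitch_hz": pitch_hz}


voice = synthesize_response("drive_wheel_entangled", 250, fake_tts)
```

Injecting the engine as a parameter keeps the lookup logic testable without audio hardware.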
  • Step 140 Determine, according to the noise volume, the noise frequency, and the main frequency of the response voice, the volume of playing the response voice.
  • the masking effect of sound also depends on volume.
  • the louder one sound is, the more it masks another. Therefore, in this embodiment, the volume at which the smart terminal plays the response voice is adjusted dynamically so that the response voice masks the noise of the interaction environment, allowing the user to hear the response voice clearly in any noise environment.
  • the volume at which the response voice is played is also determined based on the noise volume, the noise frequency, and the dominant frequency of the response voice.
  • because different frequency relationships produce different masking effects (a low-frequency sound masks a high-frequency sound strongly, whereas a high-frequency sound masks a low-frequency sound only weakly), in this embodiment the masking amount may first be determined from the noise frequency and the main frequency of the response voice, and the volume for playing the response voice may then be determined from the noise volume and the masking amount.
  • determining the masking amount according to the noise frequency and the main frequency of the response voice may specifically be: if the noise frequency is lower than the main frequency of the response voice, the masking amount is a first masking amount; if the noise frequency is higher than the main frequency of the response voice, the masking amount is a second masking amount; the first masking amount is greater than the second masking amount.
  • the specific implementation manner of determining the volume of playing the response voice according to the noise volume and the masking amount may be: using the sum of the noise volume and the masking amount as the volume of the playback response voice.
  • alternatively, determining the volume for playing the response voice according to the noise volume, the noise frequency, and the main frequency of the response voice may be: first determining an adjustment coefficient according to the noise frequency and the main frequency of the response voice, and then taking the product of the noise volume and the adjustment coefficient as the volume for playing the response voice, where the adjustment coefficient is greater than one.
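Both volume rules can be sketched together. The patent only requires that the first masking amount exceed the second and that the adjustment coefficient exceed one, so every numeric constant below is a hypothetical placeholder, as is the choice to treat equal frequencies as the strong-masking case.

```python
FIRST_MASKING_DB = 10.0   # hypothetical: noise lower in frequency than the voice
SECOND_MASKING_DB = 5.0   # hypothetical: noise higher in frequency than the voice


def masking_amount(noise_freq_hz, main_freq_hz):
    """Return the masking amount in dB for the given frequency relationship."""
    # Equal frequencies are treated as the strong case; the text does not cover this.
    if noise_freq_hz <= main_freq_hz:
        return FIRST_MASKING_DB
    return SECOND_MASKING_DB


def playback_volume_sum(noise_volume_db, noise_freq_hz, main_freq_hz):
    """Rule 1: playback volume = noise volume + masking amount."""
    return noise_volume_db + masking_amount(noise_freq_hz, main_freq_hz)


def playback_volume_scaled(noise_volume_db, noise_freq_hz, main_freq_hz):
    """Rule 2: playback volume = noise volume * coefficient, coefficient > 1."""
    coefficient = 1.2 if noise_freq_hz <= main_freq_hz else 1.1  # hypothetical values
    return noise_volume_db * coefficient
```

For example, with 60 dB noise at 450 Hz and a 250 Hz response voice, the first rule gives 65 dB under these placeholder constants; the same noise level at 150 Hz with a 1000 Hz voice gives 70 dB.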
  • steps 120 through 140 may also be performed in combination.
  • a relationship comparison table as shown in Table 1 is established in advance according to the “masking effect”, and the mask corresponding to each noise information (including the critical frequency band where the noise frequency is located and the noise volume) can be determined by searching the relationship comparison table.
  • n in Table 1 may be a variable, which is determined according to the actually detected noise volume; and the data in Table 1 is also merely illustrative and is not intended to limit the embodiments of the present application.
  • when the noise information of the current interaction environment is detected, the critical band in which the noise frequency lies may be determined first; then, by querying Table 1, the main frequency corresponding to that critical band is obtained directly, the response voice is synthesized at that main frequency, and the volume at which to play the response voice is determined.
  • Step 150 Play the response voice at the determined volume.
  • the smart terminal can play the response voice at the determined volume through any sound-producing device, such as a loudspeaker.
  • because the main frequency of the response voice avoids the critical band in which the noise frequency lies, and the playback volume of the response voice is greater than the noise volume, the response voice can mask the noise, so that the user can clearly hear the response voice of the smart terminal in an interaction environment with any noise condition.
  • moreover, because the main frequency and playback volume of the response voice are determined from the noise information of the current interaction environment, the problem of startling the user with an excessively loud sound does not arise.
  • time domain masking includes lead masking and lag masking.
  • the main reason for the generation of time domain masking is that it takes a certain amount of time for the human brain to process the information.
  • lead masking is very short, lasting only 5 to 20 ms, whereas lag masking can last for 50 to 200 ms.
  • to avoid time-domain masking of the response voice played by the smart terminal, playing the response voice at the determined volume may specifically be: acquiring the time node at which the voice interaction instruction triggered by voice control information is received (i.e., the time node at which the user finishes the inquiry); and, once a preset duration has elapsed after that time node, playing the response voice at the determined volume.
  • the preset duration may be 200 ms.
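The delayed playback can be sketched as below. This is an illustrative sketch only: the function names and the injectable `clock`, `sleep`, and `play` callables are hypothetical, and only the 200 ms preset duration comes from the text.

```python
import time

LAG_MASKING_DELAY_S = 0.2  # preset duration from the text (200 ms)


def seconds_until_playback(instruction_end_s, now_s, delay_s=LAG_MASKING_DELAY_S):
    """Remaining wait before playback may start; zero if the delay has elapsed."""
    return max(0.0, instruction_end_s + delay_s - now_s)


def play_after_delay(instruction_end_s, play, clock=time.monotonic, sleep=time.sleep):
    """Wait until the preset duration after the time node, then call play().

    instruction_end_s must be read from the same clock that is passed in.
    """
    sleep(seconds_until_playback(instruction_end_s, clock()))
    play()
```

Measuring the wait from the recorded time node, rather than sleeping a fixed 200 ms, means time already spent synthesizing the response counts toward the delay.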
  • The voice interaction method provided by this embodiment detects the noise information of the current interaction environment, including the noise volume and the noise frequency, when a voice interaction instruction is received; determines, according to the noise frequency, a main frequency for synthesizing a response voice corresponding to the voice interaction instruction; synthesizes the response voice based on the main frequency; determines the volume of playing the response voice according to the noise volume, the noise frequency, and the main frequency of the response voice; and finally plays the response voice at the determined volume. Based on the masking effect of sound, the main frequency and playback volume of the response voice can thus be dynamically adjusted according to the noise information of the current interaction environment, so that the user obtains a better voice interaction experience in any interactive environment.
  • the method may include but is not limited to the following steps:
  • Step 210: When a voice interaction instruction is received, detect noise information of the current interaction environment, where the noise information includes a noise volume and a noise frequency.
  • Step 220: Determine, according to the noise frequency, a main frequency for synthesizing a response voice corresponding to the voice interaction instruction.
  • Step 230: Synthesize the response voice based on the main frequency.
  • Step 240: Determine, according to the noise volume, the noise frequency, and the main frequency of the response voice, the volume of playing the response voice.
  • Step 250: Play the response voice at the determined volume.
  • Step 260: Acquire interactive experience feedback information.
  • the “interactive experience feedback information” refers to the user's evaluation of the voice interaction experience, and is used to evaluate the voice interaction experience between the user and the smart terminal.
  • the interactive experience feedback information may include: the volume of the response voice is too large, the volume of the response voice is appropriate, or the volume of the response voice is too small.
  • the interactive experience feedback information may be input by the user to the smart terminal, for example, during the process of performing the voice interaction, or after the voice interaction is ended, the user inputs the interactive experience feedback information for the voice interaction experience.
  • the user experience is further improved.
  • the interactive experience feedback information may also be evaluated by the smart terminal in an appropriate manner, and then the interactive experience feedback information is obtained according to the evaluation result.
  • For example, the smart terminal can use the effect of the interaction between the user and the smart terminal to determine whether the user correctly understands the content of the response voice, or whether the user can clearly hear the response voice of the smart terminal during the interaction.
  • Step 270: Adjust the volume of playing the response voice according to the interactive experience feedback information.
  • The volume of playing the response voice is adjusted according to the interactive experience feedback information. For example, when the feedback "the volume of the response voice is too large" is obtained, the volume of playing the response voice is reduced; when the feedback "the volume of the response voice is appropriate" is obtained, the volume of playing the response voice is kept unchanged; and when the feedback "the volume of the response voice is too small" is obtained, the volume of playing the response voice is increased.
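The three feedback cases map naturally onto a small function. The sketch below is illustrative: the feedback labels and the 3 dB step size are assumptions, since the text does not specify how much the volume changes.

```python
def adjust_volume(volume_db, feedback, step_db=3.0):
    """Return the new playback volume for one item of feedback.

    The labels and the step size are hypothetical; the text only states the
    direction of the adjustment for each kind of feedback.
    """
    if feedback == "too_loud":
        return volume_db - step_db
    if feedback == "too_quiet":
        return volume_db + step_db
    if feedback == "appropriate":
        return volume_db
    raise ValueError(f"unknown feedback: {feedback!r}")
```

Raising on an unknown label, rather than silently keeping the volume, makes a mismatch between the feedback source and this function visible early.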
  • the interaction experience feedback information may be acquired in real time, so that the volume of playing the response voice may be adjusted in real time according to the interaction experience feedback information.
  • The interactive experience feedback information may also be obtained when the interaction process is completed. In that case, the smart terminal may use the feedback information to adjust the volume of playing the response voice, and/or the main frequency used to synthesize the response voice, the next time it performs voice interaction with the user.
  • Steps 210 to 250 have the same technical features as steps 110 to 150 in the voice interaction method shown in FIG. 2; for their specific implementation, refer to the corresponding descriptions of steps 110 to 150 in the foregoing embodiment, which are not repeated here.
  • The voice interaction method provided by this embodiment obtains the user's interactive experience feedback information after playing the response voice at the determined volume, and adjusts the volume of playing the response voice according to that feedback information, so that the voice interaction effect can be continuously improved for the characteristics of the interaction object, further improving the user experience.
  • FIG. 4 is a schematic structural diagram of a voice interaction apparatus according to an embodiment of the present disclosure.
  • the apparatus 40 can be implemented on a smart terminal configured with a voice interaction function, and can implement the voice interaction method provided by the foregoing embodiment.
  • the apparatus 40 may include, but is not limited to, a noise detecting unit 41, a main frequency determining unit 42, a speech synthesizing unit 43, a volume determining unit 44, and a playing unit 45.
  • the noise detecting unit 41 is configured to detect noise information of a current interaction environment when the voice interaction instruction is received, where the noise information includes a noise volume and a noise frequency;
  • the main frequency determining unit 42 is configured to determine, according to the noise frequency, a main frequency for synthesizing a response voice corresponding to the voice interaction instruction;
  • the speech synthesis unit 43 is configured to synthesize the response voice based on the main frequency;
  • the volume determining unit 44 is configured to determine a volume for playing the response voice according to the noise volume, the noise frequency, and a main frequency of the response voice;
  • the playing unit 45 is configured to play the response voice at the determined volume.
  • In practice, the noise information of the current interaction environment, including the noise volume and the noise frequency, may first be detected by the noise detecting unit 41; the main frequency determining unit 42 then determines, according to the noise frequency, the main frequency for synthesizing the response voice corresponding to the voice interaction instruction, and the speech synthesis unit 43 synthesizes the response voice based on that main frequency; next, the volume determining unit 44 determines the volume of playing the response voice according to the noise volume, the noise frequency, and the main frequency of the response voice; finally, the playing unit 45 plays the response voice at the determined volume.
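The order in which units 41 to 45 cooperate can be sketched as a single function. This is only a sketch: each unit is modeled as a plain callable passed in as a parameter, and the parameter names and signatures are hypothetical rather than taken from the application.

```python
def respond(response_text, detect_noise, determine_main_frequency,
            synthesize, determine_volume, play):
    """Run one voice response through the five units, in order.

    Each argument after response_text is a hypothetical callable standing in
    for one unit of the device; the volume that was used is returned.
    """
    noise_volume_db, noise_freq_hz = detect_noise()               # unit 41
    main_freq_hz = determine_main_frequency(noise_freq_hz)        # unit 42
    audio = synthesize(response_text, main_freq_hz)               # unit 43
    volume_db = determine_volume(noise_volume_db, noise_freq_hz,
                                 main_freq_hz)                    # unit 44
    play(audio, volume_db)                                        # unit 45
    return volume_db
```

Passing the units in as callables keeps the sketch independent of any particular microphone, synthesizer, or audio output.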
  • the main frequency determining unit 42 is specifically configured to: determine a critical frequency band in which the noise frequency is located; and determine, according to the critical frequency band, a main frequency used to synthesize a response voice corresponding to the voice interaction instruction.
  • the volume determining unit 44 includes a masking amount determining module 441 and a volume determining module 442.
  • The masking amount determining module 441 is configured to determine a masking amount according to the noise frequency and the main frequency of the response voice; the volume determining module 442 is configured to determine the volume of playing the response voice according to the noise volume and the masking amount. Specifically, in some embodiments, the masking amount determining module 441 is configured to: if the noise frequency is lower than the main frequency of the response voice, determine the masking amount to be a first masking amount; if the noise frequency is higher than the main frequency of the response voice, determine the masking amount to be a second masking amount; the first masking amount is greater than the second masking amount.
  • When the voice interaction instruction is triggered by voice control information, the playing unit 45 is specifically configured to: acquire the time node at which the voice interaction instruction triggered by the voice control information is received; and play the response voice at the determined volume after a preset duration has elapsed from that time node.
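The two modules can be sketched as a pair of functions. The decibel values below are placeholders, and adding the masking amount to the noise volume is an assumption: the text says only that the volume is determined from the noise volume and the masking amount, and that the first masking amount exceeds the second.

```python
def masking_amount(noise_freq_hz, main_freq_hz, first_db=10.0, second_db=5.0):
    """Module 441: pick the masking amount from the relative frequencies.

    first_db and second_db are hypothetical placeholder values; the only
    constraint taken from the text is first_db > second_db.
    """
    if first_db <= second_db:
        raise ValueError("the first masking amount must exceed the second")
    # Noise below the main frequency masks more strongly. Equal frequencies
    # should not occur, because the main frequency is chosen outside the
    # critical band of the noise.
    return first_db if noise_freq_hz < main_freq_hz else second_db


def playback_volume(noise_volume_db, noise_freq_hz, main_freq_hz):
    """Module 442: assumed here to be the noise volume plus the masking amount."""
    return noise_volume_db + masking_amount(noise_freq_hz, main_freq_hz)
```

With these placeholder values, 60 dB noise at 500 Hz and a 1500 Hz main frequency gives a 70 dB playback volume, while the same noise level at 3000 Hz gives 65 dB.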
  • the device 40 further includes: a feedback unit 46 and a volume adjustment unit 47.
  • the feedback unit 46 is configured to obtain interaction experience feedback information.
  • the volume adjustment unit 47 is configured to adjust the volume of playing the response voice according to the interaction experience feedback information.
  • When a voice interaction instruction is received, the voice interaction device provided by this embodiment detects the noise information of the current interaction environment, including the noise volume and the noise frequency, through the noise detecting unit 41; the main frequency determining unit 42 then determines, according to the noise frequency, the main frequency for synthesizing the response voice corresponding to the voice interaction instruction, and the speech synthesis unit 43 synthesizes the response voice based on that main frequency; next, the volume determining unit 44 determines the volume of playing the response voice according to the noise volume, the noise frequency, and the main frequency of the response voice; finally, the playing unit 45 plays the response voice at the determined volume. Based on the masking effect of sound, the main frequency and the playback volume of the response voice can thus be dynamically adjusted according to the noise information of the current interactive environment, so that the user obtains a better voice interaction experience in any interactive environment.
  • FIG. 5 is a schematic structural diagram of an intelligent terminal according to an embodiment of the present disclosure.
  • the smart terminal 500 can be any type of electronic device, such as a smart phone, a robot, a personal computer, a wearable smart device, or a smart home appliance, that is capable of executing the voice interaction method provided by the foregoing method embodiments or running the voice interaction device provided by the foregoing device embodiment.
  • the smart terminal 500 includes:
  • One or more processors 501 and a memory 502; one processor 501 is taken as an example in FIG. 5.
  • The processor 501 and the memory 502 may be connected by a bus or by other means; a bus connection is taken as an example in FIG. 5.
  • As a non-transitory computer readable storage medium, the memory 502 can be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as the program instructions/modules corresponding to the voice interaction method in the embodiments of the present application.
  • the processor 501 executes various functional applications and data processing of the voice interaction device 40 by running non-transitory software programs, instructions, and modules stored in the memory 502, i.e., implements the voice interaction method of any of the above method embodiments.
  • the memory 502 can include a storage program area and a storage data area, wherein the storage program area can store an operating system and an application required for at least one function, and the storage data area can store data created according to the use of the voice interaction device 40, and the like.
  • memory 502 can include high speed random access memory, and can also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device.
  • memory 502 can optionally include memory remotely located relative to processor 501, which can be connected to smart terminal 500 over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • The one or more modules are stored in the memory 502 and, when executed by the one or more processors 501, perform the voice interaction method in any of the above method embodiments, for example, performing method steps 110 to 150 in FIG. 2 and method steps 210 to 270 in FIG. 3 described above, and implementing the functions of units 41 to 47 in FIG. 4.
  • The embodiments of the present application further provide a non-transitory computer readable storage medium storing computer executable instructions. When executed by one or more processors, for example by the processor 501 in FIG. 5, the instructions cause the one or more processors to perform the voice interaction method in any of the foregoing method embodiments, for example, performing method steps 110 to 150 in FIG. 2 and method steps 210 to 270 in FIG. 3 described above, and implementing the functions of units 41 to 47 in FIG. 4.
  • The device embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place, or may be distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • the various embodiments can be implemented by means of software plus a general hardware platform, and of course, by hardware.
  • One of ordinary skill in the art can understand that all or part of the processes of the above method embodiments can be completed by a computer program in a computer program product; the computer program can be stored in a non-transitory computer readable storage medium and includes program instructions which, when executed by the smart terminal, cause the smart terminal to execute the processes of the foregoing method embodiments.
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).
  • The foregoing products can perform the voice interaction method provided by the embodiments of the present application, and have the corresponding functional modules and beneficial effects of executing that method. For technical details not described in detail in this embodiment, refer to the voice interaction method provided by the embodiments of the present application.

Abstract

Embodiments of the present invention provide a voice interaction method and device and an intelligent terminal. The method comprises: detecting noise information of a current interaction environment when receiving a voice interaction instruction, the noise information comprising noise volume and noise frequency; determining, according to the noise frequency, a main frequency used for synthesizing a response voice corresponding to the voice interaction instruction; synthesizing the response voice according to the main frequency; determining the volume at which the response voice is played according to the noise volume, the noise frequency, and the main frequency of the response voice; and playing the response voice at the determined volume. By means of the technical solution, the embodiments of the present application can dynamically adjust the main frequency of the response voice and the playback volume according to noise information on the current interaction environment based on the sound masking effect such that a user can obtain better voice interaction experience under any interaction environment.

Description

Voice interaction method, device and intelligent terminal
Technical field
The present application relates to the field of artificial intelligence technologies, and in particular, to a voice interaction method, a device, and a smart terminal.
Background
With the continuous development of artificial intelligence technology, smart terminals such as intelligent robots, smart home devices, smartphones, smart home appliances, and smart in-vehicle devices are favored by more and more users, and people's lives have gradually entered the era of artificial intelligence.
To make them convenient to use, many smart terminals are configured with a voice interaction function and can respond to the user by voice. Generally, when receiving a voice interaction instruction, a smart terminal generates a response text according to the instruction, performs text-to-speech (TTS) conversion on the response text to synthesize a response voice, and finally plays the synthesized response voice to the user.
In the process of implementing the present application, the inventor found that when current smart terminals produce speech from a response text, they generally synthesize the response voice at a preset frequency and play it at a fixed volume, without considering the noise condition of the interactive environment. As a result, the user sometimes hears the response voice at a low volume and cannot make out the content of the conversation; at other times the user hears the response voice at a high volume that does not suit the atmosphere, and may even be startled. During voice interaction, a response voice that the user hears as too loud or too quiet is detrimental to a friendly user experience.
Therefore, existing voice interaction technology still needs to be improved and developed.
Summary of the invention
The embodiments of the present application provide a voice interaction method, a device, and a smart terminal, which can solve the problem that the existing human-computer interaction experience is strongly affected by the noise condition of the interactive environment, which hinders improvement of the user experience.
To solve the above technical problem, the embodiments of the present application provide the following technical solutions:
In a first aspect, an embodiment of the present application provides a voice interaction method applied to a smart terminal, the method including:
when a voice interaction instruction is received, detecting noise information of the current interaction environment, where the noise information includes a noise volume and a noise frequency;
determining, according to the noise frequency, a main frequency for synthesizing a response voice corresponding to the voice interaction instruction;
synthesizing the response voice based on the main frequency;
determining the volume of playing the response voice according to the noise volume, the noise frequency, and the main frequency of the response voice;
playing the response voice at the determined volume.
In a second aspect, an embodiment of the present application provides a voice interaction device running on a smart terminal, including:
a noise detecting unit, configured to detect noise information of the current interaction environment when a voice interaction instruction is received, where the noise information includes a noise volume and a noise frequency;
a main frequency determining unit, configured to determine, according to the noise frequency, a main frequency for synthesizing a response voice corresponding to the voice interaction instruction;
a speech synthesis unit, configured to synthesize the response voice based on the main frequency;
a volume determining unit, configured to determine the volume of playing the response voice according to the noise volume, the noise frequency, and the main frequency of the response voice;
a playing unit, configured to play the response voice at the determined volume.
In a third aspect, an embodiment of the present application provides a smart terminal, including:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the voice interaction method described above.
In a fourth aspect, an embodiment of the present application provides a non-transitory computer readable storage medium storing computer executable instructions for causing a smart terminal to perform the voice interaction method described above.
In a fifth aspect, an embodiment of the present application further provides a computer program product including a computer program stored on a non-transitory computer readable storage medium, the computer program including program instructions which, when executed by a smart terminal, cause the smart terminal to perform the voice interaction method described above.
The beneficial effects of the embodiments of the present application are as follows. When a voice interaction instruction is received, the voice interaction method, device, and smart terminal provided by the embodiments detect the noise information of the current interaction environment, including the noise volume and the noise frequency; determine, according to the noise frequency, a main frequency for synthesizing a response voice corresponding to the voice interaction instruction; synthesize the response voice based on the main frequency; determine the volume of playing the response voice according to the noise volume, the noise frequency, and the main frequency of the response voice; and finally play the response voice at the determined volume. Based on the masking effect of sound, the main frequency and playback volume of the response voice can thus be dynamically adjusted according to the noise information of the current interactive environment, so that the user obtains a better voice interaction experience in any interactive environment.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of one application environment of the voice interaction method provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of a voice interaction method provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of another voice interaction method provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a voice interaction device provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a smart terminal provided by an embodiment of the present application.
Detailed description
To make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the present application and are not intended to limit it.
It should be noted that, provided there is no conflict, the features in the embodiments of the present application may be combined with one another, and all such combinations fall within the protection scope of the present application. In addition, although functional modules are divided in the device schematic and a logical order is shown in the flowchart, in some cases the steps shown or described may be performed with a module division different from that in the device, or in an order different from that in the flowchart. Furthermore, words such as "first", "second", and "third" used in the present application do not limit the data or the order of execution; they only distinguish identical or similar items whose functions and effects are substantially the same.
At present, most smart terminals synthesize the response voice at a specific frequency and play the synthesized response voice at a fixed volume during voice interaction, so the main frequency and the volume of the sound emitted by the smart terminal are fixed. However, when the smart terminal is in interactive environments with different noise conditions, the volume at which the user hears the smart terminal is sometimes too high and sometimes too low. For example, suppose a smart terminal, such as a robot, is located in a shopping mall. When the mall is crowded, the interactive environment of the smart terminal is noisy; a user interacting with the terminal by voice hears its sound as quiet and often cannot make out the response. When the mall has few visitors, the interactive environment is quiet; a user interacting with the terminal by voice hears its sound as loud, which can easily make the user uncomfortable or startled.
The inventor found that the main cause is that human hearing is generally affected by the "masking effect" of sound. When people listen to a sound in a quiet environment, they can hear it even if its volume is very low. If another sound (the masking sound) is present at the same time, however, it impairs the ear's perception of the first sound, and the volume of the first sound must be increased before it can be heard. In other words, the ear's hearing threshold for that sound is raised, and the number of decibels by which the threshold is raised is called the "masking amount". A large body of research shows that the masking effect of one sound (the masking sound) on another (the sound being listened to) depends on many factors, chiefly the relative intensity and the frequency structure of the two sounds.
基于此,本申请实施例提供了一种语音交互方法、一种语音交互装置、一种智能终端、一种非暂态计算机可读存储介质以及一种计算机程序产品。Based on this, the embodiment of the present application provides a voice interaction method, a voice interaction device, an intelligent terminal, a non-transitory computer readable storage medium, and a computer program product.
其中,本申请实施例提供的语音交互方法是一种基于声音的掩蔽效应,根据当前的交互环境的噪声信息动态调整智能终端发出的应答语音的主频率及其播放音量的方法,具体为:在接收到语音交互指令时,检测当前交互环境的噪声信息,所述噪声信息包括噪声音量和噪声频率,然后根据所述噪声频率确定用于合成与所述语音交互指令对应的应答语音的主频率,基于所述主频率合成所述应答语音,并根据所述噪声音量、所述噪声频率和所述应答语音的主频率确定播放所述应答语音的音量,最后以所确定的所述音量播放所述应答语音。从而,在本申请实施例中,能够对应不同交互环境的噪声状况动态调整所合成的应答语音的主频率及其播放音量,使得用户在任意交互环境下都能够听清智能终端的应答内容,并且,不会因为听到的声音过大而被吓到,从而使得用户在任意交互环境下都可以获得较好的语音交互体验。The voice interaction method provided by the embodiment of the present application is a voice-based masking effect, and the method for dynamically adjusting the main frequency of the response voice and the playback volume of the voice sent by the smart terminal according to the noise information of the current interaction environment, specifically: Receiving the voice interaction instruction, detecting noise information of the current interaction environment, where the noise information includes a noise volume and a noise frequency, and then determining, according to the noise frequency, a main frequency for synthesizing the response voice corresponding to the voice interaction instruction, And synthesizing the response voice based on the main frequency, and determining a volume for playing the response voice according to the noise volume, the noise frequency, and a main frequency of the response voice, and finally playing the determined volume at the determined volume Answer the voice. Therefore, in the embodiment of the present application, the main frequency of the synthesized response voice and the play volume thereof can be dynamically adjusted according to the noise condition of different interaction environments, so that the user can hear the response content of the smart terminal in any interaction environment, and It will not be scared because the sound heard is too large, so that users can get a better voice interaction experience in any interactive environment.
The voice interaction apparatus provided by the embodiments of the present application is a virtual apparatus composed of software programs that can implement the voice interaction method provided by the embodiments of the present application. It is based on the same inventive concept as that method and has the same technical features and beneficial effects.
The smart terminal provided by the embodiments of the present application may be any type of electronic device, such as a robot, a smartphone, a personal computer, a tablet computer, a wearable smart device, or a smart home appliance. The smart terminal can execute the voice interaction method provided by the embodiments of the present application, or run the voice interaction apparatus provided by the embodiments of the present application.
Specifically, the embodiments of the present application are further described below with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of one application environment of the voice interaction method provided by the embodiments of the present application. The location of the application environment may be fixed, for example inside a shopping mall or at an outdoor venue; alternatively, the location of the application environment may be variable. The embodiments of the present application do not specifically limit this.
Specifically, as shown in FIG. 1, the application environment may include a user 10 and a smart terminal 20.
The user 10 may be any object capable of voice interaction with the smart terminal 20 (that is, an "interaction object" of the smart terminal 20). The user 10 may interact with the smart terminal 20 through one or more user interaction devices of any suitable type (such as a mouse, a keyboard, a remote control, a touch screen, a somatosensory camera, or an audio capture device) to input instructions or to control the smart terminal 20 to perform one or more operations.
The smart terminal 20 may be any suitable type of electronic device that has a certain logical computing capability and provides one or more functions capable of satisfying the user's intent, for example a robot, a personal computer, a tablet computer, a smartphone, or a wearable smart device. The smart terminal 20 may include any suitable type of storage medium for storing data, such as a magnetic disk, a compact disc (CD-ROM), a read-only memory, or a random access memory. The smart terminal 20 may also include one or more logical operation modules that execute any suitable type of function or operation in a single thread or in multiple parallel threads, such as receiving interaction instructions or synthesizing response voices for interaction. A logical operation module may be any suitable type of electronic circuit or surface-mount electronic device capable of performing logical operations, such as a single-core processor, a multi-core processor, or an audio processor.
In practical applications, the user 10 may perform voice interaction with the smart terminal 20 in any suitable manner. For example, the user 10 may input a voice interaction instruction to the smart terminal 20 through an interaction device such as a mouse, a keyboard, a touch screen, or a somatosensory device; on receiving the voice interaction instruction, the smart terminal 20 may use the voice interaction method provided by the embodiments of the present application to respond to the user 10 by voice. As another example, the user 10 may input voice control information to the smart terminal 20 through the sound capture device of the smart terminal 20; the smart terminal 20 parses the voice control information to obtain the corresponding voice interaction instruction and then, based on that instruction, uses the voice interaction method provided by the embodiments of the present application to respond to the user 10 by voice.
Specifically, in the embodiments of the present application, when the smart terminal 20 receives a voice interaction instruction, for example when it receives the voice control information "How much longer is the wait for number 25?" spoken by the user 10, or when it receives the voice interaction instruction "queue position query" entered by the user 10 on its touch screen, the smart terminal 20 may first detect noise information of the current interaction environment (that is, the environment in which the user 10 is currently interacting with the smart terminal 20), the noise information including a noise volume and a noise frequency. It then determines, according to the noise frequency, a main frequency for synthesizing the response voice corresponding to the voice interaction instruction, and synthesizes the response voice based on the main frequency; for example, based on the noise frequency, it synthesizes for the above instruction a response voice that has a specific main frequency and whose content is "You still need to wait 30 minutes". Next, it determines the volume for playing the response voice according to the noise volume, the noise frequency, and the main frequency of the response voice. Finally, it plays the response voice at the determined volume.
It should be noted that the voice interaction method provided by the embodiments of the present application may be further extended to other suitable application environments and is not limited to the application environment shown in FIG. 1. Although FIG. 1 shows only three users 10 and two smart terminals 20, those skilled in the art will understand that in practice the application environment may include more or fewer users and smart terminals.
FIG. 2 is a schematic flowchart of a voice interaction method provided by an embodiment of the present application. The method may be executed by any type of smart terminal described above.
Specifically, referring to FIG. 2, the method may include, but is not limited to, the following steps:
Step 110: When a voice interaction instruction is received, detect noise information of the current interaction environment, the noise information including a noise volume and a noise frequency.
In this embodiment, a "voice interaction instruction" is an instruction that can direct the smart terminal to make a specific voice response. The smart terminal may make different voice responses to different voice interaction instructions.
The voice interaction instruction may be triggered by control information that the user inputs to the smart terminal. Depending on the interaction mode, the control information may include, but is not limited to, touch control information and voice control information. For example, the user may enter the touch control information "query the location of store A" on the touch screen of the smart terminal to direct the smart terminal to state "the specific location of store A" by voice; as another example, the user may speak the voice control information "Where is store A?" into the sound capture device (for example, a microphone) of the smart terminal to direct the smart terminal to state "the specific location of store A" by voice.
Alternatively, the voice interaction instruction may be triggered automatically by the smart terminal itself when a preset condition is met. For example, when a greeting robot detects a customer approaching, it may automatically trigger a voice interaction instruction that directs it to greet the customer with the voice response "Welcome". As another example, when the drive wheel of a sweeping robot becomes entangled, the robot may automatically trigger a voice interaction instruction that directs it to issue the voice prompt "The drive wheel is entangled, please check", informing the user of its current entangled state.
In this embodiment, the "current interaction environment" is the environment in which the smart terminal interacts with the user at the time the voice interaction instruction is received. The "noise information" is information about sound in that environment that is unrelated to the interaction content, and it includes a noise volume and a noise frequency. The "noise volume" is the intensity or loudness of the noise, and the "noise frequency" is the dominant frequency component of the noise.
Specifically, in this embodiment, when the user inputs control information to the smart terminal through any interaction mode, or when the smart terminal itself meets a preset condition, the smart terminal receives the corresponding voice interaction instruction. At that point the smart terminal first detects the noise of the current interaction environment, obtains the noise volume and the noise frequency from the acoustic features of the noise, and then performs step 120 below.
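The source does not say how the noise volume and noise frequency are extracted from the captured audio. As a minimal sketch of step 110, assuming the noise is available as a short list of samples normalized to the range -1.0 to 1.0, the volume can be approximated by the RMS level in decibels and the noise frequency by the strongest bin of a discrete Fourier transform. The function names and the full-scale reference are assumptions made for illustration only.

```python
import cmath
import math

def noise_volume_db(samples, ref=1.0):
    """RMS level of the samples in dB relative to `ref` (assumed full scale)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(max(rms, 1e-12) / ref)

def dominant_frequency(samples, sample_rate):
    """Frequency in Hz of the strongest non-DC bin of a naive DFT."""
    n = len(samples)
    best_bin, best_mag = 1, 0.0
    for k in range(1, n // 2):
        # Correlate the signal with a complex exponential at bin k.
        acc = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                  for i, s in enumerate(samples))
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    return best_bin * sample_rate / n
```

The naive DFT is quadratic in the number of samples, so it only suits short analysis windows; a real device would more likely use an FFT routine from its audio stack.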
Step 120: Determine, according to the noise frequency, a main frequency for synthesizing the response voice corresponding to the voice interaction instruction.
In this embodiment, the "response voice" is the voice response that the smart terminal makes to the user, and its content corresponds to the voice interaction instruction received by the smart terminal. For example, if the received voice interaction instruction directs the smart terminal to announce that the drive wheel is entangled, the content of the corresponding response voice is "The drive wheel is entangled, please check". As another example, if the received voice interaction instruction directs the smart terminal to answer by voice "Where is store A?", the content of the corresponding response voice may be "Store A is at the corner on the right, 50 meters ahead". The "main frequency" is the dominant frequency component of the response voice.
In this embodiment, based on the masking effect of sound, once the noise frequency of the current interaction environment has been detected, the main frequency for synthesizing the response voice corresponding to the voice interaction instruction is first determined according to that noise frequency. In general, in frequency-domain masking, low-frequency sound can mask high-frequency sound; the main frequency for synthesizing the response voice may therefore be set lower than the noise frequency.
Because the relationship between sound frequency and the masking curve is not linear in the masking effect, the concept of the "critical band" is generally introduced to measure sound frequency uniformly in perceptual terms: there are 24 critical bands in the range of 20 Hz to 16 kHz, the unit of the critical-band scale is the Bark, and 1 Bark equals the width of one critical band. When the frequency f < 500 Hz, the Bark value is approximately f/100; when f > 500 Hz, the Bark value is approximately 9 + 4·log2(f/1000). Therefore, in this embodiment, determining the main frequency for synthesizing the response voice according to the noise frequency may specifically be: determining the critical band in which the noise frequency lies, and then determining the main frequency for synthesizing the response voice according to that critical band. The critical band in which the noise frequency lies can be determined by consulting a critical band table.
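The critical band lookup described above can be sketched as a search over band edges. The edge values below are the commonly tabulated critical band limits, which end at 15.5 kHz; they are an assumption here, since the source does not reproduce its critical band table. Bands are numbered from zero so that 400 Hz to 510 Hz is band 4 and 100 Hz to 200 Hz is band 1, matching the examples given in this description.

```python
import bisect

# Assumed critical band edges in Hz (24 bands); not taken from the source.
BAND_EDGES_HZ = [20, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
                 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300,
                 6400, 7700, 9500, 12000, 15500]

def critical_band(freq_hz):
    """Zero-based index of the critical band that contains freq_hz."""
    if not BAND_EDGES_HZ[0] <= freq_hz < BAND_EDGES_HZ[-1]:
        raise ValueError("frequency outside the tabulated range")
    # bisect_right returns the position of the first edge above freq_hz.
    return bisect.bisect_right(BAND_EDGES_HZ, freq_hz) - 1
```

A table lookup is used rather than the closed-form Bark approximation because the approximation does not land exactly on the tabulated band boundaries.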
Furthermore, in the masking effect, the closer two sounds are in frequency, the more they mask each other; in addition, high-frequency sound is easily masked by low-frequency sound (especially when the low-frequency sound is loud), whereas low-frequency sound is hard to mask with high-frequency sound. Therefore, in this embodiment, determining the main frequency for synthesizing the response voice according to the critical band may specifically be: determining the main frequency to be a frequency value in a preceding (lower-numbered) critical band, so that the main frequency is lower than the noise frequency and the critical band containing the main frequency is separated by a certain distance from the critical band containing the noise frequency. In this way the low-frequency sound (the response voice) masks the high-frequency sound (the noise), while the two sounds are kept from masking each other because their frequencies are close. For example, assuming the noise frequency lies in the 4th critical band (which, according to the critical band table, corresponds to 400 Hz to 510 Hz), the main frequency for synthesizing the response voice may be determined to be 250 Hz (which lies in the 2nd critical band).
In addition, in some embodiments, the critical band in which the noise frequency lies may belong to the low-frequency range, for example the 1st critical band (corresponding to 100 Hz to 200 Hz). In that case, continuing to rely on low-frequency sound masking high-frequency sound to improve how well the user hears the voice played by the smart terminal (that is, the response voice) would be difficult and might give the user an unpleasant listening experience. The main frequency for synthesizing the response voice may then be set far above the noise frequency, for example 1000 Hz (which lies in the 8th critical band).
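The two selection rules above (go lower than the noise by some band distance, or jump well above it when the noise is already low-frequency) can be combined in a short sketch. The band gap of 2, the low-band limit of 1, the fallback band of 8, and the use of the band center as the main frequency are all assumptions chosen so that the sketch reproduces the two worked examples in this description; the source does not fix these values. The band edges are the same assumed table as before.

```python
import bisect

# Assumed critical band edges in Hz; zero-based band numbering.
BAND_EDGES_HZ = [20, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
                 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300,
                 6400, 7700, 9500, 12000, 15500]

LOW_BAND_LIMIT = 1      # assumed: bands 0 and 1 count as low-frequency noise
BAND_GAP = 2            # assumed distance between noise band and voice band
HIGH_FALLBACK_BAND = 8  # assumed voice band when the noise is low-frequency

def band_of(freq_hz):
    """Zero-based critical band index for freq_hz."""
    return bisect.bisect_right(BAND_EDGES_HZ, freq_hz) - 1

def band_center(band):
    """Midpoint in Hz of the given critical band."""
    return (BAND_EDGES_HZ[band] + BAND_EDGES_HZ[band + 1]) / 2

def choose_main_frequency(noise_freq_hz):
    """Pick a main frequency for the response voice from the noise frequency."""
    noise_band = band_of(noise_freq_hz)
    if noise_band <= LOW_BAND_LIMIT:
        # Low-frequency noise: move the voice far above the noise.
        return band_center(HIGH_FALLBACK_BAND)
    # Otherwise place the voice a few bands below the noise.
    return band_center(noise_band - BAND_GAP)
```

With these assumed constants, noise at 450 Hz (band 4) yields 250 Hz and noise at 150 Hz (band 1) yields 1000 Hz, as in the examples above.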
Step 130: Synthesize the response voice based on the main frequency.
In this embodiment, when the smart terminal receives a voice interaction instruction, it may first generate a response text according to the instruction, the response text containing the voice content with which the smart terminal responds to the instruction. It then performs TTS (Text To Speech) conversion on the response text based on the main frequency determined in step 120, synthesizing a response voice that has the specific main frequency and corresponds to the received voice interaction instruction.
In this embodiment, a mapping between voice interaction instructions and response texts may be established in a database of the smart terminal. When the smart terminal receives a voice interaction instruction, it can look up the corresponding response text and then synthesize the response voice corresponding to the instruction based on the determined main frequency (that is, perform TTS conversion on the response text corresponding to the instruction based on the main frequency).
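The mapping from instruction to response text can be sketched as a dictionary lookup that produces a request for a TTS engine. The instruction keys, the `pitch_hz` field, and the request shape are hypothetical; the source names no particular TTS engine or database, so the synthesis call itself is left out.

```python
# Hypothetical instruction keys; the response texts follow the examples
# given in this description.
RESPONSE_TEXTS = {
    "drive_wheel_entangled": "The drive wheel is entangled, please check.",
    "welcome": "Welcome.",
}

def build_tts_request(instruction, main_frequency_hz):
    """Look up the response text and pair it with the target main frequency."""
    text = RESPONSE_TEXTS.get(instruction)
    if text is None:
        raise KeyError(f"no response text for instruction {instruction!r}")
    # A TTS engine (not specified by the source) would consume this request.
    return {"text": text, "pitch_hz": main_frequency_hz}
```

Keeping the lookup separate from synthesis means the same response text can be re-synthesized at a different main frequency whenever the noise conditions change.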
Step 140: Determine the volume for playing the response voice according to the noise volume, the noise frequency, and the main frequency of the response voice.
The masking effect also depends on volume: the louder a sound is, the more it masks another sound. Therefore, in this embodiment, the volume at which the smart terminal plays the response voice is also adjusted dynamically so that the response voice masks the noise of the interaction environment and the user can hear the response voice clearly in any noise environment.
Accordingly, in this embodiment, after the response voice has been synthesized at the specific main frequency, the volume for playing it is determined according to the noise volume, the noise frequency, and the main frequency of the response voice. Different frequency relationships produce different masking effects: low-frequency sound masks high-frequency sound strongly, whereas high-frequency sound masks low-frequency sound only weakly. Therefore, in this embodiment, a masking amount may first be determined according to the noise frequency and the main frequency of the response voice, and the volume for playing the response voice may then be determined according to the noise volume and the masking amount.
Specifically, in this embodiment, because low-frequency sound masks high-frequency sound strongly while high-frequency sound masks low-frequency sound only weakly, the masking amount may be determined from the noise frequency and the main frequency of the response voice as follows: if the noise frequency is lower than the main frequency of the response voice, the masking amount is determined to be a first masking amount; if the noise frequency is higher than the main frequency of the response voice, the masking amount is determined to be a second masking amount; the first masking amount is greater than the second masking amount. Further, the volume for playing the response voice may be determined from the noise volume and the masking amount by taking the sum of the noise volume and the masking amount as the playback volume.
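The additive volume rule above can be sketched as follows. The two masking amounts (10 dB and 5 dB) are illustrative assumptions; the source only requires that the first be greater than the second. The source also leaves the equal-frequency case open, and this sketch treats it like the second case.

```python
FIRST_MASKING_DB = 10.0   # assumed value, used when noise is below the voice
SECOND_MASKING_DB = 5.0   # assumed value, must be smaller than the first

def masking_amount(noise_freq_hz, voice_freq_hz):
    """Masking amount in dB chosen from the frequency relationship."""
    if noise_freq_hz < voice_freq_hz:
        # Low-frequency noise masks the higher-pitched voice strongly,
        # so the voice needs the larger margin.
        return FIRST_MASKING_DB
    return SECOND_MASKING_DB

def playback_volume(noise_volume_db, noise_freq_hz, voice_freq_hz):
    """Playback volume = noise volume + masking amount."""
    return noise_volume_db + masking_amount(noise_freq_hz, voice_freq_hz)
```

For example, with 60 dB noise at 150 Hz and a 1000 Hz response voice the sketch yields 70 dB, while the same noise level at 450 Hz with a 250 Hz voice yields 65 dB.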
In addition, in other embodiments, the volume for playing the response voice may be determined from the noise volume, the noise frequency, and the main frequency of the response voice as follows: an adjustment coefficient is first determined according to the noise frequency and the main frequency of the response voice, and the product of the noise volume and the adjustment coefficient is then taken as the volume for playing the response voice, where the adjustment coefficient is greater than 1.
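The multiplicative variant can be sketched in the same way. The coefficient values 1.5 and 1.25 are assumptions for illustration; the source only states that the coefficient depends on the two frequencies and is greater than 1.

```python
def adjustment_coefficient(noise_freq_hz, voice_freq_hz):
    """Assumed coefficients; both are greater than 1 as the text requires."""
    if noise_freq_hz < voice_freq_hz:
        return 1.5   # noise masks the voice strongly, so scale up more
    return 1.25

def playback_volume_scaled(noise_volume, noise_freq_hz, voice_freq_hz):
    """Playback volume = noise volume x adjustment coefficient."""
    return noise_volume * adjustment_coefficient(noise_freq_hz, voice_freq_hz)
```

Whether a product is meaningful depends on the unit of the noise volume; the sketch simply follows the rule as stated and does not assume a particular scale.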
Furthermore, in still other embodiments, steps 120 to 140 may be performed together. Specifically, a relationship table such as Table 1 is established in advance according to the masking effect; by looking up this table, the masking amount, the main frequency for synthesizing the response voice, and the volume for playing the response voice can be determined for each item of noise information (including the critical band in which the noise frequency lies and the noise volume). The n in Table 1 may be a variable determined by the noise volume actually detected. The data in Table 1 are illustrative only and are not intended to limit the embodiments of the present application.
Table 1: Relationship table
[Table 1 appears as an image in the original publication: Figure PCTCN2017119039-appb-000001]
In this embodiment, when the noise volume and noise frequency of the current interaction environment are detected, the critical band in which the noise frequency lies may be determined first; Table 1 is then queried directly, the response voice is synthesized at the main frequency corresponding to that critical band, and the volume for playing the response voice is determined.
Step 150: Play the response voice at the determined volume.
In this embodiment, after determining the volume for playing the response voice, the smart terminal may play the response voice at that volume through any sound-producing device, such as a horn or a loudspeaker.
In this embodiment, because the main frequency of the response voice avoids the critical band in which the noise frequency lies, and the playback volume of the response voice is greater than the noise volume, the response voice can mask the noise, so that the user can clearly hear the response voice from the smart terminal in an interaction environment with any noise conditions. At the same time, because both the main frequency and the playback volume of the response voice are determined from the noise information of the current interaction environment, the user will not be startled by an excessively loud sound.
Further, in the masking effect of sound, masking occurs not only between simultaneous sounds but also between sounds that are adjacent in time; this is called "time-domain masking". Time-domain masking includes pre-masking and post-masking. It arises mainly because the human brain needs a certain amount of time to process information. In general, pre-masking is very short, only 5 to 20 ms, whereas post-masking can last 50 to 200 ms.
Based on this, in other embodiments, when the voice interaction instruction received by the smart terminal is triggered by voice control information input by the user, and in order to prevent the user's speech from causing time-domain masking of the response voice played by the smart terminal, playing the response voice at the determined volume specifically comprises: obtaining the time node at which the voice interaction instruction triggered by the voice control information was received (that is, the time at which the user finished speaking); and playing the response voice at the determined volume after a preset duration has elapsed from that time node. The preset duration may be 200 ms.
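The delayed playback described above amounts to computing how much of the preset duration is still left when the response voice is ready. A minimal sketch, assuming timestamps in seconds from a monotonic clock and hypothetical names:

```python
PRESET_DELAY_S = 0.2  # the 200 ms preset duration mentioned above

def remaining_wait(instruction_time_s, now_s, delay_s=PRESET_DELAY_S):
    """Seconds still to wait before playback may start (never negative)."""
    return max(0.0, instruction_time_s + delay_s - now_s)
```

A caller could pass the result to `time.sleep` with `time.monotonic()` as the clock before starting playback. If synthesis already took longer than the preset duration, the function returns 0.0 and playback can start immediately.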
As can be seen from the above technical solution, the beneficial effects of the embodiments of the present application are as follows. The voice interaction method provided by the embodiments of the present application detects noise information of the current interaction environment when a voice interaction instruction is received, the noise information including a noise volume and a noise frequency; determines, according to the noise frequency, a main frequency for synthesizing the response voice corresponding to the voice interaction instruction; synthesizes the response voice based on the main frequency; determines the volume for playing the response voice according to the noise volume, the noise frequency, and the main frequency of the response voice; and finally plays the response voice at the determined volume. Based on the masking effect of sound, the method thus dynamically adjusts the main frequency and playback volume of the response voice according to the noise information of the current interaction environment, so that the user obtains a good voice interaction experience in any interaction environment.
In addition, hearing sensitivity and personal habits differ from person to person, so adjusting the main frequency of the response voice and its playback volume by the same method may produce different voice interaction results for different users. The embodiments of the present application therefore further provide another voice interaction method.
Specifically, referring to FIG. 3, the method may include, but is not limited to, the following steps:
Step 210: When a voice interaction instruction is received, detect noise information of the current interaction environment, the noise information including a noise volume and a noise frequency.
Step 220: Determine, according to the noise frequency, a main frequency for synthesizing the response voice corresponding to the voice interaction instruction.
Step 230: Synthesize the response voice based on the main frequency.
Step 240: Determine the volume for playing the response voice according to the noise volume, the noise frequency, and the main frequency of the response voice.
Step 250: Play the response voice at the determined volume.
Step 260: Obtain interaction experience feedback information.
In this embodiment, the "interaction experience feedback information" is the user's evaluation of the voice interaction and is used to assess the voice interaction experience between the user and the smart terminal. For example, the interaction experience feedback information may indicate that the volume of the response voice is too high, that it is appropriate, or that it is too low.
In some embodiments, the interaction experience feedback information may be input to the smart terminal by the user. For example, during the voice interaction or after it ends, the user inputs feedback on that interaction so that the smart terminal can promptly adjust the volume at which it plays the response voice, further improving the user experience.
Alternatively, in other embodiments, the smart terminal may itself evaluate the voice interaction experience in a suitable manner and derive the interaction experience feedback information from the evaluation result. For example, the smart terminal may determine whether the user has clearly heard the response voice by evaluating the effect of the interaction between the user and the smart terminal, whether the user correctly understands the content of the response voice, or how the user's facial expression changes during the interaction.
步骤270:根据所述交互体验反馈信息调整播放所述应答语音的音量。Step 270: Adjust the volume of playing the response voice according to the interaction experience feedback information.
在本实施例中,在获取到交互体验反馈信息时,根据该交互体验反馈信息调整播放该应答语音的音量。比如,在获取到“应答语音的音量过大”的交互体验反馈信息时,降低播放该应答语音的音量;在获取到“应答语音的音量合适”的交互体验反馈信息时,维持播放该应答语音的音量不变;在获取到“应答语音的音量过小”的交互体验反馈信息时,增大播放该应答语音的音量。In this embodiment, when the interactive experience feedback information is acquired, the volume of playing the response voice is adjusted according to the interactive experience feedback information. For example, when the interactive experience feedback information of “the volume of the response voice is too large” is obtained, the volume of playing the response voice is reduced; and when the interactive experience feedback information “the volume of the response voice is appropriate” is acquired, the response voice is maintained. The volume of the response is unchanged; when the interactive experience feedback information of "the volume of the response voice is too small" is obtained, the volume of playing the response voice is increased.
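The three feedback cases of step 270 can be illustrated with a minimal Python sketch. The names `Feedback` and `adjust_volume`, the 3 dB step, and the 30 to 90 dB clamp are assumptions made for illustration only; the application does not specify step sizes or limits.

```python
from enum import Enum


class Feedback(Enum):
    """Hypothetical labels for the three feedback cases described in step 270."""
    TOO_LOUD = "too_loud"
    APPROPRIATE = "appropriate"
    TOO_QUIET = "too_quiet"


def adjust_volume(volume_db: float, feedback: Feedback, step_db: float = 3.0,
                  min_db: float = 30.0, max_db: float = 90.0) -> float:
    """Return the playback volume after applying one piece of feedback.

    step_db, min_db and max_db are illustrative assumptions, not values
    taken from the application.
    """
    if feedback is Feedback.TOO_LOUD:
        volume_db -= step_db      # reduce the playback volume
    elif feedback is Feedback.TOO_QUIET:
        volume_db += step_db      # increase the playback volume
    # Feedback.APPROPRIATE leaves the volume unchanged.
    return max(min_db, min(max_db, volume_db))
```

Clamping the result keeps repeated feedback from driving the volume outside a usable range; whether an actual terminal would do so is not stated in the text.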
It can be understood that, in this embodiment, the interactive experience feedback information may be acquired in real time, so that the volume at which the response voice is played can be adjusted in real time accordingly. Alternatively, the interactive experience feedback information may be acquired when the interaction is completed, in which case the smart terminal can, in its next voice interaction with the same user, adjust the volume at which the response voice is played and/or the main frequency used to synthesize the response voice according to that information.
It should be noted that steps 210 to 250 above have the same technical features as steps 110 to 150, respectively, of the voice interaction method shown in FIG. 2. For their specific implementation, refer to the corresponding description of steps 110 to 150 in the foregoing embodiment; the details are not repeated here.
As can be seen from the above technical solution, the beneficial effect of this embodiment of the present application is as follows: after playing the response voice at the determined volume, the voice interaction method acquires the user's interactive experience feedback information and adjusts the volume at which the response voice is played according to it, so that the voice interaction effect can be continuously improved to suit the characteristics of the interaction partner, further improving the user experience.
FIG. 4 is a schematic structural diagram of a voice interaction device provided by an embodiment of the present application. The device 40 can run on a smart terminal configured with a voice interaction function and can implement the voice interaction method provided by the foregoing embodiments.
Specifically, referring to FIG. 4, the device 40 may include, but is not limited to: a noise detection unit 41, a main frequency determination unit 42, a speech synthesis unit 43, a volume determination unit 44, and a playback unit 45.
The noise detection unit 41 is configured to detect noise information of the current interaction environment when a voice interaction instruction is received, the noise information including a noise volume and a noise frequency;
the main frequency determination unit 42 is configured to determine, according to the noise frequency, a main frequency for synthesizing a response voice corresponding to the voice interaction instruction;
the speech synthesis unit 43 is configured to synthesize the response voice based on the main frequency;
the volume determination unit 44 is configured to determine the volume at which the response voice is played according to the noise volume, the noise frequency, and the main frequency of the response voice;
the playback unit 45 is configured to play the response voice at the determined volume.
In practice, when a voice interaction instruction is received, the noise detection unit 41 first detects the noise information of the current interaction environment, the noise information including a noise volume and a noise frequency. The main frequency determination unit 42 then determines, according to the noise frequency, the main frequency for synthesizing the response voice corresponding to the voice interaction instruction, and the speech synthesis unit 43 synthesizes the response voice based on that main frequency. Next, the volume determination unit 44 determines the volume at which the response voice is played according to the noise volume, the noise frequency, and the main frequency of the response voice. Finally, the playback unit 45 plays the response voice at the determined volume.
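The order in which units 41 to 45 cooperate can be sketched as follows. This is only an illustrative outline: the class name `VoiceInteractionDevice`, the callable signatures, and the use of decibels and hertz are assumptions, and each unit is injected as a plain function so that no particular microphone, synthesizer, or audio API is implied.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class VoiceInteractionDevice:
    """Hypothetical wiring of units 41-45; every signature is an assumption."""
    detect_noise: Callable[[], Tuple[float, float]]           # unit 41 -> (noise_db, noise_hz)
    determine_main_frequency: Callable[[float], float]        # unit 42: noise_hz -> main_hz
    synthesize: Callable[[str, float], bytes]                 # unit 43: (text, main_hz) -> audio
    determine_volume: Callable[[float, float, float], float]  # unit 44 -> volume_db
    play: Callable[[bytes, float], None]                      # unit 45: (audio, volume_db)

    def respond(self, reply_text: str) -> float:
        """Run the five units in the order described above; return the volume used."""
        noise_db, noise_hz = self.detect_noise()
        main_hz = self.determine_main_frequency(noise_hz)
        audio = self.synthesize(reply_text, main_hz)
        volume_db = self.determine_volume(noise_db, noise_hz, main_hz)
        self.play(audio, volume_db)
        return volume_db
```

Passing the units in as callables mirrors the statement that the units need not be physically separate components.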
In some embodiments, the main frequency determination unit 42 is specifically configured to: determine the critical band in which the noise frequency lies; and determine, according to that critical band, the main frequency for synthesizing the response voice corresponding to the voice interaction instruction.
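One way to picture the critical-band step is the Python sketch below. It is not the applicant's implementation: it assumes the Bark scale as the critical-band model, uses the Zwicker and Terhardt approximation to map a frequency to a band index, and assumes a selection policy (pick the candidate main frequency whose band is farthest from the noise band) that the text above does not state. The names `bark_band` and `choose_main_frequency` are hypothetical.

```python
import math
from typing import Sequence


def bark_band(freq_hz: float) -> int:
    """Index of the Bark critical band containing freq_hz.

    Uses the Zwicker-Terhardt approximation; the Bark scale itself is an
    assumed model of the "critical band" named in the text.
    """
    z = 13.0 * math.atan(0.00076 * freq_hz) + 3.5 * math.atan((freq_hz / 7500.0) ** 2)
    return int(z)


def choose_main_frequency(noise_hz: float, candidates_hz: Sequence[float]) -> float:
    """Pick the candidate whose critical band is farthest from the noise band.

    The "farthest band" policy is an illustrative assumption.
    """
    noise_band = bark_band(noise_hz)
    return max(candidates_hz, key=lambda f: abs(bark_band(f) - noise_band))
```

For example, with noise near 1000 Hz and candidate main frequencies of 900 Hz, 1100 Hz, and 4000 Hz, this policy selects 4000 Hz, because the first two candidates fall in bands adjacent to the noise band.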
In some embodiments, the volume determination unit 44 includes a masking amount determination module 441 and a volume determination module 442.
The masking amount determination module 441 is configured to determine a masking amount according to the noise frequency and the main frequency of the response voice; the volume determination module 442 is configured to determine the volume at which the response voice is played according to the noise volume and the masking amount. Specifically, in some embodiments, the masking amount determination module 441 is configured to: if the noise frequency is lower than the main frequency of the response voice, determine the masking amount to be a first masking amount; if the noise frequency is higher than the main frequency of the response voice, determine the masking amount to be a second masking amount; the first masking amount being greater than the second masking amount.
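The rule applied by modules 441 and 442 can be sketched in Python as follows. The concrete values (10 dB and 5 dB), the treatment of equal frequencies, and the additive combination of noise volume and masking amount are all assumptions for illustration; the text only requires that the first masking amount exceed the second and that the volume depend on the noise volume and the masking amount.

```python
def masking_amount(noise_hz: float, main_hz: float,
                   first_db: float = 10.0, second_db: float = 5.0) -> float:
    """Masking amount chosen by comparing the noise and main frequencies.

    first_db and second_db are placeholder values; only first_db > second_db
    is taken from the text.
    """
    if first_db <= second_db:
        raise ValueError("the first masking amount must exceed the second")
    # Noise below the main frequency gets the larger (first) masking amount.
    # The equal-frequency case is not covered by the text; it is treated
    # here like the "higher" case as an assumption.
    return first_db if noise_hz < main_hz else second_db


def playback_volume(noise_db: float, noise_hz: float, main_hz: float) -> float:
    """Assumed combination: play above the noise volume by the masking amount."""
    return noise_db + masking_amount(noise_hz, main_hz)
```

Under these assumptions, 55 dB of noise at 500 Hz with a 2000 Hz main frequency yields a playback volume of 65 dB, while the same noise level at 4000 Hz yields 60 dB.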
In some embodiments, when the voice interaction instruction is triggered by voice control information, the playback unit 45 is specifically configured to: acquire the time node at which the voice interaction instruction triggered by the voice control information was received; and, after a preset duration has elapsed since that time node, play the response voice at the determined volume.
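The delayed playback behaviour of the playback unit 45 can be sketched as below. The function name `play_after_delay` and its parameters are hypothetical; the clock and sleep functions are injectable so the timing logic can be exercised without real waiting, and the default uses Python's standard `time.monotonic` and `time.sleep`.

```python
import time
from typing import Callable


def play_after_delay(received_at: float, delay_s: float, play: Callable[[], None],
                     now: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> None:
    """Call play() once delay_s seconds have passed since received_at.

    received_at is the time node at which the instruction was received,
    measured on the same clock as now(); delay_s is the preset duration.
    """
    remaining = received_at + delay_s - now()
    if remaining > 0:
        sleep(remaining)   # wait out the rest of the preset duration
    play()                 # here: play the response voice at the determined volume
```

If the preset duration has already elapsed by the time the response voice is ready, the sketch plays it immediately rather than waiting again.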
In some embodiments, the device 40 further includes a feedback unit 46 and a volume adjustment unit 47.
The feedback unit 46 is configured to acquire interactive experience feedback information;
the volume adjustment unit 47 is configured to adjust the volume at which the response voice is played according to the interactive experience feedback information.
It should be noted that, because the voice interaction device is based on the same inventive concept as the voice interaction method in the foregoing method embodiments, the corresponding content and beneficial effects of the method embodiments also apply to this device embodiment and are not described in detail here.
As can be seen from the above technical solution, the beneficial effect of this embodiment of the present application is as follows. When a voice interaction instruction is received, the noise detection unit 41 detects the noise information of the current interaction environment, the noise information including a noise volume and a noise frequency; the main frequency determination unit 42 determines, according to the noise frequency, the main frequency for synthesizing the response voice corresponding to the voice interaction instruction; the speech synthesis unit 43 synthesizes the response voice based on that main frequency; the volume determination unit 44 determines the volume at which the response voice is played according to the noise volume, the noise frequency, and the main frequency of the response voice; and the playback unit 45 plays the response voice at the determined volume. Based on the masking effect of sound, the device can thus dynamically adjust the main frequency and playback volume of its response voice according to the noise information of the current interaction environment, so that the user can obtain a good voice interaction experience in any interaction environment.
FIG. 5 is a schematic structural diagram of a smart terminal provided by an embodiment of the present application. The smart terminal 500 may be any type of electronic device, such as a smartphone, a robot, a personal computer, a wearable smart device, or a smart home appliance, and can execute the voice interaction method provided by the foregoing method embodiments or run the voice interaction device provided by the foregoing device embodiment.
Specifically, referring to FIG. 5, the smart terminal 500 includes:
one or more processors 501 and a memory 502; one processor 501 is taken as an example in FIG. 5.
The processor 501 and the memory 502 may be connected by a bus or by other means; a bus connection is taken as an example in FIG. 5.
The memory 502, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the voice interaction method in the embodiments of the present application (for example, the noise detection unit 41, main frequency determination unit 42, speech synthesis unit 43, volume determination unit 44, playback unit 45, feedback unit 46, and volume adjustment unit 47 shown in FIG. 4). By running the non-transitory software programs, instructions, and modules stored in the memory 502, the processor 501 executes the various functional applications and data processing of the voice interaction device 40, that is, implements the voice interaction method of any of the foregoing method embodiments.
The memory 502 may include a program storage area and a data storage area. The program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created through use of the voice interaction device 40, and the like. In addition, the memory 502 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 502 optionally includes memory located remotely from the processor 501, and such remote memory may be connected to the smart terminal 500 through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The one or more modules are stored in the memory 502 and, when executed by the one or more processors 501, perform the voice interaction method of any of the foregoing method embodiments, for example, performing method steps 110 to 150 in FIG. 2 and method steps 210 to 270 in FIG. 3 described above, and implementing the functions of units 41 to 47 in FIG. 4.
An embodiment of the present application further provides a non-transitory computer-readable storage medium storing computer-executable instructions. When executed by one or more processors, for example by the processor 501 in FIG. 5, the computer-executable instructions cause the one or more processors to perform the voice interaction method of any of the foregoing method embodiments, for example, performing method steps 110 to 150 in FIG. 2 and method steps 210 to 270 in FIG. 3 described above, and implementing the functions of units 41 to 47 in FIG. 4.
The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
From the description of the above implementations, a person of ordinary skill in the art can clearly understand that each implementation can be realized by software plus a general-purpose hardware platform, and of course also by hardware. A person of ordinary skill in the art can also understand that all or part of the processes of the above method embodiments can be completed by a computer program in a computer program product instructing the relevant hardware. The computer program may be stored in a non-transitory computer-readable storage medium and includes program instructions which, when executed by a smart terminal, cause the smart terminal to execute the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The above products (including the smart terminal, the non-transitory computer-readable storage medium, and the computer program product) can execute the voice interaction method provided by the embodiments of the present application and have the functional modules and beneficial effects corresponding to executing that method. For technical details not described exhaustively in this embodiment, refer to the voice interaction method provided by the embodiments of the present application.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Within the spirit of the present application, the technical features of the above embodiments or of different embodiments may also be combined, the steps may be carried out in any order, and there are many other variations of the different aspects of the present application as described above which, for brevity, are not provided in detail. Although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims (15)

  1. A voice interaction method, applied to a smart terminal, characterized by comprising:
    when a voice interaction instruction is received, detecting noise information of the current interaction environment, the noise information including a noise volume and a noise frequency;
    determining, according to the noise frequency, a main frequency for synthesizing a response voice corresponding to the voice interaction instruction;
    synthesizing the response voice based on the main frequency;
    determining a volume at which the response voice is played according to the noise volume, the noise frequency, and the main frequency of the response voice;
    playing the response voice at the determined volume.
  2. The voice interaction method according to claim 1, characterized in that determining, according to the noise frequency, the main frequency for synthesizing the response voice corresponding to the voice interaction instruction comprises:
    determining the critical band in which the noise frequency lies;
    determining, according to the critical band, the main frequency for synthesizing the response voice corresponding to the voice interaction instruction.
  3. The voice interaction method according to claim 1, characterized in that determining the volume at which the response voice is played according to the noise volume, the noise frequency, and the main frequency of the response voice comprises:
    determining a masking amount according to the noise frequency and the main frequency of the response voice;
    determining the volume at which the response voice is played according to the noise volume and the masking amount.
  4. The voice interaction method according to claim 3, characterized in that determining the masking amount according to the noise frequency and the main frequency of the response voice comprises:
    if the noise frequency is lower than the main frequency of the response voice, determining the masking amount to be a first masking amount;
    if the noise frequency is higher than the main frequency of the response voice, determining the masking amount to be a second masking amount;
    the first masking amount being greater than the second masking amount.
  5. The voice interaction method according to any one of claims 1 to 4, characterized in that, after the step of playing the response voice at the determined volume, the method further comprises:
    acquiring interactive experience feedback information;
    adjusting the volume at which the response voice is played according to the interactive experience feedback information.
  6. The voice interaction method according to any one of claims 1 to 4, characterized in that, when the voice interaction instruction is triggered by voice control information, playing the response voice at the determined volume comprises:
    acquiring the time node at which the voice interaction instruction triggered by the voice control information was received;
    after a preset duration has elapsed since the time node, playing the response voice at the determined volume.
  7. A voice interaction device, running on a smart terminal, characterized by comprising:
    a noise detection unit, configured to detect noise information of the current interaction environment when a voice interaction instruction is received, the noise information including a noise volume and a noise frequency;
    a main frequency determination unit, configured to determine, according to the noise frequency, a main frequency for synthesizing a response voice corresponding to the voice interaction instruction;
    a speech synthesis unit, configured to synthesize the response voice based on the main frequency;
    a volume determination unit, configured to determine a volume at which the response voice is played according to the noise volume, the noise frequency, and the main frequency of the response voice;
    a playback unit, configured to play the response voice at the determined volume.
  8. The voice interaction device according to claim 7, characterized in that the main frequency determination unit is specifically configured to:
    determine the critical band in which the noise frequency lies;
    determine, according to the critical band, the main frequency for synthesizing the response voice corresponding to the voice interaction instruction.
  9. The voice interaction device according to claim 7, characterized in that the volume determination unit comprises:
    a masking amount determination module, configured to determine a masking amount according to the noise frequency and the main frequency of the response voice;
    a volume determination module, configured to determine the volume at which the response voice is played according to the noise volume and the masking amount.
  10. The voice interaction device according to claim 9, characterized in that the masking amount determination module is specifically configured to:
    if the noise frequency is lower than the main frequency of the response voice, determine the masking amount to be a first masking amount;
    if the noise frequency is higher than the main frequency of the response voice, determine the masking amount to be a second masking amount;
    the first masking amount being greater than the second masking amount.
  11. The voice interaction device according to any one of claims 7 to 10, characterized in that the voice interaction device further comprises:
    a feedback unit, configured to acquire interactive experience feedback information;
    a volume adjustment unit, configured to adjust the volume at which the response voice is played according to the interactive experience feedback information.
  12. The voice interaction device according to any one of claims 7 to 10, characterized in that, when the voice interaction instruction is triggered by voice control information, the playback unit is specifically configured to:
    acquire the time node at which the voice interaction instruction triggered by the voice control information was received;
    after a preset duration has elapsed since the time node, play the response voice at the determined volume.
  13. A smart terminal, characterized by comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1 to 6.
  14. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer-executable instructions for causing a smart terminal to perform the method according to any one of claims 1 to 6.
  15. A computer program product, characterized in that the computer program product comprises a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a smart terminal, cause the smart terminal to perform the method according to any one of claims 1 to 6.
PCT/CN2017/119039 2017-12-27 2017-12-27 Voice interaction method and device and intelligent terminal WO2019127112A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2017/119039 WO2019127112A1 (en) 2017-12-27 2017-12-27 Voice interaction method and device and intelligent terminal
CN201780003279.0A CN108369805B (en) 2017-12-27 2017-12-27 Voice interaction method and device and intelligent terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/119039 WO2019127112A1 (en) 2017-12-27 2017-12-27 Voice interaction method and device and intelligent terminal

Publications (1)

Publication Number Publication Date
WO2019127112A1 true WO2019127112A1 (en) 2019-07-04

Family

ID=63011271

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/119039 WO2019127112A1 (en) 2017-12-27 2017-12-27 Voice interaction method and device and intelligent terminal

Country Status (2)

Country Link
CN (1) CN108369805B (en)
WO (1) WO2019127112A1 (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109283851A (en) * 2018-09-28 2019-01-29 广州智伴人工智能科技有限公司 A kind of smart home system based on robot
CN109489803B (en) * 2018-10-17 2020-09-01 浙江大学医学院附属邵逸夫医院 Intelligent environmental noise analysis and reminding device
CN109582271B (en) * 2018-10-26 2020-04-03 北京蓦然认知科技有限公司 Method, device and equipment for dynamically setting TTS (text to speech) playing parameters
CN109614028A (en) * 2018-12-17 2019-04-12 百度在线网络技术(北京)有限公司 Exchange method and device
CN110113497B (en) * 2019-04-12 2022-01-11 深圳壹账通智能科技有限公司 Voice call-out method, device, terminal and storage medium based on voice interaction
CN111240634A (en) * 2020-01-08 2020-06-05 百度在线网络技术(北京)有限公司 Sound box working mode adjusting method and device
CN112306448A (en) * 2020-01-15 2021-02-02 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for adjusting output audio according to environmental noise
CN112307161B (en) * 2020-02-26 2022-11-22 北京字节跳动网络技术有限公司 Method and apparatus for playing audio
CN111554317B (en) * 2020-05-11 2024-04-09 美智纵横科技有限责任公司 Voice broadcasting method, equipment, computer storage medium and system
US11386888B2 (en) 2020-07-17 2022-07-12 Blue Ocean Robotics Aps Method of adjusting volume of audio output by a mobile robot device
CN114666446B (en) * 2020-12-22 2024-06-11 北京达佳互联信息技术有限公司 Information prompting method and device
CN114724319B (en) * 2021-01-04 2024-01-30 中国石油化工股份有限公司 System and method for intelligently adjusting alarm volume and frequency of well site

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1265217A (en) * 1997-07-02 2000-08-30 西莫克国际有限公司 Method and appts. for speech enhancement in speech communication system
CN1620751A (en) * 2000-08-14 2005-05-25 清晰音频有限公司 Voice enhancement system
US20080164942A1 (en) * 2007-01-09 2008-07-10 Kabushiki Kaisha Toshiba Audio data processing apparatus, terminal, and method of audio data processing
CN101578848A (en) * 2006-12-29 2009-11-11 摩托罗拉公司 Methods and devices for adaptive ringtone generation
KR20100047740A (en) * 2008-10-29 2010-05-10 주식회사 대우일렉트로닉스 Method and apparatus for controlling volume
CN107124149A (en) * 2017-05-05 2017-09-01 北京小鱼在家科技有限公司 A kind of method for regulation of sound volume, device and equipment

Also Published As

Publication number Publication date
CN108369805A (en) 2018-08-03
CN108369805B (en) 2019-08-13

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 17936052; Country of ref document: EP; Kind code of ref document: A1

NENP Non-entry into the national phase
Ref country code: DE

32PN EP: public notification in the EP bulletin as address of the addressee cannot be established
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 16.11.2020)

122 EP: PCT application non-entry in European phase
Ref document number: 17936052; Country of ref document: EP; Kind code of ref document: A1