
WO2019150708A1 - Information processing device, information processing system, information processing method, and program - Google Patents

Information processing device, information processing system, information processing method, and program Download PDF

Info

Publication number
WO2019150708A1
WO2019150708A1 (PCT/JP2018/042409)
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
information processing
timing
nod
unit
Prior art date
Application number
PCT/JP2018/042409
Other languages
English (en)
Japanese (ja)
Inventor
富士夫 荒井
祐介 工藤
秀憲 青木
元 濱田
佐藤 直之
邦在 鳥居
Original Assignee
ソニー株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ソニー株式会社 filed Critical ソニー株式会社
Publication of WO2019150708A1


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/10: Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L17/00: Speaker identification or verification techniques

Definitions

  • The present disclosure relates to an information processing apparatus, an information processing system, an information processing method, and a program. More specifically, it relates to an information processing apparatus, an information processing system, an information processing method, and a program that perform processing and respond based on the speech recognition result of a user utterance.
  • Devices that perform such voice recognition include mobile devices such as smartphones, as well as smart speakers, agent devices, signage devices, and the like. When smart speakers, agent devices, or signage devices are used, there are often many people around the device.
  • Such a voice recognition device is required to identify the speaker (the speaking user) addressing the device and to provide a service corresponding to that speaker.
  • Patent Document 1: Japanese Patent Application Laid-Open No. 11-24691
  • Patent Document 2: Japanese Patent Application Laid-Open No. 2008-146054
  • For example, a method is known of identifying a speaker by analyzing the sound quality (frequency characteristics / voiceprint) of the voice input to a device.
  • Patent Document 3 (Japanese Patent Laid-Open No. 08-235358) discloses a method of identifying a speaker by detecting mouth movement from an image captured by a camera or the like. However, this method cannot identify the speaker when several people are moving their mouths or when the speaker's mouth cannot be photographed.
  • Patent citations: Patent Document 1: JP 11-24691 A; Patent Document 2: JP 2008-146054 A; Patent Document 3: JP 08-235358 A
  • The present disclosure has been made in view of the above-described problems, and it aims to provide an information processing apparatus, an information processing system, an information processing method, and a program capable of identifying a speaker with high accuracy under various circumstances using both sound and images.
  • The first aspect of the present disclosure is an information processing apparatus having a speaker identification unit that executes speaker identification processing,
  • wherein the speaker identification unit performs at least one of the following: (a) speaker identification processing based on a comparison between the speech recognition result for an uttered speech and a lip reading recognition result obtained by analyzing the utterance from the speaker's lip movement, or (b) speaker identification processing based on an analysis result of the speaker's motion or of the motion of a listener listening to the speaker's speech.
  • The second aspect of the present disclosure is an information processing system having an information processing terminal and a server,
  • wherein the information processing terminal has a voice input unit, an image input unit, and a communication unit that transmits the audio acquired through the voice input unit and the camera-captured image acquired through the image input unit to the server;
  • the server has a speech recognition unit that performs speech recognition processing on the speaker's voice received from the information processing terminal,
  • and an image recognition unit that analyzes the camera-captured image received from the information processing terminal;
  • the speech recognition unit generates a character string indicating the utterance content of the speaker and, based on the character string, generates nod timing estimation data for the speaker and the listener;
  • the image recognition unit generates actual nod timing data that records the nod timing of the nod performer included in the camera-captured image;
  • and the system performs speaker identification processing based on the degree of coincidence between the nod timing estimation data and the actual nod timing data.
  • The third aspect of the present disclosure is an information processing method executed in an information processing apparatus,
  • wherein the information processing apparatus has a speaker identification unit that executes speaker identification processing,
  • and the speaker identification unit performs at least one of the following: (a) speaker identification processing based on a comparison between the speech recognition result for an uttered speech and a lip reading recognition result obtained by analyzing the utterance from the speaker's lip movement, or (b) speaker identification processing based on an analysis result of the speaker's motion or of the motion of a listener listening to the speaker's speech.
  • The fourth aspect of the present disclosure is an information processing method executed in an information processing system having an information processing terminal and a server, in which: the information processing terminal transmits the voice acquired through its voice input unit and the camera-captured image acquired through its image input unit to the server; the server executes speech recognition processing on the voice received from the information processing terminal, generates a character string indicating the utterance content of the speaker, and generates nod timing estimation data for the speaker and the listener based on the character string; the server analyzes the camera-captured image received from the information processing terminal and generates actual nod timing data that records the nod timing of the nod performer included in the camera-captured image; and at least one of the information processing terminal and the server executes speaker identification processing based on the degree of coincidence between the nod timing estimation data and the actual nod timing data.
  • The fifth aspect of the present disclosure is a program for executing information processing in an information processing apparatus,
  • wherein the information processing apparatus has a speaker identification unit that executes speaker identification processing,
  • and the program causes the speaker identification unit to execute at least one of the following: (a) speaker identification processing based on a comparison between the speech recognition result for an uttered speech and a lip reading recognition result obtained by analyzing the utterance from the speaker's lip movement, or (b) speaker identification processing based on an analysis result of the speaker's motion or of the motion of a listener listening to the speaker's speech.
  • The program of the present disclosure can be provided, for example, via a storage medium or a communication medium in a computer-readable format to an information processing apparatus or a computer system capable of executing various program codes.
  • By providing the program in a computer-readable format, processing corresponding to the program is realized on the information processing apparatus or the computer system.
  • Note that in this specification a system is a logical set of a plurality of devices and is not limited to a configuration in which the devices are housed in the same casing.
  • According to the configuration of an embodiment of the present disclosure, the speaker identification unit that executes the speaker identification processing performs (a) speaker identification processing that compares the speech recognition result for the uttered voice with the lip reading recognition result obtained by analyzing the utterance from the speaker's lip movement, or (b) speaker identification processing based on the analysis result of the motion of the speaker or the listener. For example, it compares the nod timing estimation data of a speaker or listener, estimated from the utterance character string obtained as the speech recognition result, with the actual nod timing data of the nod performer included in the camera-captured image.
  • FIG. 2 is a diagram illustrating a configuration example and a usage example of the information processing apparatus, and a diagram explaining the processing performed by the information processing apparatus of the present disclosure.
  • FIG. 25 is a diagram describing a configuration example of the information processing apparatus.
  • FIG. 11 is a flowchart describing a sequence of processing executed by the information processing apparatus; further figures show a flowchart explaining the sequence of the speech recognition processing performed by the information processing apparatus and a diagram explaining the nod timing analysis processing performed by the information processing apparatus.
  • FIG. 11 is a flowchart describing the sequence of the image recognition processing executed by the information processing apparatus; further figures explain the several patterns of information required in the speaker identification processing of the present disclosure, show sequence diagrams of the processing when speaker identification is performed using a cloud-side server, and explain a configuration example of an information processing system.
  • FIG. 25 is a diagram describing a hardware configuration example of the information processing apparatus.
  • FIG. 1 is a diagram illustrating a processing example of an information processing apparatus 10 that recognizes and responds to a user utterance made by a speaker 1.
  • The information processing apparatus 10 executes processing based on the speech recognition result of the user utterance.
  • In this example, the information processing apparatus 10 outputs the following system response.
  • System response “Tomorrow in Osaka, the afternoon weather is fine, but there may be a shower in the evening.”
  • The information processing apparatus 10 executes speech synthesis processing (TTS: Text To Speech) to generate and output the system response.
  • The information processing apparatus 10 generates and outputs the response using knowledge data acquired from a storage unit in the apparatus or knowledge data acquired via a network.
  • The information processing apparatus 10 illustrated in FIG. 1 includes a camera 11, a microphone 12, a display unit 13, and a speaker 14, and is capable of audio input/output and image input/output.
  • The information processing apparatus 10 illustrated in FIG. 1 is called, for example, a smart speaker or an agent device.
  • The information processing apparatus 10 according to the present disclosure is not limited to the agent device 10a; it may take various device forms, such as a smartphone 10b, a PC 10c, or a signage device installed in a public place.
  • The information processing apparatus 10 recognizes the utterance of the speaker 1 and responds based on it, and also controls external devices 30, such as the television and air conditioner shown in FIG. 2, according to the user utterance. For example, when the user utterance is a request such as "change the TV channel to 1" or "set the air conditioner temperature to 20 degrees", the information processing apparatus 10 outputs a control signal (Wi-Fi, infrared light, etc.) to the external device 30 based on the voice recognition result of the user utterance and executes the requested control.
  • The information processing apparatus 10 is connected to the server 20 via a network and can acquire from the server 20 the information necessary for generating a response to the user utterance. The apparatus may also be configured so that the server performs the speech recognition processing and the semantic analysis processing.
  • An information processing device that responds to user utterances, such as an agent device or a signage device, is required to identify the speaker addressing it and to provide a service corresponding to that speaker.
  • In the present disclosure, voice data and image data are used to identify the speaker.
  • When the information processing apparatus 10 can photograph the mouth of the speaker 31, the speaker's voice is input via the microphone 12 and the speaker's mouth is photographed by the camera 11.
  • Lip reading processing based on the movement of the mouth of the speaker 31 is then performed to analyze the content of the utterance.
  • The utterance content obtained by the lip reading processing is compared with the voice recognition result of the uttered voice data, and the speaker is identified based on the degree of matching.
  • Such processing enables the information processing apparatus 10 to identify the speaker with high accuracy under various situations.
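  • The comparison in this step can be sketched as follows (a minimal Python sketch; the similarity measure, threshold, and function name are illustrative assumptions, not taken from the patent):

    from difflib import SequenceMatcher

    def is_same_utterance(asr_text: str, lip_text: str, threshold: float = 0.7) -> bool:
        # Compare the speech recognition result with the lip reading result and
        # decide whether they describe the same utterance.
        match_rate = SequenceMatcher(None, asr_text, lip_text).ratio()
        return match_rate >= threshold

    # If the two strings match closely enough, the person whose mouth was
    # photographed is identified as the speaker.
    print(is_same_utterance("what is the weather in osaka tomorrow",
                            "what is the weather in osaka tomorow"))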
  • FIG. 4 is a diagram illustrating a configuration example of the information processing apparatus 10 that recognizes a user utterance and responds.
  • The information processing apparatus 10 includes a voice input unit 101, a voice recognition unit 102, an utterance meaning analysis unit 103, an image input unit 105, an image recognition unit 106, a speaker identification unit 111, a speaker analysis unit 112, a speaker data storage unit 113, a storage unit (user database, etc.) 114, a response generation unit 121, a system utterance speech synthesis unit 122, a display image generation unit 123, an output (speech, image) control unit 124, a speech output unit 125, and an image output unit 126. Note that all of these components can be configured inside a single information processing apparatus 10, but part of the configuration and functions may reside in another information processing apparatus or an external server.
  • The user's uttered voice and ambient sounds are input to the voice input unit 101, such as a microphone.
  • The voice input unit (microphone) 101 passes the voice data, including the input user utterance, to the voice recognition unit 102.
  • The voice recognition unit 102 has, for example, an ASR (Automatic Speech Recognition) function and converts the voice data into a character string (text data) composed of a plurality of words.
  • The text data generated by the speech recognition unit 102 is input to the utterance meaning analysis unit 103.
  • The speech recognition unit 102 also generates nod timing estimation data for the speaker and listener based on the speech recognition result. This processing is described in detail later.
  • The utterance meaning analysis unit 103 selects and outputs user intention candidates included in the text.
  • The utterance meaning analysis unit 103 has a natural language understanding function such as NLU (Natural Language Understanding) and estimates, from the text data, the intention (intent) of the user utterance and the meaningful elements (entities) included in the utterance.
  • Based on this utterance meaning analysis, the information processing apparatus 10 can perform accurate processing on the user utterance.
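  • For the weather request shown in FIG. 1, the result of such an analysis might look like the following (a hypothetical sketch; the intent and entity names are illustrative and not defined in the patent):

    # Hypothetical NLU output for a weather question about Osaka tomorrow.
    nlu_result = {
        "intent": "CheckWeather",      # estimated intention of the user utterance
        "entities": {                  # meaningful elements contained in the utterance
            "location": "Osaka",
            "date": "tomorrow",
        },
    }
    # The response generation unit 121 would use a structure like this to build
    # a system response such as the weather answer shown above.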
  • The response generation unit 121 generates a response to the user based on the utterance meaning analysis result estimated by the utterance meaning analysis unit 103.
  • The response consists of at least one of sound and image.
  • The voice information generated by the system utterance voice synthesis unit 122 through speech synthesis processing (TTS: Text To Speech) is output via the voice output unit 125, such as a speaker.
  • The display image information generated by the display image generation unit 123 is output via the image output unit 126, such as a display. The output of sound and image is controlled by the output (speech, image) control unit 124.
  • The image output unit 126 includes, for example, a display such as an LCD or organic EL display, or a projector that performs projection display.
  • The information processing apparatus 10 can also output and display images on externally connected devices such as televisions, smartphones, PCs, tablets, AR (Augmented Reality) devices, VR (Virtual Reality) devices, and other home appliances.
  • The image input unit 105 is a camera and inputs images of the speaker and listeners.
  • The image recognition unit 106 receives the captured image from the image input unit 105 and performs image analysis. For example, it performs lip reading processing by analyzing the movement of the speaker's lips and generates a character string corresponding to the utterance content from the result. It also acquires the nod timing of the speaker and listeners from the image. Details of these processes are described later.
  • The speaker identification unit 111 executes the speaker identification processing.
  • The speaker identification unit 111 receives the following data: (1) the speech recognition result (character string) of the speaker's utterance generated by the speech recognition unit 102 and the nod timing estimation information of the speaker and listener; (2) the estimated utterance content (character string) based on the speaker's lip reading result acquired by the image recognition unit 106 and the nod timing information of the speaker and listener.
  • The speaker identification unit 111 uses at least one of these pieces of information to execute the speaker identification processing. Specific examples of the speaker identification processing are described later.
  • The speaker analysis unit 112 analyzes the external characteristics (age, sex, body shape, etc.) of the speaker identified by the speaker identification unit 111 and searches the user database registered in advance in the storage unit 114 to identify the registered speaker.
  • The speaker data storage unit 113 stores the speaker information and the utterance content in the storage unit (user database, etc.) 114.
  • In the storage unit (user database, etc.) 114, in addition to a user database holding speaker information such as age, gender, and body type for each user, the utterance content of each speaker is recorded in association with each user (speaker).
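  • One plausible record layout for such a user database is sketched below (Python; all field names and values are illustrative assumptions):

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class SpeakerRecord:
        user_id: str
        age: int                      # estimated age of the speaker
        gender: str                   # estimated gender
        body_type: str                # estimated body shape
        utterances: List[str] = field(default_factory=list)  # utterance contents

    # Register a newly identified speaker and associate an utterance with them.
    record = SpeakerRecord(user_id="user-001", age=30, gender="female", body_type="medium")
    record.utterances.append("What will the weather be in Osaka tomorrow afternoon?")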
  • The processing shown in the flow of FIG. 5 and subsequent figures can be executed in accordance with a program stored in the storage unit of the information processing apparatus 10, for example as program execution processing by a processor such as a CPU having a program execution function.
  • Step S101: First, in step S101, the speech recognition unit 102 performs speech recognition processing on the speech input from the speech input unit 101.
  • The voice recognition unit 102 has, for example, an ASR (Automatic Speech Recognition) function and converts the voice data into a character string (text data) composed of a plurality of words.
  • Step S102: Next, in step S102, the speech recognition unit 102 acquires a character string that is the speech recognition result of the speaker's utterance.
  • The voice recognition unit 102 generates the character string of the utterance content acquired as the voice recognition result and outputs it to the speaker identification unit 111.
  • Step S103: In step S103, the voice recognition unit 102 generates estimation data of the nod timing of the speaker and listener based on the character string that is the speech recognition result. Details of this processing are described later.
  • Step S104: Next, the image recognition unit 106, which receives the captured image from the image input unit 105, analyzes the image and searches for a person estimated to be the speaker; specifically, it searches for a person whose lips are moving.
  • Step S105: Step S105 determines whether the search for the person estimated to be the speaker (the person moving their lips) based on the image in step S104 was successful. If the search for the estimated speaker succeeds, the process proceeds to step S111. If it fails, the process proceeds to step S121.
  • Step S111: When the search for the person estimated to be the speaker (the person moving their lips) succeeds, the image recognition unit 106 analyzes the movement of the estimated speaker's lips in step S111 and executes lip reading recognition processing. The image recognition unit 106 generates a character string of the utterance content based on the lip reading recognition processing and outputs it to the speaker identification unit 111.
  • Steps S112 to S113: The processing in step S112 is executed by the speaker identification unit 111.
  • The speaker identification unit 111 receives two character strings: (1) the character string of the utterance content generated by the voice recognition unit 102 based on voice recognition, and (2) the character string of the utterance content generated by the image recognition unit 106 based on lip reading recognition processing, and compares them.
  • (Step S113 = Yes) If the two character strings substantially match, the process proceeds to step S131, and the estimated speaker (the person moving their lips) detected based on the image in step S104 is identified as the speaker.
  • Specifically, if the matching rate between the speech recognition result character string and the lip reading recognition result character string is equal to or higher than a predetermined threshold, the estimated speaker (the person moving their lips) detected based on the image is identified as the speaker.
  • (Step S113 = No) If the two character strings do not match, the estimated speaker (the person moving their lips) detected based on the image in step S104 is determined not to be the speaker, and the process returns to step S104 to search for an estimated speaker again.
  • In this case, the estimated speaker detected based on the image in step S104 is excluded from the search targets, and only other persons are searched.
  • Step S121: The processing in step S121 is executed by the image recognition unit 106.
  • The image recognition unit 106, which receives the captured image from the image input unit 105, analyzes the image and searches for a person who nods. When a nodding person is found, the image recognition unit 106 generates nod time-series data, that is, a "nod timing time table", and outputs it to the speaker identification unit 111.
  • The "nod timing time table" is described in detail later.
  • Steps S122 to S123: The processing in step S122 is executed by the speaker identification unit 111.
  • The speaker identification unit 111 compares the following two data: (1) the nod timing estimation data of the speaker and listener, estimated in step S103 by the speech recognition unit 102 based on the character string that is the speech recognition result, and (2) the "nod timing time table" generated in step S121 by the image recognition unit 106 from the captured image of the image input unit 105.
  • (Step S123 = (a), matches the speaker's estimated nod timing) If the two nod timings substantially match on the speaker side, the process proceeds to step S131, and the "person who nods" detected based on the image in step S121 is identified as the speaker.
  • Specifically, if the matching rate between the time stamps of the speaker nod timing estimation data generated by the speech recognition unit 102 and the time stamps recorded in the "nod timing time table" generated by the image recognition unit 106 from the captured image is equal to or higher than a predetermined threshold, the "person who nods" detected based on the image is identified as the speaker.
  • (Step S123 = (b), matches the listener's estimated nod timing) If the matching rate between the time stamps of the listener nod timing estimation data, generated by the speech recognition unit 102 from the speech recognition result character string, and the time stamps recorded in the "nod timing time table" generated by the image recognition unit 106 from the captured image is equal to or higher than a predetermined threshold, the process proceeds to step S124.
  • If neither the speaker's nor the listener's nod timing estimation data matches the "nod timing time table" generated from the captured image, the nod performer detected based on the image in step S121 is determined to be neither the speaker nor a listener, and the process returns to step S104 to search for an estimated speaker again.
  • In this case, the nod performer detected based on the image in step S121 is excluded from the search targets, and only other persons are searched.
  • Step S124: Step S124 is executed when the listener nod timing estimation data, estimated in step S103 by the speech recognition unit 102 from the speech recognition result character string, substantially matches the "nod timing time table" generated by the image recognition unit 106 from the captured image of the image input unit 105.
  • In step S124, the speaker identification unit 111 determines that the nod performer detected from the image is a listener, acquires information on the listener's line of sight and head orientation from the image recognition unit 106, and, in step S131, identifies the person in that direction as the speaker.
  • Step S131: Step S131 is the speaker identification processing executed by the speaker identification unit 111.
  • The speaker identification unit 111 executes one of the following three types of speaker identification processing.
  • (1) Step S113 = Yes: When the speech recognition result character string and the lip reading recognition result character string substantially match, the estimated speaker (the person moving their lips) detected based on the image is identified as the speaker.
  • (2) Step S123 = (a), matches the speaker's estimated nod timing: When the speaker nod timing estimation data generated by the speech recognition unit 102 from the speech recognition result character string substantially matches the "nod timing time table" generated by the image recognition unit 106 from the captured image of the image input unit 105, the "person who nods" detected based on the image in step S121 is identified as the speaker.
  • (3) Step S123 = (b), matches the listener's estimated nod timing: When the listener nod timing estimation data substantially matches the "nod timing time table" generated from the captured image, the nod performer detected from the image is determined to be a listener, information on the listener's line of sight and head orientation is acquired from the image recognition unit 106, and the person ahead of that line of sight and head direction is identified as the speaker.
  • Step S132: Finally, after the speaker identification processing, the speaker identification unit 111 analyzes the characteristics of the speaker. The analysis result and the utterance content are then stored in the storage unit (user database, etc.) 114 via the speaker data storage unit 113.
  • The data stored in the storage unit (user database, etc.) 114 includes, for example, the external features of the speaker (age, gender, body type, etc.) analyzed from the image acquired by the image input unit 105 and the utterance content generated by the voice recognition unit 102.
  • As described above, the speech recognition unit 102 performs speech recognition processing on the speech input from the speech input unit 101 to generate a character string indicating the utterance content, and generates nod timing estimation data for the speaker and listener based on that character string.
  • The flow shown in FIG. 6 is a flowchart showing the procedure of these processes. The processing of each step is described below.
  • Step S201: First, in step S201, the voice recognition unit 102 receives the speaker's voice from the voice input unit 101.
  • Step S202: Next, the voice recognition unit 102 converts the speaker's voice input from the voice input unit 101 into a character string (text data) composed of a plurality of words.
  • Through this process, the utterance character string 251 shown in FIG. 6 is generated. This utterance character string 251 is input to the speaker identification unit 111.
  • Step S203: In step S203, the speech recognition unit 102 generates estimation data of the speaker's nod timing based on the character string that is the speech recognition result. Through this process, the speaker nod timing estimation time table 252 shown in FIG. 6 is generated and input to the speaker identification unit 111.
  • Step S204: Next, the voice recognition unit 102 generates estimation data of the listener's nod timing based on the character string that is the speech recognition result. This process generates the listener nod timing estimation time table 253 shown in FIG. 6, which is input to the speaker identification unit 111.
  • FIG. 7 shows the character string generated by the speech recognition unit 102 based on the uttered speech.
  • Character string: "Hello, it's good weather today. By the way, did you watch that movie? It's popular now and every movie theater seems to be full."
  • The voice recognition unit 102 estimates the nod timing of the speaker and the listener from this character string.
  • Specifically, the voice recognition unit 102 estimates the nod timing of the speaker and the listener from the positions of the punctuation marks, where the speech breaks.
  • In general, the speaker tends to nod along with their own speech; for example, a nod accompanying one's own utterance tends to occur immediately before a punctuation mark in the utterance sentence. The listener, on the other hand, often nods in response to the speaker's utterance, typically immediately after a punctuation mark.
  • Using this characteristic, the voice recognition unit 102 sets the speaker's estimated nod timing immediately before each punctuation mark and the listener's estimated nod timing immediately after each punctuation mark.
  • Based on this estimation, the speech recognition unit 102 generates the speaker nod timing estimation time table 252, in which nod timings are set immediately before the punctuation marks, and the listener nod timing estimation time table 253, in which nod timings are set immediately after the punctuation marks.
  • Each nod timing estimation time table is a table in which the times at which nodding is estimated to occur are sequentially recorded as "time stamps".
  • The voice recognition unit 102 obtains time information from the clock of the information processing apparatus 10 or via a network and generates a time table in which the times at which nodding is estimated are recorded as time stamps.
  • For example, the speaker nod timing estimation time table 252 is a table in which the time information (time stamps) at which nodding is estimated is recorded in time series.
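  • The generation of these estimation time tables can be sketched as follows (Python; mapping character positions to wall-clock times with a fixed speaking rate is an illustrative assumption; a real system would use timing information from the speech recognizer):

    import re
    from datetime import datetime, timedelta
    from typing import List, Tuple

    def estimate_nod_timetables(utterance: str,
                                utterance_start: datetime,
                                seconds_per_char: float = 0.15
                                ) -> Tuple[List[datetime], List[datetime]]:
        # Speaker nods are estimated just BEFORE each punctuation mark and
        # listener nods just AFTER it, following the tendency described above.
        speaker_nods: List[datetime] = []
        listener_nods: List[datetime] = []
        for m in re.finditer(r"[、。,.!?]", utterance):
            t = utterance_start + timedelta(seconds=m.start() * seconds_per_char)
            speaker_nods.append(t - timedelta(seconds=0.2))   # just before the break
            listener_nods.append(t + timedelta(seconds=0.3))  # just after the break
        return speaker_nods, listener_nods

    speaker_table, listener_table = estimate_nod_timetables(
        "Hello, it's good weather today. By the way, did you watch that movie?",
        datetime.now())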
  • FIG. 9 is a flowchart explaining the specific processing sequence of the image recognition processing. The processing of each step is described below.
  • Step S301: In step S301, the image recognition unit 106 receives the captured image from the image input unit 105.
  • Step S302: Next, the image recognition unit 106 analyzes the image and searches for a person estimated to be the speaker; specifically, it searches for a person whose lips are moving.
  • Step S303: Step S303 determines whether the search for the person estimated to be the speaker (the person moving their lips) based on the image in step S302 was successful. If the search succeeds, the process proceeds to step S304; if it fails, the process proceeds to step S311.
  • Step S304: If the search for the person estimated to be the speaker (the person moving their lips) succeeds, the image recognition unit 106 analyzes the movement of the estimated speaker's lips in step S304 and executes lip reading recognition processing. Based on the lip reading recognition processing, the image recognition unit 106 generates the utterance character string 351, a character string indicating the utterance content, and inputs it to the speaker identification unit 111.
  • Step S311: In step S311, the image recognition unit 106 analyzes the captured image from the image input unit 105 and searches for a person who nods. If a nodding person is found, the image recognition unit 106 generates nod time-series data, that is, the "nod timing time table 352", and outputs it to the speaker identification unit 111.
  • The nod timing time table 352 is a table in which the actual nod timings of the nodding person in the image are recorded; that is, it is actual nod timing data.
  • Its table configuration is the same as that of the nod timing estimation time tables described above.
  • Note that the "nod timing time table 352" generated by the image recognition unit 106 records the nod timings observed from the image, that is, the actual times at which nodding was performed.
  • In contrast, the nod timing estimation time tables 252 and 253 generated by the speech recognition unit 102 record the estimated times at which nodding is expected to occur, based on the character string indicating the utterance content. This is the processing described above with reference to FIG. 7.
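  • The degree of coincidence between an estimation time table and the actual nod timing time table can be computed, for example, as follows (one plausible Python sketch; the matching tolerance is an assumption):

    from datetime import datetime, timedelta
    from typing import List

    def nod_concordance_rate(estimated: List[datetime],
                             observed: List[datetime],
                             tolerance_s: float = 0.5) -> float:
        # Percentage of estimated nod times that have an observed nod
        # within +/- tolerance_s seconds.
        if not estimated:
            return 0.0
        tol = timedelta(seconds=tolerance_s)
        hits = sum(1 for t in estimated if any(abs(t - o) <= tol for o in observed))
        return 100.0 * hits / len(estimated)

    # The nod performer is judged to be the speaker (or a listener) when the
    # corresponding concordance rate reaches a predetermined threshold.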
  • FIG. 10 shows a table summarizing the minimum elements necessary for speaker identification.
  • Pattern 1 is a pattern in which the speaker is identified by lip reading recognition of the speaker. In pattern 1, lip reading recognition of the speaker is required; nod recognition is unnecessary for both the speaker and the listener; recognition of the listener's gaze is also unnecessary.
  • Pattern 2 is a pattern in which the speaker is identified by recognition of the speaker's nods. In pattern 2, lip reading recognition of the speaker is unnecessary; nod recognition is required for the speaker but not for the listener; recognition of the listener's gaze is unnecessary.
  • Pattern 3 is a pattern in which the speaker is identified from the listener's nod recognition and line-of-sight information. In pattern 3, lip reading recognition of the speaker is unnecessary; nod recognition is not required for the speaker but is required for the listener; recognition of the listener's gaze is required.
  • Although FIG. 10 lists only the minimum requirements of patterns 1 to 3, the accuracy of speaker identification can be improved by additionally using the items marked as not required. For example, accuracy can be improved by using not only the speaker's nod timing but also the listener's nod timing and the listener's gaze and head direction information. A sketch of one way such cues could be combined is shown below.
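  • The following sketch shows one plausible way of fusing the optional cues into a single speaker score (the cue set, normalization, and weights are assumptions; the patent does not specify a fusion method):

    def combined_speaker_score(lip_match: float,
                               speaker_nod_match: float,
                               listener_gaze_support: float,
                               weights=(0.5, 0.3, 0.2)) -> float:
        # All cues are assumed to be normalized to the range 0..1:
        #   lip_match             - similarity of ASR and lip reading strings
        #   speaker_nod_match     - concordance of speaker nod timing
        #   listener_gaze_support - whether listeners' gaze points at this person
        w_lip, w_nod, w_gaze = weights
        return w_lip * lip_match + w_nod * speaker_nod_match + w_gaze * listener_gaze_support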
  • In another method, an utterance rhythm is acquired from the speech, and if the rhythm of a person's head or body movement obtained from the image matches that rhythm, that person is identified as the speaker.
  • In another method, a character string is acquired from the speech, and the vibration pattern of the skin of the head, which depends on the vowels being uttered, is acquired from the image; the speaker is identified by matching the vowels of the character string with the skin vibration pattern.
  • In another method, speech break information is acquired from the voice, blink timing is acquired from the image, and the speaker is identified by matching the two.
  • In another method, the body movement obtained from the image is compared with the volume, pitch, and speed of the voice and their changes: the loudness of the voice and the magnitude of the movement, the pitch of the voice and the raising and lowering of the face, and the speed of the voice and the speed of the body movement are compared, and a person who shows a matching tendency is identified as the speaker.
  • In another method, a person who expresses the same emotion as the predicted emotion is identified by image recognition; that person is identified as a listener, and the person ahead of their line of sight or head direction is identified as the speaker.
  • A part of the processing of the above-described embodiment may be executed using a cloud-side device (server).
  • For example, the voice recognition processing and nod timing generation processing by the voice recognition unit 102, the lip reading recognition processing and nod detection processing by the image recognition unit 106, the comparison of character string recognition results, the nod timing comparison processing, and so on may be executed partly or entirely in a cloud-side device (server).
  • FIG. 11 shows a processing sequence in which the speaker is identified by voice recognition and lip reading recognition through processing involving communication between the information processing apparatus 10 and the server 20.
  • FIG. 12 shows a processing sequence in which the speaker is identified using nod timing through processing involving communication between the information processing apparatus 10 and the server 20.
  • Step S401: First, in step S401, the information processing apparatus 10 transmits a voice recognition request to the server 20 together with the acquired voice.
  • Step S402: Next, the server 20 transmits the speech recognition result to the information processing apparatus 10 in step S402.
  • The API used for the speech recognition processing in steps S401 to S402 is, for example, the following API.
  • GET /api/v1/recognition/audio: This API starts speech recognition and acquires the recognition result (text character string).
  • The data required for this API request is the audio data (audio stream data) input by the information processing apparatus 10 via the audio input unit 101.
  • The API response returned by the server 20 includes the speech recognition result character string (speech (string)), the result of the speech recognition processing executed in the speech recognition processing unit on the server 20 side.
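  • A hypothetical client-side call for steps S401 to S402 might look like this (the endpoint path and the speech field follow the patent text; the host name, the use of HTTP with the requests library, and the JSON response format are assumptions):

    import requests

    SERVER = "https://server.example.com"   # hypothetical server address

    def request_speech_recognition(audio_stream: bytes) -> str:
        # Step S401: send the acquired audio with a voice recognition request.
        resp = requests.get(f"{SERVER}/api/v1/recognition/audio", data=audio_stream)
        resp.raise_for_status()
        # Step S402: the server returns the speech recognition result character string.
        return resp.json()["speech"]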
  • Step S403: Next, in step S403, the information processing apparatus 10 transmits a lip reading recognition request to the server 20 together with the acquired image.
  • Step S404: Next, the server 20 transmits the lip reading recognition result to the information processing apparatus 10 in step S404.
  • The API used for the lip reading recognition processing in steps S403 to S404 is, for example, the following API.
  • GET /api/v1/recognition/lip: This API starts lip reading recognition and acquires the recognition result (text character string) and the speaker identifier (ID).
  • The API response returned by the server 20 includes the lip reading recognition result character string (lip (string)), the result of the lip reading recognition processing executed by the lip reading recognition processing unit on the server 20 side, and the speaker ID (speaker-id (string)).
  • The server 20 holds a user database in which user IDs and user feature information are recorded in association with each other.
  • The server 20 identifies the user based on the image data received from the information processing apparatus 10 and provides the user identifier (ID) to the information processing apparatus 10. For an unregistered user, a result indicating that the user is unknown is returned.
  • Step S405: Next, in step S405, the information processing apparatus 10 transmits to the server 20 a request to compare the speech recognition result (character string) with the lip reading recognition result (character string).
  • Step S406: Next, in step S406, the server 20 transmits the comparison result (match rate) between the speech recognition result (character string) and the lip reading recognition result (character string) to the information processing apparatus 10.
  • The API used for the comparison processing between the speech recognition result (character string) and the lip reading recognition result (character string) in steps S405 to S406 is, for example, the following API.
  • GET /api/v1/recognition/audio/comparison: This API compares the character string of the voice recognition result with the character string of the lip reading recognition result and acquires the matching result.
  • The data required for this API request are the speech recognition result character string (speech (string)) and the lip reading recognition result character string (lip (string)) that the information processing apparatus 10 previously received from the server 20.
  • The API response returned by the server 20 includes the character string match rate (concordance-rate (integer)), the result of the character string comparison processing executed by the character string comparison unit on the server 20 side.
  • Step S407: Next, the information processing apparatus 10 causes the speaker identification unit 111 to perform speaker identification processing based on the character string match rate (concordance-rate (integer)) received from the server 20. That is, when the matching rate between the speech recognition result character string (speech (string)) and the lip reading recognition result character string (lip (string)) is equal to or higher than a predetermined threshold, the estimated speaker (the person moving their lips) detected based on the image is identified as the speaker.
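  • Continuing the sketch above, steps S405 to S407 could be implemented on the client as follows (the endpoint and field names follow the patent text; the request encoding and the threshold value are assumptions):

    import requests

    SERVER = "https://server.example.com"   # hypothetical server address
    MATCH_THRESHOLD = 80                    # concordance-rate is an integer percentage

    def lip_moving_person_is_speaker(speech_text: str, lip_text: str) -> bool:
        # Steps S405-S406: ask the server to compare the two character strings.
        resp = requests.get(f"{SERVER}/api/v1/recognition/audio/comparison",
                            params={"speech": speech_text, "lip": lip_text})
        resp.raise_for_status()
        concordance_rate = resp.json()["concordance-rate"]
        # Step S407: if the match rate reaches the threshold, the person whose
        # lips were analyzed is identified as the speaker.
        return concordance_rate >= MATCH_THRESHOLD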
  • Step S408: Next, in step S408, the information processing apparatus 10 transmits a speaker information acquisition request to the server 20.
  • Step S409: Next, the server 20 transmits the speaker information to the information processing apparatus 10 in step S409.
  • The API used for the speaker information acquisition processing in steps S408 to S409 is, for example, the following API.
  • The data required for this API request are the image data (video stream data) input by the information processing apparatus 10 via the image input unit 105 and the speaker ID (speaker-id (string)) received from the server 20.
  • The API response returned by the server 20 includes the user information corresponding to the speaker ID, acquired from the user database on the server 20 side. Specifically, it consists of the following information on the speaker: gender (sex (string)), age (age (integer)), height (integer), and physical status (string). These are data registered in the user database on the server 20 side.
  • Step S410: Next, in step S410, the information processing apparatus 10 transmits a speaker information registration request to the server 20.
  • Step S411: Next, in step S411, the server 20 transmits a speaker information registration completion notification to the information processing apparatus 10.
  • The API used for the speaker information registration processing in steps S410 to S411 is, for example, the following API: POST /api/v1/recognition/speakers/{speaker-id}
  • The data required for the API request are the speaker ID (speaker-id (string)) and the voice recognition result character string (speech (string)) that the information processing apparatus 10 received from the server 20, and the image data (video stream data) input by the information processing apparatus 10 via the image input unit 105.
  • The API response returned by the server 20 includes the speaker registration ID (registered-speaker-id (string)).
  • By registering the speaker information, the following information on the speaker can be stored in the database, so that the speaker can be identified at the time of the next utterance analysis:
  • (a) face photograph, (b) voice quality.
  • In addition, by storing the following data, recognition can be tailored to the individual's characteristics at recognition time, improving recognition accuracy:
  • (c) mouth movement characteristics, (d) utterance sound characteristics, (e) utterance contents (tendencies of utterance contents).
  • The server 20 registers the speaker information (a) to (i) in the database.
  • Step S421: First, in step S421, the information processing apparatus 10 transmits a nod timing estimation data generation request to the server 20 together with the acquired voice.
  • Step S422: Next, in step S422, the server 20 transmits voice-based nod timing estimation time tables to the information processing apparatus 10.
  • The API used for the nod timing estimation time table generation processing in steps S421 to S422 is, for example, the following API.
  • GET /api/v1/recognition/audio/nodes: This API analyzes the character string obtained from the speech recognition result and acquires the nod timing estimation time tables (time stamp lists of the times at which nodding is estimated) and the nodding person's ID.
  • The data required for this API request is the audio data (audio stream data) input by the information processing apparatus 10 via the audio input unit 101.
  • The API response returned by the server 20 includes the voice-based nod timing estimation time tables acquired by the speech recognition processing unit on the server 20 side by analyzing the speech recognition result (character string).
  • The voice-based nod timing estimation time tables include two tables: the speaker nod timing estimation time table (speaker-nods (string-array)) and the listener nod timing estimation time table (listener-nods (string-array)). These are generated in the voice recognition unit of the server 20 in accordance with the processing described above with reference to FIG. 7; that is, the estimated nod timings are determined from the voice recognition result.
  • Step S423: Next, in step S423, the information processing apparatus 10 transmits a nod timing data generation request to the server 20 together with the acquired image.
  • Step S424: Next, in step S424, the server 20 transmits an image-based nod timing time table to the information processing apparatus 10.
  • The API used for the nod timing time table generation processing in steps S423 to S424 is, for example, the following API.
  • GET /api/v1/recognition/nodes: This API analyzes an image and acquires a nod timing time table (a time stamp list of the times at which nodding was performed) and the ID of the nodding person.
  • The data required for this API request is the image data (video stream data) input by the information processing apparatus 10 via the image input unit 105.
  • The API response returned by the server 20 includes the image-based nod timing time table (nod-timings (string-array)) acquired by the image recognition processing unit on the server 20 side by analyzing the image, and the nodding person's ID (string).
  • As described above, the server 20 holds a user database in which user IDs and user feature information are recorded in association with each other; it identifies the user based on the image data received from the information processing apparatus 10 and provides the user identifier (ID) to the information processing apparatus 10. For an unregistered user, a result indicating that the user is unknown is returned.
  • Step S425: Next, in step S425, the information processing apparatus 10 transmits to the server 20 a request to compare the two voice-based nod timing estimation time tables received in step S422, namely (A1) the voice-based speaker nod timing estimation time table and (A2) the voice-based listener nod timing estimation time table, with (V1) the image-based nod timing time table received in step S424.
  • Step S426: Next, the server 20 transmits the comparison result (match rate) between the two voice-based nod timing estimation time tables and the image-based nod timing time table to the information processing apparatus 10.
  • The API used for the nod timing table comparison processing in steps S425 to S426 is, for example, the following API.
  • GET /api/v1/recognition/nodes/comparison: This API compares the nod timings of the two voice-based nod timing estimation time tables with the nod timings of the image-based nod timing time table and acquires the matching result.
  • The data required for this API request are the following data that the information processing apparatus 10 received from the server 20:
  • (A1) the voice-based speaker nod timing estimation time table (speaker-nod (string-array)) received in step S422,
  • (A2) the voice-based listener nod timing estimation time table (listener-nod (string-array)), and
  • (V1) the image-based nod timing time table (nod-timing (string-array)) received in step S424.
  • The API response returned by the server 20 contains the result data of the nod timing time table comparison processing executed by the speaker identification unit on the server 20 side: the speaker nod match rate (speaker-concordance-rate (integer)) and the listener nod match rate (listener-concordance-rate (integer)).
  • The speaker-concordance-rate is the coincidence rate (%) obtained by comparing the nod timings of (A1) the voice-based speaker nod timing estimation time table (speaker-nod (string-array)) with those of (V1) the image-based nod timing time table (nod-timing (string-array)).
  • The listener-concordance-rate is the coincidence rate (%) obtained by comparing the nod timings of (A2) the voice-based listener nod timing estimation time table (listener-nod (string-array)) with those of (V1) the image-based nod timing time table (nod-timing (string-array)).
  • Step S427: Next, the information processing apparatus 10 causes the speaker identification unit 111 to execute speaker identification processing based on the speaker nod match rate (speaker-concordance-rate (integer)) and the listener nod match rate (listener-concordance-rate (integer)) received from the server 20.
  • If the speaker-concordance-rate is equal to or higher than a predetermined threshold, the nod performer detected from the image from which the image-based nod timing time table (nod-timing (string-array)) was generated is determined to be the speaker.
  • If the listener-concordance-rate is equal to or higher than a predetermined threshold, the nod performer detected from the image from which the image-based nod timing time table (nod-timing (string-array)) was generated is determined to be a listener. In this case, the processes of steps S428 to S430 are further executed to identify the speaker.
  • If a speaker is identified in step S427, the processes in steps S428 to S430 can be omitted. That is, if the speaker-concordance-rate (integer) is equal to or higher than the predetermined threshold and the speaker is identified in step S427, steps S428 to S430 may be omitted.
  • Steps S428 to S430: The processing of steps S428 to S430 is executed when the listener nod match rate (listener-concordance-rate (integer)) is equal to or higher than the predetermined threshold in step S427, that is, when the nod performer detected from the image from which the image-based nod timing time table was generated is determined to be a listener.
  • Step S428: Next, in step S428, the information processing apparatus 10 sends a listener gaze point detection request to the server. This is a request to detect the person (speaker) ahead of the line of sight of the nod performer (listener) detected from the image.
  • Step S429: Next, in step S429, the server 20 transmits to the information processing apparatus 10 the detection information of the person (speaker) ahead of the line of sight of the nod performer (listener) detected from the image. This detection is performed using the image data received from the information processing apparatus 10.
  • Step S430: Next, in step S430, the speaker identification unit 111 of the information processing apparatus 10 determines, based on the detection information received from the server 20, that the person ahead of the gaze of the nod performer (listener) detected from the image is the speaker.
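  • The decision logic of steps S427 to S430 can be sketched as follows (a hypothetical outline; the threshold and the way gaze information is represented are assumptions):

    from typing import Optional

    def identify_speaker_from_nods(speaker_concordance_rate: int,
                                   listener_concordance_rate: int,
                                   nod_person_id: str,
                                   gaze_target_person_id: Optional[str],
                                   threshold: int = 80) -> Optional[str]:
        if speaker_concordance_rate >= threshold:
            # Step S427 (a): the nod performer is the speaker.
            return nod_person_id
        if listener_concordance_rate >= threshold:
            # Steps S428-S430: the nod performer is a listener; the person at the
            # end of the listener's line of sight / head direction is the speaker.
            return gaze_target_person_id
        # Neither speaker nor listener; the search for a speaker continues.
        return None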
  • Steps S431 to S434: The following processes in steps S431 to S434 are the same as the processes in steps S408 to S411 described above with reference to FIG. 11.
  • In step S431, the information processing apparatus 10 transmits a speaker information acquisition request to the server 20.
  • Step S432: Next, the server 20 transmits the speaker information to the information processing apparatus 10 in step S432.
  • The response from the server 20 includes the user information corresponding to the speaker ID, acquired from the user database on the server 20 side. Specifically, it consists of the following information on the speaker: gender (sex (string)), age (age (integer)), height (integer), and physical status (string). These are data registered in the user database on the server 20 side.
  • Step S433: Next, in step S433, the information processing apparatus 10 transmits a speaker information registration request to the server 20.
  • Step S434: Next, the server 20 transmits a speaker information registration completion notification to the information processing apparatus 10.
  • The registration information as the speaker information includes the following data:
  • (a) face photograph, (b) voice quality, (c) mouth movement characteristics, (d) utterance characteristics, (e) utterance contents (tendencies of utterance contents),
  • (f) gender, (g) age, (h) height, (i) physical characteristics.
  • The server 20 registers the speaker information (a) to (i) in the database.
  • The processing described with reference to FIGS. 11 and 12 is the processing sequence for the configuration in which the voice recognition processing, nod timing generation processing, lip reading recognition processing, nod detection processing, character string recognition result comparison, nod timing comparison processing, and the like are executed in a cloud-side device (server).
  • (Agent device) First, a usage example as an agent device will be described. If there are multiple people in front of the agent device and one of them speaks to it, the agent device captures the lip movement of the speaker with a camera and converts the utterance into a character string using lip reading. At the same time, the voice input through a microphone is converted into a character string using voice recognition. By comparing these character strings, it can be determined whether or not the person whose lip movement was captured is the speaker.
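  • A minimal sketch of this character string comparison, using only the Python standard library, is shown below. The 0.7 similarity threshold is an assumption chosen for illustration, not a value given in the disclosure.

from difflib import SequenceMatcher

def is_speaker(lip_reading_text: str, speech_recognition_text: str,
               threshold: float = 0.7) -> bool:
    """True if the lip reading result matches the speech recognition result
    closely enough to regard the captured person as the speaker."""
    similarity = SequenceMatcher(None, lip_reading_text,
                                 speech_recognition_text).ratio()
    return similarity >= threshold

# Example: a close match suggests the person whose lips were captured is the speaker.
print(is_speaker("turn on the living room light",
                 "turn on the living room lights"))   # True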
  • Once the speaker is specified, the agent device can infer the speaker's gender, age, and physical characteristics and give more appropriate answers. For example, if the speaker is recognized as a child, the device can answer in an easy-to-understand tone tailored to the child, or it can introduce a restaurant serving generous portions to a young, well-built man.
  • The following processing also becomes possible by storing utterance feature information, such as the speaker's utterance content and manner of speaking, in the database together with the face photograph.
  • By inputting the utterance feature information to the speech recognition or lip reading recognition engine, recognition tailored to the speaker can be performed and the recognition accuracy can be improved.
  • Speaker identification is also effective in preventing the agent device from malfunctioning in response to its own speech output or to speech from a TV or radio.
  • By specifying the speaker according to the process of the present disclosure, the following processing can also be performed.
  • The age and gender of the speaker can be determined from the speaker's appearance, and a display tailored to the speaker can be presented.
  • In vending machines, it is possible to recommend products tailored to the speaker.
  • It is also possible to record both the utterance content and the appearance information of a person who showed interest in the displayed content, purchased a product, or gave an opinion, which is useful for data collection.
  • In a shop or the like, by using a security camera together with the above-described speaker specifying process, the utterance content of a customer regarding a product in the store can be recorded together with the characteristics of the speaker.
  • The utterance contents of a child can be easily reported to the parents, and the burden on the nursery school can be reduced.
  • The parents can check the child's utterance contents at any time. At school, by counting the number of utterances of each student, it is possible to determine whether a lesson is one in which only some students speak or one in which many students speak.
  • The speaker specifying process of the present disclosure can thus be used in various fields.
  • According to the processing of the present disclosure, a speaker can be specified from among an unspecified number of persons even if the speaker is not registered in advance, and the speaker can be specified even when the lip movement of the speaker cannot be photographed. By specifying the speaker, the following effects can be obtained.
  • (a) Acquiring external feature information such as the speaker's age and gender makes it possible to provide a service tailored to the speaker.
  • (b) Recording the speaker's utterance contents and habits and using them to train the recognition functions makes it possible to improve speech recognition and lip reading recognition accuracy.
  • (c) Providing a service only to a specific speaker makes it possible to give a specific person administrator authority.
  • (d) Utilizing the characteristics and content of utterances as log data allows the data to be used for market research, entertainment applications, and the discovery of conspiracies.
  • FIG. 13 illustrates an example of a system configuration for executing the processing of the present disclosure.
  • In information processing system configuration example 1 (FIG. 13A), almost all the functions of the information processing apparatus shown in FIG. 4 are implemented in a single apparatus, the information processing apparatus 410, which is a user terminal such as a user-owned smartphone or PC, or an agent device having voice input/output and image input/output functions.
  • The information processing apparatus 410 corresponding to the user terminal communicates with the application execution server 420 only when an external application is used, for example when generating a response sentence.
  • The application execution server 420 is, for example, a weather information providing server, a traffic information providing server, a medical information providing server, a tourist information providing server, or the like, and is configured by a server group that can provide information for generating a response to a user utterance.
  • FIG. 13B shows information processing system configuration example 2, in which some of the functions of the information processing apparatus shown in FIG. 4 are implemented in the information processing apparatus 410, an information processing terminal such as a smartphone, PC, or agent device owned by the user, and part of the processing is executed by a data processing server 460 that can communicate with the information processing terminal.
  • For example, a configuration is possible in which the audio input unit 101, the image input unit 105, the audio output unit 125, and the image output unit 126 of the apparatus shown in FIG. 4 are provided on the information processing apparatus 410 (information processing terminal) side, and all other functions are executed on the server side.
  • An audio input/output unit and an image input/output unit are provided in the information processing terminal such as a user terminal.
  • The data processing server then performs the speaker specifying process based on the voice and images received from the user terminal.
  • Alternatively, the server may generate the information necessary for the speaker specifying process and provide it to the information processing terminal, which then performs the speaker specifying process itself. Various configurations of this kind are possible.
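  • One possible way to express such a division of functions is sketched below in Python; the function names follow the units discussed above, and the particular assignment shown is only one example of the many configurations mentioned.

# Assignment of functional blocks to the terminal or the server (illustrative).
FUNCTION_PLACEMENT = {
    "voice_input": "terminal",
    "image_input": "terminal",
    "voice_output": "terminal",
    "image_output": "terminal",
    "speech_recognition": "server",
    "image_recognition": "server",
    "speaker_identification": "server",   # could equally run on the terminal
}

def runs_on_server(function_name: str) -> bool:
    """True if the named function is assigned to the server side."""
    return FUNCTION_PLACEMENT.get(function_name) == "server"

print(runs_on_server("speaker_identification"))   # True in this example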
  • The information processing terminal includes: a voice input unit; an image input unit; and a communication unit that transmits the sound acquired through the voice input unit and the camera-captured image acquired through the image input unit to the server.
  • The server includes: a speech recognition unit that performs speech recognition processing on the speaker's speech received from the information processing terminal; and an image recognition unit that analyzes the camera-captured image received from the information processing terminal.
  • The speech recognition unit generates a character string indicating the utterance content of the speaker, and generates nod timing estimation data for the speaker and the listener based on the character string.
  • The image recognition unit generates nod timing actual data that records the nod timing of the nod performer included in the camera-captured image.
  • A speaker specifying process is then executed based on the degree of coincidence between the nod timing estimation data and the nod timing actual data.
  • Such an information processing system can be configured.
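  • As a rough illustration of the exchange described above, the sketch below packages the captured voice and camera image on the terminal side and shows a server-side skeleton. The JSON encoding, field names, and stub structure are assumptions for this example; the disclosure only requires that the terminal send the voice and image and that the server perform the recognition and speaker identification.

import base64
import json

def build_request(audio_bytes: bytes, image_bytes: bytes) -> str:
    """Terminal side: package the microphone audio and the camera-captured image."""
    return json.dumps({
        "audio": base64.b64encode(audio_bytes).decode("ascii"),
        "image": base64.b64encode(image_bytes).decode("ascii"),
    })

def handle_request(request_json: str) -> str:
    """Server side skeleton: decode the payload and return a speaker decision."""
    payload = json.loads(request_json)
    audio = base64.b64decode(payload["audio"])
    image = base64.b64decode(payload["image"])
    # A real server would pass `audio` to the speech recognition unit (utterance
    # string plus nod timing estimation data) and `image` to the image recognition
    # unit (nod timing actual data), then apply a matching step such as the
    # identify_speaker() helper sketched earlier in this description.
    del audio, image   # stubs only; no recognizers are invoked in this sketch
    return json.dumps({"speaker_id": None})

print(handle_request(build_request(b"pcm-audio", b"jpeg-image")))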
  • The division of functions between the information processing terminal (user terminal) side and the server side can be set in various ways, and a configuration in which a single function is executed on both sides is also possible.
  • FIG. 14 shows an example of the hardware configuration of the information processing apparatus described above with reference to FIG. 4, and it is also an example of the hardware configuration of the information processing apparatus constituting the data processing server 460 described with reference to FIG. 13.
  • A CPU (Central Processing Unit) 501 functions as a control unit or data processing unit that executes various processes according to a program stored in a ROM (Read Only Memory) 502 or a storage unit 508. For example, it executes processing according to the sequences described in the above-described embodiments.
  • A RAM (Random Access Memory) 503 stores the programs executed by the CPU 501 and associated data.
  • The CPU 501, the ROM 502, and the RAM 503 are connected to one another by a bus 504.
  • The CPU 501 is connected to an input/output interface 505 via the bus 504.
  • An input unit 506 including various switches, a keyboard, a mouse, a microphone, and sensors, and an output unit 507 including a display and a speaker are connected to the input/output interface 505.
  • The CPU 501 executes various processes in response to commands input from the input unit 506 and outputs processing results to, for example, the output unit 507.
  • The storage unit 508 connected to the input/output interface 505 includes, for example, a hard disk and stores the programs executed by the CPU 501 and various data.
  • A communication unit 509 functions as a transmission/reception unit for Wi-Fi communication, Bluetooth (BT) communication, and other data communication via a network such as the Internet or a local area network, and communicates with external devices.
  • The drive 510 connected to the input/output interface 505 drives a removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory such as a memory card, and executes data recording or reading.
  • The technology disclosed in this specification can take the following configurations.
  • An information processing apparatus including a speaker specifying unit that executes at least one of: (a) a speaker specifying process based on a result of comparing a speech recognition result for an uttered speech with a lip reading recognition result obtained by analyzing the utterance from the lip movement of the speaker; and (b) a speaker specifying process based on an analysis result of the motion of the speaker or the motion of a listener listening to the speaker's speech.
  • The information processing apparatus according to (1) or (2), wherein, if it is impossible to detect the lip movement of the speaker from the camera-captured image, the speaker specifying unit executes (b) the speaker specifying process based on the analysis result of the motion of the speaker or the motion of a listener listening to the speaker's speech.
  • The information processing apparatus according to any one of (1) to (3), wherein the speaker specifying unit executes a process of comparing nod timing estimation data of the speaker or the listener, estimated based on the utterance character string obtained from the speech recognition result for the uttered speech, with nod timing actual data of the nod performer included in the camera-captured image, and determines whether the nod performer included in the camera-captured image is a speaker or a listener.
  • The information processing apparatus according to (4), wherein the speaker specifying unit determines that the nod performer is the speaker if the degree of coincidence between the speaker nod timing estimation data estimated based on the utterance character string obtained from the speech recognition result and the nod timing actual data is high, and determines that the nod performer is a listener if the degree of coincidence between the listener nod timing estimation data estimated based on the utterance character string obtained from the speech recognition result and the nod timing actual data is high.
  • The information processing apparatus according to (5), wherein, if it is determined that the nod performer is a listener, the speaker specifying unit determines that the person in the line of sight of the nod performer is the speaker.
  • The information processing apparatus according to any one of (1) to (6), further including a speech recognition unit that executes speech recognition processing on the uttered speech and lip reading recognition processing for analyzing the utterance from the lip movement of the speaker, wherein the speaker specifying unit receives the speech recognition result and the lip reading recognition result generated by the speech recognition unit and executes the speaker specifying process.
  • The information processing apparatus according to (7), wherein the speech recognition unit further generates nod timing estimation data of the speaker and the listener based on the utterance character string obtained from the speech recognition result for the uttered speech, and the speaker specifying unit executes a process of comparing the nod timing estimation data of the speaker and the listener generated by the speech recognition unit with the nod timing actual data of the nod performer included in the camera-captured image, and determines whether the nod performer included in the camera-captured image is a speaker or a listener.
  • The information processing apparatus according to any one of (1) to (8), further including an image recognition unit that analyzes, from the camera-captured image, the motion of the speaker or the motion of the listener listening to the speaker's speech, wherein the speaker specifying unit receives the analysis information generated by the image recognition unit and executes the speaker specifying process.
  • The information processing apparatus according to (9), wherein the image recognition unit generates a nod timing table that records the nod timing of the nod performer included in the camera-captured image, and the speaker specifying unit executes the speaker specifying process using the nod timing table generated by the image recognition unit.
  • The information processing apparatus according to any one of (1) to (10), further including: a speech recognition unit that performs speech recognition processing on the uttered speech; and an image recognition unit that analyzes a captured image of at least one of the speaker and the listener, wherein the speech recognition unit estimates the speaker's nod timing and the listener's nod timing based on the character string obtained by the speech recognition processing and generates a speaker nod timing estimation timetable and a listener nod timing estimation timetable that record the estimated nod timing data, the image recognition unit generates a nod timing timetable that records the nod timing of the nod performer included in the camera-captured image, and the speaker specifying unit determines that the nod performer is the speaker when the degree of coincidence between the speaker nod timing estimation timetable and the nod timing timetable is high, and determines that the nod performer is a listener when the degree of coincidence between the listener nod timing estimation timetable and the nod timing timetable is high.
  • An information processing system having an information processing terminal and a server, wherein: the information processing terminal includes a voice input unit, an image input unit, and a communication unit that transmits the audio acquired through the voice input unit and the camera-captured image acquired through the image input unit to the server; the server includes a speech recognition unit that performs speech recognition processing on the speaker's speech received from the information processing terminal, and an image recognition unit that analyzes the camera-captured image received from the information processing terminal; the speech recognition unit generates a character string indicating the utterance content of the speaker and generates nod timing estimation data of the speaker and the listener based on the character string; the image recognition unit generates nod timing actual data that records the nod timing of the nod performer included in the camera-captured image; and the speaker specifying process is executed based on the degree of coincidence between the nod timing estimation data and the nod timing actual data.
  • An information processing method executed in an information processing apparatus that includes a speaker specifying unit for executing a speaker specifying process, wherein the speaker specifying unit executes at least one of: (a) a speaker specifying process based on a result of comparing a speech recognition result for an uttered speech with a lip reading recognition result obtained by analyzing the utterance from the lip movement of the speaker; and (b) a speaker specifying process based on an analysis result of the motion of the speaker or the motion of a listener listening to the speaker's speech.
  • Further, in the information processing system having the information processing terminal and the server, the server executes speech recognition processing on the speaker's voice received from the information processing terminal, generates a character string indicating the utterance content of the speaker, generates nod timing estimation data of the speaker and the listener based on the character string, analyzes the camera-captured image received from the information processing terminal, and generates nod timing actual data that records the nod timing of the nod performer included in the camera-captured image.
  • A program for executing information processing in an information processing apparatus that includes a speaker specifying unit for executing a speaker specifying process, the program causing the speaker specifying unit to execute at least one of:
  • (a) a speaker specifying process based on a result of comparing a speech recognition result for an uttered speech with a lip reading recognition result obtained by analyzing the utterance from the lip movement of the speaker, or
  • (b) a speaker specifying process based on an analysis result of the motion of the speaker or the motion of a listener listening to the speaker's speech.
  • The series of processes described in the specification can be executed by hardware, by software, or by a combined configuration of both.
  • When the processing is executed by software, the program recording the processing sequence can be installed in a memory of a computer incorporated in dedicated hardware and executed, or it can be installed and run on a general-purpose computer capable of executing various kinds of processing.
  • the program can be recorded in advance on a recording medium.
  • the program can be received via a network such as a LAN (Local Area Network) or the Internet and installed on a recording medium such as a built-in hard disk.
  • The various processes described in the specification may be executed not only in time series according to the description but also in parallel or individually, according to the processing capability of the apparatus executing the processes or as necessary.
  • In this specification, a system is a logical set of a plurality of devices, and the devices of each configuration are not limited to being in the same casing.
  • As described above, the speaker specifying unit that executes the speaker specifying process performs (a) a speaker specifying process based on a result of comparing the speech recognition result for the uttered voice with the lip reading recognition result obtained by analyzing the utterance from the lip movement of the speaker, or (b) a speaker specifying process based on the analysis result of the motion of the speaker or the listener. For example, a process of comparing the nod timing estimation data of a speaker or listener, estimated based on the utterance character string obtained as the speech recognition result, with the nod timing actual data of the nod performer included in the camera-captured image is executed, and based on the degree of coincidence it is determined whether the nod performer included in the camera-captured image is a speaker or a listener. With this configuration, an apparatus and a method for executing the speaker specifying process with high accuracy under various circumstances are realized.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present disclosure provides a device and a method for accurately identifying a speaker under various conditions. A speaker specifying unit for identifying a speaker performs (a) speaker identification based on a comparison between speech recognition results for an uttered voice and lip reading recognition results for analyzing an utterance from the lip movement of a speaker, or (b) speaker identification based on the analysis results for the action of the speaker or of a listener. For example, the speaker specifying unit compares data estimating the times at which a speaker or a listener nods, estimated on the basis of an utterance character string obtained as the speech recognition results, with actual nod timing data for a person nodding in a camera-captured image, and determines whether the person nodding in the camera-captured image is a speaker or a listener on the basis of the degree of coincidence.
PCT/JP2018/042409 2018-02-01 2018-11-16 Dispositif de traitement d'informations, système de traitement d'informations, procédé de traitement d'informations et programme WO2019150708A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2018016131 2018-02-01
JP2018-016131 2018-07-24

Publications (1)

Publication Number Publication Date
WO2019150708A1 true WO2019150708A1 (fr) 2019-08-08

Family

ID=67478706

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2018/042409 WO2019150708A1 (fr) 2018-02-01 2018-11-16 Dispositif de traitement d'informations, système de traitement d'informations, procédé de traitement d'informations et programme

Country Status (1)

Country Link
WO (1) WO2019150708A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111091824A (zh) * 2019-11-30 2020-05-01 华为技术有限公司 一种语音匹配方法及相关设备
CN111816182A (zh) * 2020-07-27 2020-10-23 上海又为智能科技有限公司 助听语音识别方法、装置及助听设备

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007094104A (ja) * 2005-09-29 2007-04-12 Sony Corp 情報処理装置および方法、並びにプログラム
JP2011186351A (ja) * 2010-03-11 2011-09-22 Sony Corp 情報処理装置、および情報処理方法、並びにプログラム
JP2012059121A (ja) * 2010-09-10 2012-03-22 Softbank Mobile Corp 眼鏡型表示装置
JP2017116747A (ja) * 2015-12-24 2017-06-29 日本電信電話株式会社 音声処理システム、音声処理装置および音声処理プログラム
WO2017168936A1 (fr) * 2016-03-31 2017-10-05 ソニー株式会社 Dispositif de traitement d'informations, procédé de traitement d'informations, et programme

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007094104A (ja) * 2005-09-29 2007-04-12 Sony Corp 情報処理装置および方法、並びにプログラム
JP2011186351A (ja) * 2010-03-11 2011-09-22 Sony Corp 情報処理装置、および情報処理方法、並びにプログラム
JP2012059121A (ja) * 2010-09-10 2012-03-22 Softbank Mobile Corp 眼鏡型表示装置
JP2017116747A (ja) * 2015-12-24 2017-06-29 日本電信電話株式会社 音声処理システム、音声処理装置および音声処理プログラム
WO2017168936A1 (fr) * 2016-03-31 2017-10-05 ソニー株式会社 Dispositif de traitement d'informations, procédé de traitement d'informations, et programme

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111091824A (zh) * 2019-11-30 2020-05-01 华为技术有限公司 一种语音匹配方法及相关设备
WO2021104110A1 (fr) * 2019-11-30 2021-06-03 华为技术有限公司 Procédé de mise en correspondance vocale et dispositif associé
CN111091824B (zh) * 2019-11-30 2022-10-04 华为技术有限公司 一种语音匹配方法及相关设备
CN111816182A (zh) * 2020-07-27 2020-10-23 上海又为智能科技有限公司 助听语音识别方法、装置及助听设备

Similar Documents

Publication Publication Date Title
US11875820B1 (en) Context driven device arbitration
US12125483B1 (en) Determining device groups
US11600291B1 (en) Device selection from audio data
US11823679B2 (en) Method and system of audio false keyphrase rejection using speaker recognition
CN111344780B (zh) 基于上下文的设备仲裁
US10621991B2 (en) Joint neural network for speaker recognition
Czyzewski et al. An audio-visual corpus for multimodal automatic speech recognition
CN105940407B (zh) 用于评估音频口令的强度的系统和方法
Minotto et al. Multimodal multi-channel on-line speaker diarization using sensor fusion through SVM
US20190341053A1 (en) Multi-modal speech attribution among n speakers
US10838954B1 (en) Identifying user content
US10699706B1 (en) Systems and methods for device communications
JP2019533181A (ja) 通訳装置及び方法(device and method of translating a language)
KR20200090355A (ko) 실시간 번역 기반 멀티 채널 방송 시스템 및 이를 이용하는 방법
WO2019150708A1 (fr) Dispositif de traitement d'informations, système de traitement d'informations, procédé de traitement d'informations et programme
WO2019181218A1 (fr) Dispositif de traitement d'informations, système de traitement d'informations, procédé de traitement d'informations, et programme
US20220161131A1 (en) Systems and devices for controlling network applications
KR20210042520A (ko) 전자 장치 및 이의 제어 방법
JP2013257418A (ja) 情報処理装置、および情報処理方法、並びにプログラム
US11430429B2 (en) Information processing apparatus and information processing method
US20190066676A1 (en) Information processing apparatus
JP2021076715A (ja) 音声取得装置、音声認識システム、情報処理方法、及び情報処理プログラム
CN112513845B (zh) 用于将暂时账户与语音使能设备相关联的方法
WO2020017165A1 (fr) Dispositif de traitement d'informations, système de traitement d'informations, procédé de traitement d'informations, et programme
US20210271358A1 (en) Information processing apparatus for executing in parallel plurality of pieces of processing

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18903250

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18903250

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP