KR102241972B1

KR102241972B1 - Answering questions using environmental context

Info

Publication number: KR102241972B1
Application number: KR1020200092439A
Authority: KR
Inventors: Matthew Sharifi; Gheorghe Postelnicu
Original assignee: Google LLC
Priority date: 2012-09-10
Filing date: 2020-07-24
Publication date: 2021-04-20
Also published as: KR20140034034A; CN106250508A; CN106250508B; CN103714104A; WO2014039106A1; KR20200093489A; KR102029276B1; KR102140177B1; CN103714104B; KR20190113712A

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, are disclosed for receiving an utterance and environmental data, obtaining a transcription of the utterance, identifying an entity using the environmental data, submitting a query to a natural language query processing engine, wherein the query includes at least a portion of the transcription and data that identifies the entity, and obtaining one or more results for the query.

Description

Answering questions using environmental context {ANSWERING QUESTIONS USING ENVIRONMENTAL CONTEXT}

This application claims priority to U.S. Provisional Application Nos. 61/698,934 and 61/698,949, both filed September 10, 2012; U.S. Patent Application No. 13/626,439, filed September 25, 2012; U.S. Patent Application No. 13/626,351, filed September 25, 2012; and U.S. Patent Application No. 13/768,232, filed February 15, 2013, the entire contents of which are incorporated herein by reference.

This specification relates to identifying results of a query based on a natural language query and environmental information, for example, to answering questions using environmental information as context.

In general, a search query includes one or more terms that a user submits to a search engine when requesting that the search engine execute a search. Among other approaches, a user may enter the query terms of a search query by typing on a keyboard or, in the context of a voice query, by speaking the query terms into a microphone of a mobile device. Voice queries may be processed using speech recognition technology.

According to some innovative aspects of the subject matter described in this specification, environmental information (e.g., ambient noise) can help a query processing system answer a natural language query. For example, a user may ask a question about a television program that the user is watching (e.g., "Who is the main character in this movie?"). The user's mobile device detects the user's utterance and environmental information, which may include soundtrack audio of the television program. The mobile computing device encodes the utterance and the environmental information as waveform data and provides the waveform data to a server-based computing environment.

The computing environment separates the utterance from the environmental data in the waveform data and then obtains a transcription of the utterance. The computing environment further identifies entity data relating to the environmental data and the utterance, for example, by identifying the name of the movie. From the transcription and the entity data, the computing environment can then identify one or more results (e.g., results responsive to the user's question). In particular, the one or more results can include an answer to the user's question "What actor is in this movie?" (e.g., the name of the actor). The computing environment can provide these results to the user of the mobile computing device.

Innovative aspects of the subject matter described in this specification may be embodied in a method that includes the operations of receiving audio data encoding an utterance and environmental data, obtaining a transcription of the utterance, identifying an entity using the environmental data, submitting a query to a natural language query processing engine, wherein the query includes at least a portion of the transcription and data that identifies the entity, and obtaining one or more results for the query.

Other embodiments of these aspects include corresponding systems, apparatus, and computer programs, encoded on computer storage devices, that are configured to perform the operations of the method.

These and other embodiments may each optionally include one or more of the following features. For instance, a representation of at least one of the results is output. The entity is identified further using the utterance. The query is generated. Generating the query includes associating the transcription with the data that identifies the entity. The associating includes tagging the transcription with the data that identifies the entity. The associating further includes replacing a portion of the transcription with the data that identifies the entity. The replacing further includes replacing one or more words of the transcription with the data that identifies the entity. Receiving the environmental data further includes receiving environmental audio data, environmental image data, or both. Receiving the environmental audio data further includes receiving additional audio data that includes background noise.
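By way of illustration only, the two ways of associating the transcription with the entity data (tagging and replacement) might be sketched as follows. The `Entity` type, the `DEICTIC` pattern, and both function names are hypothetical and do not appear in this specification; the sketch assumes English transcriptions in which the user refers to the content with a phrase such as "this movie".

```python
import re
from dataclasses import dataclass


@dataclass
class Entity:
    """Hypothetical record for an entity identified from environmental data."""
    name: str          # e.g. a movie title
    content_type: str  # e.g. "movie"


# Assumed set of phrases that point at whatever is playing nearby.
DEICTIC = re.compile(r"\bthis (movie|show|song|film)\b", re.IGNORECASE)


def tag_query(transcription: str, entity: Entity) -> dict:
    """Associate by tagging: keep the text and attach the entity as metadata."""
    return {
        "text": transcription,
        "entity": entity.name,
        "type": entity.content_type,
    }


def substitute_query(transcription: str, entity: Entity) -> str:
    """Associate by replacement: swap the first deictic phrase for the entity name."""
    # A function is used as the replacement so that backslashes in the
    # entity name are never interpreted as regex escapes.
    return DEICTIC.sub(lambda _match: entity.name, transcription, count=1)
```

Under these assumptions, either form could be submitted to a natural language query processing engine: the tagged form leaves interpretation to the engine, while the substituted form yields a self-contained textual query.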

According to some innovative aspects of the subject matter described in this specification, an item of media content is identified based on environmental audio data and a spoken natural language query. For example, a user may ask a question about a television program that the user is watching (e.g., "What are we watching right now?"). The question may include keywords (e.g., "watching") that suggest that the question is about a television show and not about another type of media content. The user's mobile device detects the user's utterance and environmental data, which may include background audio of the television program. The mobile computing device encodes the utterance and the environmental data as waveform data and provides the waveform data to a server-based computing environment.

The computing environment separates the utterance from the environmental data in the waveform data and then processes the utterance to obtain a transcription of the utterance. From the transcription, the computing environment detects any content type-specific keywords (e.g., the keyword "watching"). The computing environment can then identify items of media content based on the environmental data and select, from the identified items, a particular item of multimedia content that matches the particular content type associated with the keywords. The computing environment provides a representation of the particular item of multimedia content to the user of the mobile computing device.

Innovative aspects of the subject matter described in this specification may be embodied in methods that include the operations of receiving (i) audio data that encodes a spoken natural language query and (ii) environmental audio data, obtaining a transcription of the spoken natural language query, determining a particular content type associated with one or more keywords in the transcription, providing at least a portion of the environmental audio data to a content recognition engine, and identifying a content item that has been output by the content recognition engine and that matches the particular content type.
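By way of illustration only, the sequence of operations recited above might be wired together as in the following sketch. The function `identify_content_item` and its three callable parameters are hypothetical stand-ins for a speech recognizer, a keyword-to-content-type lookup, and a content recognition engine; none of these names or signatures comes from this specification.

```python
from typing import Callable, Iterable, Optional, Tuple


def identify_content_item(
    utterance_audio: bytes,
    environmental_audio: bytes,
    transcribe: Callable[[bytes], str],
    content_type_for: Callable[[str], Optional[str]],
    recognize: Callable[[bytes], Iterable[Tuple[str, str]]],
) -> Optional[str]:
    """Return the first recognized item whose type matches the query's type.

    `recognize` is assumed to yield (title, content_type) pairs in the
    recognizer's own order of preference.
    """
    # Obtain a transcription of the spoken natural language query.
    transcription = transcribe(utterance_audio)
    # Determine the content type associated with keywords in the transcription.
    wanted_type = content_type_for(transcription)
    # Provide the environmental audio to the recognizer and keep a match.
    for title, content_type in recognize(environmental_audio):
        if wanted_type is None or content_type == wanted_type:
            return title
    return None
```

Passing the engines in as callables is a choice made for this sketch only; it keeps the example self-contained and lets each stage be replaced by a stub.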

Other embodiments of these aspects include corresponding systems, apparatus, and computer programs, encoded on computer storage devices, that are configured to perform the operations of the methods.

These and other embodiments may each optionally include one or more of the following features. For instance, the particular content type is a movie content type, a music content type, a television show content type, an audio podcast content type, a book content type, an artwork content type, a trailer content type, a video podcast content type, an Internet video content type, or a video game content type. Receiving the environmental audio data further includes receiving additional audio data that includes background noise. The background noise is associated with the particular content type. Additional environmental data that includes video data or image data is received. The video data or the image data is associated with the particular content type. Providing at least the portion of the environmental audio data to the content recognition engine further includes providing the portion of the environmental audio data to an audio fingerprinting engine. Determining the particular content type further includes identifying the one or more keywords using one or more databases that, for each of multiple content types, map at least one of the keywords to at least one of the multiple content types. The multiple content types include the particular content type, and the mapping further includes mapping at least one of the keywords to the particular content type. Data identifying the content item is output.

The features further include, for example, providing data identifying the particular content type to the content recognition engine, wherein identifying the content item further includes receiving data identifying the content item from the content recognition engine. Two or more content recognition candidates are received from the content recognition system, and identifying the content item further includes selecting a particular content recognition candidate based on the particular content type. Each of the two or more content recognition candidates is associated with a ranking score, and the method further includes adjusting the ranking scores of the two or more content recognition candidates based on the particular content type. The two or more content recognition candidates are ranked based on the adjusted ranking scores.
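By way of illustration only, adjusting and re-ranking candidate scores based on the particular content type might look like the following sketch. The `Candidate` type, the `rerank` function, and the multiplicative `boost` factor are assumptions made for this example; the specification does not state how the ranking scores are adjusted.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Candidate:
    """Hypothetical content recognition candidate."""
    title: str
    content_type: str
    score: float  # ranking score reported by the recognizer


def rerank(candidates: List[Candidate], wanted_type: str,
           boost: float = 1.5) -> List[Candidate]:
    """Boost candidates of the wanted type, then sort by adjusted score."""
    adjusted = [
        replace(c, score=c.score * boost) if c.content_type == wanted_type else c
        for c in candidates
    ]
    # Highest adjusted score first.
    return sorted(adjusted, key=lambda c: c.score, reverse=True)
```

In this sketch, a 'TV show' candidate scored 0.5 would overtake a 'music' candidate scored 0.6 once the keyword-derived content type is 'TV show', because its adjusted score becomes 0.75.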

The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other potential features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

FIG. 1 depicts an example system for identifying content item data based on environmental audio data and a spoken natural language query.
FIG. 2 depicts a flowchart of an example process for identifying content item data based on environmental audio data and a spoken natural language query.
FIGS. 3A and 3B depict portions of an example system for identifying a content item.
FIG. 4 depicts an example system for identifying media content items based on environmental image data and a spoken natural language query.
FIG. 5 depicts a system for identifying one or more results based on environmental audio data and an utterance.
FIG. 6 depicts a flowchart of an example process for identifying one or more results based on environmental data and an utterance.
FIG. 7 depicts a computer system and a mobile computer device that may be used to implement the techniques described in this specification.
Like reference symbols in the various drawings indicate like elements.

A computing environment that answers spoken natural language queries using environmental information as context may process the queries using multiple processes. In some example processes, illustrated in FIGS. 1 to 4, the computing environment can identify media content based on environmental information such as ambient noise. In other example processes, illustrated in FIGS. 5 and 6, the computing environment can augment the spoken natural language query with context derived from the environmental information, such as data identifying media content, in order to provide a more satisfying answer to the spoken natural language query.

More specifically, FIG. 1 depicts a system 100 for identifying content item data based on environmental audio data and a spoken natural language query. Briefly, the system 100 can identify content item data that is based on the environmental audio data and that matches a particular content type associated with the spoken natural language query. The system 100 includes a mobile computing device 102, a disambiguation engine 104, a speech recognition engine 106, a keyword mapping engine 108, and a content recognition engine 110. The mobile computing device 102 communicates with the disambiguation engine 104 over one or more networks. The mobile computing device 102 can include a microphone, a camera, or other detection mechanisms for detecting utterances from a user 112 and/or environmental data associated with the user 112.

In some examples, the user 112 is watching a TV program. In the illustrated example, the user 112 wants to know who directed the TV program that is currently being displayed. In some examples, the user 112 may not know the name of the currently displayed TV program and may therefore ask, "Who directed this show?" The mobile computing device 102 detects this utterance, together with environmental audio data associated with the environment of the user 112.

In some examples, the environmental audio data associated with the environment of the user 112 can include background noise of the environment of the user 112. For example, the environmental audio data can include the sounds of the TV program. In some examples, the environmental audio data associated with the currently displayed TV program can include audio of the currently displayed TV program (e.g., dialogue of the currently displayed TV program, soundtrack audio associated with the currently displayed TV program, and so on).

In some examples, the mobile computing device 102 detects the environmental audio data after detecting the utterance, detects the environmental audio data concurrently with detecting the utterance, or both. During operation (A), the mobile computing device 102 processes the detected utterance and the environmental audio data to generate waveform data 114 that represents the detected utterance and the environmental audio data, and transmits the waveform data 114 to the disambiguation engine 104 (e.g., over a network). In some examples, the environmental audio data is streamed from the mobile computing device 102.

The disambiguation engine 104 receives the waveform data 114 from the mobile computing device 102. During operation (B), the disambiguation engine 104 processes the waveform data 114, including separating (or extracting) the utterance from other portions of the waveform data 114, and transmits the utterance to the speech recognition engine 106 (e.g., over a network). For example, the disambiguation engine 104 separates the utterance ("Who directed this show?") from the background noise of the environment of the user 112 (e.g., the audio of the currently displayed TV program).

In some examples, the disambiguation engine 104 uses a voice detector to facilitate separating the utterance from the background noise by identifying a portion of the waveform data 114 that includes voice activity, or voice activity associated with the user of the computing device 102. In some examples, the utterance relates to a query (e.g., a query relating to the currently displayed TV program). In some examples, the waveform data 114 includes the detected utterance. In response, the disambiguation engine 104 can request, from the mobile computing device 102, the environmental audio data relating to the utterance.
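By way of illustration only, the idea of locating the portion of the waveform data that contains voice activity can be shown with a very simple energy-threshold detector. This is an assumed stand-in and not the voice detector of the specification: a plain energy threshold cannot tell a user's speech from loud television audio, and a practical detector would typically rely on a trained model. The function name, frame length, and threshold are hypothetical.

```python
from typing import List, Tuple


def voiced_spans(samples: List[float], frame_len: int = 160,
                 threshold: float = 0.02) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges whose mean frame energy exceeds threshold.

    `samples` is assumed to hold amplitudes normalized to [-1.0, 1.0].
    """
    spans: List[Tuple[int, int]] = []
    start = None  # start index of the span currently being built, if any
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / len(frame)
        if energy > threshold:
            if start is None:
                start = i          # a loud frame opens a new span
        elif start is not None:
            spans.append((start, i))  # a quiet frame closes the open span
            start = None
    if start is not None:
        spans.append((start, len(samples)))
    return spans
```

The returned ranges could then be used to slice the waveform into a candidate utterance portion and a remainder treated as environmental audio.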

The speech recognition engine 106 receives, from the disambiguation engine 104, the portion of the waveform data 114 that corresponds to the utterance. During operation (C), the speech recognition engine 106 obtains a transcription of the utterance and provides the transcription to the keyword mapping engine 108. Specifically, the speech recognition engine 106 processes the utterance received from the disambiguation engine 104. In some examples, processing of the utterance by the speech recognition engine 106 includes generating a transcription of the utterance. Generating the transcription of the utterance includes transcribing the utterance into text or text-related data. In other words, the speech recognition system 106 can provide a representation of the language of the utterance in written form.

For example, the speech recognition system 106 transcribes the utterance to generate the transcription "Who directed this show?" In other embodiments, the speech recognition system 106 provides two or more transcriptions of the utterance. For example, the speech recognition system 106 transcribes the utterance to generate the transcriptions "Who directed this show?" and "Who directed this shoe?"

The keyword mapping engine 108 receives the transcription from the speech recognition engine 106. During operation (D), the keyword mapping engine 108 identifies one or more keywords in the transcription that are associated with a particular content type and provides the particular content type to the disambiguation engine 104. In some embodiments, the one or more content types can include 'movie', 'music', 'TV show', 'audio podcast', 'image', 'artwork', 'book', 'magazine', 'trailer', 'video podcast', 'Internet video', or 'video game'.

For example, the keyword mapping engine 108 identifies the keyword "directed" from the transcription "Who directed this show?" The keyword "directed" is associated with the 'TV show' content type. In some embodiments, a keyword of the transcription that is identified by the keyword mapping engine 108 is associated with two or more content types. For example, the keyword "directed" is associated with the 'TV show' and 'movie' content types.

In some embodiments, the keyword mapping engine 108 identifies two or more keywords in the transcription that are associated with a particular content type. For example, the keyword mapping engine 108 identifies the keywords "directed" and "show" as associated with a particular content type. In some embodiments, the two or more identified keywords are associated with the same content type. For example, the identified keywords "directed" and "show" are both associated with the 'TV show' content type. In some embodiments, the two or more identified keywords are associated with differing content types. For example, the identified keyword "directed" is associated with the 'movie' content type, and the identified keyword "show" is associated with the 'TV show' content type. The keyword mapping engine 108 transmits the particular content type to the disambiguation engine 104 (e.g., over a network).

In some embodiments, the keyword mapping engine 108 identifies the one or more keywords in the transcription that are associated with the particular content type using one or more databases that, for each of multiple content types, map at least one of the keywords to at least one of the multiple content types. Specifically, the keyword mapping engine 108 includes (or is in communication with) a database (or multiple databases). The database includes, or is associated with, a mapping between keywords and content types. Specifically, the database provides a connection (e.g., a mapping) between keywords and content types such that the keyword mapping engine 108 is able to identify one or more keywords in the transcription that are associated with particular content types.

In some embodiments, one or more of the mappings between the keywords and the content types can be unidirectional (e.g., a one-way mapping, i.e., from the keywords to the content types). In some embodiments, one or more of the mappings between the keywords and the content types can be bidirectional (e.g., a two-way mapping, i.e., from the keywords to the content types and from the content types to the keywords). In some embodiments, the one or more databases map one or more keywords to two or more content types.
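By way of illustration only, a unidirectional keyword-to-content-type mapping can be stored as a dictionary, and the reverse direction needed for a bidirectional mapping can be derived by inverting it. The `KEYWORD_TO_TYPES` table and the `invert` function are hypothetical; only the association of "directed" with 'movie' and 'TV show' is taken from the description, and the remaining entry is made up for the example.

```python
from collections import defaultdict
from typing import Dict, Set

# One-way mapping: keyword -> content types. One keyword may map to
# two or more content types.
KEYWORD_TO_TYPES: Dict[str, Set[str]] = {
    "directed": {"movie", "TV show"},   # association given in the description
    "listening": {"music"},             # assumed entry, for illustration
}


def invert(mapping: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Derive the reverse direction: content type -> keywords."""
    inverse: Dict[str, Set[str]] = defaultdict(set)
    for keyword, content_types in mapping.items():
        for content_type in content_types:
            inverse[content_type].add(keyword)
    return dict(inverse)


# Holding both tables together gives a two-way mapping.
TYPE_TO_KEYWORDS = invert(KEYWORD_TO_TYPES)
```

Deriving the reverse table from the forward one, rather than maintaining both by hand, is a design choice of this sketch that keeps the two directions consistent.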

For example, the keyword mapping engine 108 uses one or more databases that map the keyword "directed" to the 'movie' and 'TV show' content types. In some embodiments, the mapping between the keywords and the content types includes mappings between multiple varying versions of a root keyword (i.e., a word family) and the content types. The different versions of the keyword can span different grammatical categories, such as tense (e.g., past, present, future) and part of speech (e.g., noun, verb). For example, the database can include mappings of the word family of the root word "direct", such as "directors", "direction", and "directed", to one or more content types.
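By way of illustration only, a word-family lookup of the kind described above might be sketched as follows. The tables `WORD_FAMILIES` and `ROOT_TO_TYPES` and the function `content_types` are hypothetical, the family members listed are examples rather than a complete list, and the sketch assumes English transcriptions.

```python
import re
from typing import Dict, Set

# Assumed word families: root keyword -> surface forms seen in transcriptions.
WORD_FAMILIES: Dict[str, Set[str]] = {
    "direct": {"direct", "directs", "directed", "directing",
               "director", "directors", "direction"},
    "show": {"show", "shows"},
}

# Assumed mapping from each root keyword to content types.
ROOT_TO_TYPES: Dict[str, Set[str]] = {
    "direct": {"movie", "TV show"},
    "show": {"TV show"},
}

# Flatten the families once so that each surface form resolves to its root.
FORM_TO_ROOT: Dict[str, str] = {
    form: root for root, forms in WORD_FAMILIES.items() for form in forms
}


def content_types(transcription: str) -> Set[str]:
    """Collect the content types of every keyword found in the transcription."""
    found: Set[str] = set()
    for token in re.findall(r"[a-z']+", transcription.lower()):
        root = FORM_TO_ROOT.get(token)
        if root is not None:
            found |= ROOT_TO_TYPES[root]
    return found
```

With these assumed tables, the transcription "Who directed this show?" resolves to both the 'movie' and 'TV show' content types, while a transcription that contains no listed form resolves to an empty set.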

The disambiguation engine 104 receives, from the keyword mapping engine 108, data identifying the particular content type associated with the transcription of the utterance. Furthermore, as mentioned above, the disambiguation engine 104 receives, from the mobile computing device 102, the waveform data 114 that includes the environmental audio data associated with the utterance. During operation (E), the disambiguation engine 104 then provides the environmental audio data and the particular content type to the content recognition engine 110.

For example, the disambiguation engine 104 transmits, to the content recognition engine 110, the environmental audio data relating to the currently displayed TV program, which includes audio of the currently displayed TV program (e.g., dialogue of the currently displayed TV program, soundtrack audio of the currently displayed TV program, and so on), and the particular content type of the transcription of the utterance (e.g., the 'TV show' content type).

In some embodiments, the disambiguation engine 104 provides a portion of the environmental audio data to the content recognition engine 110. In some examples, the portion of the environmental audio data can include background noise detected by the mobile computing device 102 after detecting the utterance. In some examples, the portion of the environmental audio data can include background noise detected by the mobile computing device 102 concurrently with detecting the utterance.

In some embodiments, the background noise (of the waveform data 114) is associated with the particular content type that is associated with a keyword of the transcription. For example, the keyword "directed" in the transcription "Who directed this show?" is associated with the 'TV show' content type, and the background noise (e.g., the environmental audio data relating to the currently displayed TV program) is also associated with the 'TV show' content type.

The content recognition engine 110 receives the environmental audio data and the particular content type from the disambiguation engine 104. During operation (F), the content recognition engine 110 identifies content item data that is based on the environmental audio data and that matches the particular content type, and provides the content item data to the disambiguation engine 104. Specifically, the content recognition engine 110 appropriately processes the environmental audio data to identify content item data associated with it (e.g., the name of a TV show, the name of a song, etc.). In addition, the content recognition engine 110 matches the identified content item data against the particular content type (e.g., the content type of the transcription of the utterance). The content recognition engine 110 transmits the identified content item data to the disambiguation engine 104 (e.g., over a network).

For example, the content recognition engine 110 identifies content item data that is based on the environmental audio data relating to the currently displayed TV program and that further matches the 'TV show' content type. To that end, depending on the portion of the environmental audio data it receives, the content recognition engine 110 identifies the content item data based on the dialogue of the currently displayed TV program or on the soundtrack audio associated with that program.

In some embodiments, the content recognition engine 110 is an audio fingerprinting engine that uses wavelet-based content fingerprinting to identify the content item data. Specifically, the content recognition engine 110 converts the waveform data 114 into a spectrogram. From the spectrogram, the content recognition engine 110 extracts spectral images. The spectral images can be represented as wavelets. For each spectral image extracted from the spectrogram, the content recognition engine 110 extracts the "top" wavelets based on their respective magnitudes. For each spectral image, the content recognition engine 110 computes a wavelet signature of the image. In some examples, the wavelet signature is a truncated, quantized version of the wavelet decomposition of the image.

For example, describing an m × n image with wavelets returns m × n wavelets without compression. Additionally, the content recognition engine 110 uses the subset of the wavelets that most characterizes the song. Specifically, the t "top" wavelets (by magnitude) are selected, where t << m × n. Furthermore, the content recognition engine 110 creates a compact representation of the resulting sparse wavelet vector, for example by using MinHash to compute sub-fingerprints for these sparse bit vectors.
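The selection and hashing steps described above can be illustrated with a minimal sketch. It assumes the wavelet decomposition has already been computed and flattened into a list of coefficients; the function names, the two-bits-per-coefficient sign encoding, and the permutation-based MinHash with 20 hashes are illustrative assumptions, not details given in this specification.

```python
import random

def top_wavelets(coeffs, t):
    """Keep the t largest-magnitude coefficients; map index -> sign (+1 or -1)."""
    ranked = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    return {i: (1 if coeffs[i] >= 0 else -1) for i in ranked[:t]}

def to_sparse_bits(top):
    """Encode each kept coefficient as one set bit in a vector of length 2 * n.

    Bit 2*i marks a positive coefficient at index i, bit 2*i + 1 a negative one
    (an assumed quantization, chosen only for illustration).
    """
    return {2 * i if sign > 0 else 2 * i + 1 for i, sign in top.items()}

def minhash(bits, size, num_hashes=20, seed=0):
    """For each random permutation, record the smallest permuted position of a set bit."""
    rng = random.Random(seed)
    signature = []
    for _ in range(num_hashes):
        perm = list(range(size))
        rng.shuffle(perm)
        signature.append(min(perm[b] for b in bits))
    return signature

# Hypothetical flattened wavelet coefficients of one spectral image (n = 8).
coeffs = [0.1, -4.0, 0.0, 2.5, -0.2, 3.1, 0.05, -0.3]
top = top_wavelets(coeffs, t=3)          # t << n in a realistic setting
bits = to_sparse_bits(top)
sub_fingerprint = minhash(bits, size=2 * len(coeffs))
```

Two spectral images whose sparse bit vectors overlap heavily tend to agree in many MinHash positions, which is why such a signature can serve as a compact lookup key.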

In some examples, when the environmental audio data includes at least the soundtrack audio associated with the currently displayed TV program, the content recognition engine 110 identifies content item data that is based on that soundtrack audio and that also matches the 'TV show' content type. Thus, in some examples, the content recognition engine 110 identifies content item data relating to the name of the currently displayed TV program. For example, the content recognition engine 110 can determine that a particular content item (e.g., a specific TV show) is associated with a theme song (e.g., the soundtrack audio) and that the particular content item (e.g., the specific TV show) matches the particular content type (e.g., the 'TV show' content type). The content recognition engine 110 can therefore identify data (e.g., the name of the specific TV show) that relates to the particular content item (e.g., the currently displayed TV program) based on the environmental audio data (e.g., the soundtrack audio) and that additionally matches the particular content type (e.g., the 'TV show' content type).

The disambiguation engine 104 receives the identified content item data from the content recognition engine 110. At operation (G), the disambiguation engine 104 then provides the identified content item data to the mobile computing device 102. For example, the disambiguation engine 104 transmits to the mobile computing device 102 the identified content item data relating to the currently displayed TV program (e.g., the name of the currently displayed TV program).

In some examples, one or more of the mobile computing device 102, the disambiguation engine 104, the speech recognition engine 106, the keyword mapping engine 108, and the content recognition engine 110 can be in communication with a subset (or each) of the mobile computing device 102, the disambiguation engine 104, the speech recognition engine 106, the keyword mapping engine 108, and the content recognition engine 110. In some embodiments, one or more of the disambiguation engine 104, the speech recognition engine 106, the keyword mapping engine 108, and the content recognition engine 110 can be implemented using one or more computing devices, such as one or more computing servers, a distributed computing system, or a server farm or cluster.

In some embodiments, as described above, the environmental audio data can be streamed from the mobile computing device 102 to the disambiguation engine 104. When the environmental audio data is streamed, the processing described above (e.g., operations (A)-(H)) is performed as the environmental audio data is received by the disambiguation engine 104 (i.e., it is performed incrementally). In other words, as each portion of the environmental audio data is received by the disambiguation engine 104 (e.g., by streaming), operations (A)-(H) are performed iteratively until content item data is identified.
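The incremental behavior described above can be sketched as a loop that retries recognition after each streamed portion arrives. This is only a sketch under assumptions: the names `identify_from_stream` and `toy_recognize` are hypothetical, accumulating the portions into one buffer is an assumed design, and the recognizer stands in for operations (A)-(H) as a whole.

```python
from typing import Callable, Iterable, Optional

def identify_from_stream(
    chunks: Iterable[bytes],
    recognize: Callable[[bytes], Optional[str]],
) -> Optional[str]:
    """Re-run recognition after each received portion; stop at the first hit."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk              # assumed: earlier portions are retained
        item = recognize(buffer)     # stands in for operations (A)-(H)
        if item is not None:
            return item              # content item data identified; stop iterating
    return None                      # stream ended without an identification

def toy_recognize(audio: bytes) -> Optional[str]:
    """Hypothetical recognizer that needs at least 6 bytes of audio to succeed."""
    return "Example TV Show" if len(audio) >= 6 else None

result = identify_from_stream([b"ab", b"cd", b"ef", b"gh"], toy_recognize)
```

In this toy run the third portion is enough for identification, so the fourth is never consumed.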

FIG. 2 depicts a flowchart of an example process 200 for identifying content item data based on environmental audio data and a spoken natural language query. The example process 200 can be executed by one or more computing devices. For example, the mobile computing device 102, the disambiguation engine 104, the speech recognition engine 106, the keyword mapping engine 108, and/or the content recognition engine 110 can be used to execute the example process 200.

Audio data encoding a spoken natural language query, together with environmental audio data, is received (202). For example, the disambiguation engine 104 receives the waveform data 114 from the mobile computing device 102. The waveform data 114 includes the user's spoken natural language query (e.g., "Who directed this show?") and the environmental audio data (e.g., audio of the currently displayed TV program). The disambiguation engine 104 separates the spoken natural language query ("Who directed this show?") from the background noise of the environment of the user 112 (e.g., the audio of the currently displayed TV program).

A transcription of the natural language query is obtained (204). For example, the speech recognition engine 106 transcribes the natural language query to generate a transcription of it (e.g., "Who directed this show?").

A particular content type that is associated with one or more keywords in the transcription is determined (206). For example, the keyword mapping engine 108 identifies one or more keywords (e.g., "directed") in the transcription (e.g., "Who directed this show?") that are associated with a particular content type (e.g., the 'TV show' content type). In some embodiments, the keyword mapping engine 108 identifies the one or more keywords in the transcription using one or more databases that, for each of multiple content types, map at least one of the keywords to at least one of the multiple content types. The database provides a connection (e.g., a mapping) between keywords (e.g., "directed") and content types (e.g., the 'TV show' content type).
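A keyword-to-content-type mapping of this kind can be sketched as a small in-memory table. The table name `KEYWORD_TO_TYPES`, the function `content_types_for`, and the whitespace-and-punctuation tokenization are assumptions for illustration; the specification only says that one or more databases provide the mapping.

```python
import re

# Hypothetical stand-in for the keyword/content-type database.
KEYWORD_TO_TYPES = {
    "directed": {"TV show"},
    "song": {"music", "singer"},
}

def content_types_for(transcription: str) -> dict:
    """Return each known keyword found in the transcription with its content types."""
    words = re.findall(r"[a-z']+", transcription.lower())
    return {w: KEYWORD_TO_TYPES[w] for w in words if w in KEYWORD_TO_TYPES}

matches = content_types_for("Who directed this show?")
```

A keyword may map to more than one content type, which is why the values are sets rather than single strings.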

At least a portion of the environmental audio data is provided to a content recognition engine (208). For example, the disambiguation engine 104 provides at least the portion of the environmental audio data encoded by the waveform data 114 (e.g., the audio of the currently displayed TV program) to the content recognition engine 110. In some examples, the disambiguation engine 104 also provides to the content recognition engine 110 the particular content type (e.g., the 'TV show' content type) that is associated with the one or more keywords (e.g., "directed") in the transcription.

A content item that is output by the content recognition engine and that matches the particular content type is identified (210). For example, the content recognition engine 110 identifies a content item, or content item data, that is based on the environmental audio data (e.g., the audio of the currently displayed TV program) and that matches the particular content type (e.g., the 'TV show' content type).
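Steps 202 through 210 of the example process 200 can be summarized in a short sketch. The three engines are passed in as plain callables and replaced with hypothetical stubs here; the function name `process_200`, the `(name, type)` tuple format for recognizer output, and the assumption that the utterance has already been separated from the background audio are all illustrative.

```python
from typing import Callable, List, Optional, Tuple

def process_200(
    utterance_audio: bytes,
    environmental_audio: bytes,
    transcribe: Callable[[bytes], str],
    content_type_for: Callable[[str], Optional[str]],
    recognize: Callable[[bytes], List[Tuple[str, str]]],
) -> Optional[str]:
    # (202) Both audio inputs have been received and are passed in as arguments.
    transcription = transcribe(utterance_audio)        # (204) obtain transcription
    content_type = content_type_for(transcription)     # (206) determine content type
    candidates = recognize(environmental_audio)        # (208) query recognition engine
    for name, item_type in candidates:                 # (210) keep the matching item
        if item_type == content_type:
            return name
    return None

# Hypothetical stubs standing in for engines 106, 108, and 110.
def fake_transcribe(audio: bytes) -> str:
    return "Who directed this show?"

def fake_content_type_for(text: str) -> Optional[str]:
    return "TV show" if "directed" in text.lower() else None

def fake_recognize(audio: bytes) -> List[Tuple[str, str]]:
    return [("Example Theme Song", "music"), ("Example TV Show", "TV show")]

item = process_200(b"utterance", b"tv-audio",
                   fake_transcribe, fake_content_type_for, fake_recognize)
```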

FIGS. 3a and 3b depict portions 300a and 300b, respectively, of a system for identifying content item data. Specifically, FIGS. 3a and 3b include disambiguation engines 304a and 304b, respectively, and content recognition engines 310a and 310b, respectively. The disambiguation engines 304a and 304b are similar to the disambiguation engine 104 of the system 100 depicted in FIG. 1, and the content recognition engines 310a and 310b are similar to the content recognition engine 110 of the system 100 depicted in FIG. 1.

FIG. 3a depicts the portion 300a, which includes the content recognition engine 310a. The content recognition engine 310a can identify a content item that is based on the environmental data and that matches the particular content type. In other words, the content recognition engine 310a appropriately processes the environmental data to identify content item data, and further selects one or more of the identified content item data such that the selected content item data matches the particular content type.

Specifically, during operation (A), the disambiguation engine 304a provides the environmental data and the particular content type to the content recognition engine 310a. In some embodiments, the disambiguation engine 304a provides a portion of the environmental data to the content recognition engine 310a.

The content recognition engine 310a receives the environmental data and the particular content type from the disambiguation engine 304a. Then, during operation (B), the content recognition engine 310a identifies content item data that is based on the environmental data and that matches the particular content type, and provides the identified content item data to the disambiguation engine 304a. Specifically, the content recognition engine 310a identifies content item data (e.g., the name of a TV show, the title of a song, etc.) based on the environmental data. The content recognition engine 310a then selects one or more of the identified content item data that match the particular content type. In other words, the content recognition engine 310a filters the identified content item data based on the particular content type. The content recognition engine 310a transmits the identified content item data to the disambiguation engine 304a (e.g., over a network).

In some examples, as described above with respect to FIG. 1, when the environmental data includes at least soundtrack audio associated with the currently displayed TV program, the content recognition engine 310a identifies content item data based on that soundtrack audio. The content recognition engine 310a then filters the identified content item data based on the 'TV show' content type. For example, the content recognition engine 310a identifies a 'theme song name' and a 'TV show name' associated with the soundtrack audio. The content recognition engine 310a then filters the identified content item data such that it also matches the 'TV show' content type. For example, the content recognition engine 310a selects the 'TV show name' identifying data and transmits it to the disambiguation engine 304a.

In some examples, the content recognition engine 310a selects a corpus (or index) based on the content type (e.g., the 'TV show' content type). Specifically, the content recognition engine 310a can have access to a first index relating to the 'TV show' content type and a second index relating to the 'movie' content type. The content recognition engine 310a appropriately selects the first index based on the 'TV show' content type. Thus, by selecting the first index (and not selecting the second index), the content recognition engine 310a can identify the content item data (e.g., the name of the TV show) more efficiently.
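Selecting an index by content type before searching can be sketched as follows. The `INDEXES` table, the string fingerprint keys, and the function `lookup` are hypothetical; the point is only that the 'movie' index is never consulted when the content type is 'TV show'.

```python
from typing import Optional

# Hypothetical per-content-type indexes, keyed by an opaque audio fingerprint.
INDEXES = {
    "TV show": {"fp-123": "Example TV Show"},
    "movie": {"fp-123": "Example Movie", "fp-456": "Another Example Movie"},
}

def lookup(fingerprint: str, content_type: str) -> Optional[str]:
    """Search only the index that corresponds to the requested content type."""
    index = INDEXES.get(content_type, {})   # an unknown type yields an empty index
    return index.get(fingerprint)

tv_name = lookup("fp-123", "TV show")
```

Restricting the search to one index both shrinks the search space and rules out matches of the wrong type, such as the movie entry that shares the same fingerprint key in this toy table.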

The disambiguation engine 304a receives the content item data from the content recognition engine 310a. For example, the disambiguation engine 304a receives the 'TV show name' identifying data from the content recognition engine 310a. Then, during operation (C), the disambiguation engine 304a provides the identifying data to a third party (e.g., the mobile computing device 102 of FIG. 1). For example, the disambiguation engine 304a provides the 'TV show name' identifying data to the third party.

FIG. 3b depicts the portion 300b, which includes the content recognition engine 310b. The content recognition engine 310b can identify content item data based on the environmental data. In other words, the content recognition engine 310b appropriately processes the environmental data to identify content item data based on it, and provides the content item data to the disambiguation engine 304b. The disambiguation engine 304b selects at least one of the identified content item data such that the selected content item data matches the particular content type.

Specifically, during operation (A), the disambiguation engine 304b provides the environmental data to the content recognition engine 310b. In some embodiments, the disambiguation engine 304b provides a portion of the environmental data to the content recognition engine 310b.

The content recognition engine 310b receives the environmental data from the disambiguation engine 304b. Then, during operation (B), the content recognition engine 310b identifies content item data based on the environmental data and provides the identified content item data to the disambiguation engine 304b. Specifically, the content recognition engine 310b identifies content item data associated with two or more content items (e.g., the name of a TV show, the title of a song, etc.) based on the environmental data. The content recognition engine 310b transmits two or more candidates representing the identified content item data to the disambiguation engine 304b (e.g., over a network).

In some examples, as described above with respect to FIG. 1, when the environmental data includes at least soundtrack audio associated with the currently displayed TV program, the content recognition engine 310b identifies content item data relating to two or more content items based on that soundtrack audio. For example, the content recognition engine 310b identifies a 'theme song name' and a 'TV show name' associated with the soundtrack audio, and transmits the 'theme song name' and 'TV show name' identifying data to the disambiguation engine 304b.

The disambiguation engine 304b receives the two or more candidates from the content recognition engine 310b. For example, the disambiguation engine 304b receives the 'theme song name' and 'TV show name' candidates from the content recognition engine 310b. Then, during operation (C), the disambiguation engine 304b selects one of the two or more candidates based on the particular content type and provides the selected candidate to a third party (e.g., the mobile computing device 102 of FIG. 1). Specifically, as described above with respect to FIG. 1, the disambiguation engine 304b can have previously received the particular content type (e.g., the one associated with the utterance). The disambiguation engine 304b selects, from the two or more candidates, the particular candidate that matches the particular content type. For example, the disambiguation engine 304b selects the 'TV show name' candidate because it matches the 'TV show' content type.

In some embodiments, the two or more candidates from the content recognition engine 310b are associated with ranking scores. A ranking score can be associated with any scoring metric determined by the disambiguation engine 304b. The disambiguation engine 304b can further adjust the ranking scores of the two or more candidates based on the particular content type. Specifically, the disambiguation engine 304b can increase the ranking score of one or more candidates when those candidates match the particular content type. For example, the ranking score of the 'TV show name' candidate can be increased because it matches the 'TV show' content type. Likewise, the disambiguation engine 304b can decrease the ranking score of one or more candidates when those candidates do not match the particular content type. For example, the ranking score of the 'theme song name' candidate can be decreased because it does not match the 'TV show' content type.

In some embodiments, the two or more candidates can be ranked based on their respective ranking scores as adjusted by the disambiguation engine 304b. For example, the disambiguation engine 304b can rank the 'TV show name' candidate above the 'theme song name' candidate because the 'TV show name' candidate has the higher adjusted ranking score. In some examples, the disambiguation engine 304b selects the highest-ranked candidate (i.e., the candidate with the highest adjusted ranking score).
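The score adjustment and ranking described above can be sketched as follows. The `Candidate` class, the additive `boost` and `penalty` values, and the initial scores are assumptions chosen for illustration; the specification states only that matching candidates can have their scores increased and non-matching candidates decreased.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    name: str
    content_type: str
    score: float

def rerank(candidates: List[Candidate], wanted_type: str,
           boost: float = 0.2, penalty: float = 0.2) -> List[Candidate]:
    """Raise scores of type-matching candidates, lower the rest, sort best first."""
    adjusted = [
        Candidate(
            c.name,
            c.content_type,
            c.score + (boost if c.content_type == wanted_type else -penalty),
        )
        for c in candidates
    ]
    return sorted(adjusted, key=lambda c: c.score, reverse=True)

# Hypothetical recognizer output: the theme song initially outranks the TV show.
candidates = [
    Candidate("Example Theme Song", "music", 0.6),
    Candidate("Example TV Show", "TV show", 0.5),
]
ranked = rerank(candidates, wanted_type="TV show")
best = ranked[0]   # the highest adjusted ranking score is selected
```

Here the content type flips the order: the TV show candidate ends up ahead even though its raw score was lower.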

FIG. 4 depicts a system 400 for identifying content item data based on environmental image data and a spoken natural language query. In short, the system 400 can identify content item data that is based on the environmental data and that matches a particular content type associated with the spoken natural language query. The system 400 includes a mobile computing device 402, a disambiguation engine 404, a speech recognition engine 406, a keyword mapping engine 408, and a content recognition engine 410, which are respectively similar to the mobile computing device 102, the disambiguation engine 104, the speech recognition engine 106, the keyword mapping engine 108, and the content recognition engine 110 of the system 100 depicted in FIG. 1.

In some examples, the user 112 is looking at the CD album cover of a movie soundtrack. In the illustrated example, the user 112 would like to know which songs are on the soundtrack. In some examples, the user 112 may not know the name of the movie soundtrack and may therefore ask "What songs are on this?" or "What songs are played in this movie?" The mobile computing device 402 detects this utterance, as well as environmental image data associated with the environment of the user 112.

In some examples, the environmental image data associated with the environment of the user 112 can include image data of that environment. For example, the environmental image data includes an image of the CD album cover, which depicts images related to the movie (e.g., an image of the movie poster of the associated movie). In some examples, the mobile computing device 402 detects the environmental image data using a camera of the mobile computing device 402 that captures an image (or video) of the CD album cover.

During operation (A), the mobile computing device 402 processes the detected utterance to generate waveform data 414 representing it, and transmits the waveform data 414 and the environmental image data to the disambiguation engine 404 (e.g., over a network).

During operation (B), the disambiguation engine 404 receives the waveform data 414 and the environmental image data from the mobile computing device 402. The disambiguation engine 404 processes the waveform data 414 and transmits the utterance to the speech recognition engine 406 (e.g., over a network). In some examples, the utterance relates to a query (e.g., a query about the movie soundtrack).

The speech recognition engine 406 receives the utterance from the disambiguation engine 404. During operation (C), the speech recognition engine 406 obtains a transcription of the utterance and provides the transcription to the keyword mapping engine 408. Specifically, the speech recognition engine 406 processes the utterance received from the disambiguation engine 404 by generating a transcription of the utterance.

For example, the speech recognition engine 406 transcribes the utterance to generate the transcription "What songs are on this?" In some embodiments, the speech recognition engine 406 provides two or more transcriptions of the utterance. For example, the speech recognition engine 406 transcribes the utterance to generate the transcriptions "What songs are on this?" and "What sinks are on this?"

The keyword mapping engine 408 receives the transcription from the speech recognition engine 406. During operation (D), the keyword mapping engine 408 identifies one or more keywords in the transcription that are associated with a particular content type and provides the particular content type to the disambiguation engine 404.

For example, the keyword mapping engine 408 identifies the keyword "songs" from the transcription "What songs are on this?" The keyword "songs" is associated with the 'music' content type. In some embodiments, a keyword of the transcription identified by the keyword mapping engine 408 is associated with two or more content types. For example, the keyword "songs" is associated with the 'music' and 'singer' content types. The keyword mapping engine 408 transmits the particular content type to the disambiguation engine 404 (e.g., over a network).

In some embodiments, similar to that described above, the keyword mapping engine 408 identifies the one or more keywords in the transcription that are associated with a particular content type using one or more databases that, for each of multiple content types, map at least one keyword to at least one of the multiple content types. For example, the keyword mapping engine 408 uses one or more databases that map the keyword "songs" to the 'music' and 'singer' content types.
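The keyword lookup described above can be sketched as follows. This is a minimal illustration only: the in-memory dictionary stands in for the one or more databases, and the table name, the function name, and every entry other than the "songs" example are assumptions rather than part of the specification.

```python
import re

# Hypothetical keyword table standing in for the keyword databases.
# Only the "songs" -> 'music'/'singer' association comes from the text;
# the remaining entries are illustrative assumptions.
KEYWORD_TO_CONTENT_TYPES = {
    "song": {"music", "singer"},
    "songs": {"music", "singer"},
    "show": {"television"},
    "movie": {"film"},
}


def content_types_in(transcription):
    """Return {keyword: content types} for each known keyword in the transcription."""
    words = re.findall(r"[a-z]+", transcription.lower())
    return {
        word: KEYWORD_TO_CONTENT_TYPES[word]
        for word in words
        if word in KEYWORD_TO_CONTENT_TYPES
    }
```

With this table, the transcription "What songs are on this?" yields the keyword "songs" with both content types, while the alternative transcription "What sinks are on this?" yields no keyword at all.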

The disambiguation engine 404 receives the particular content type associated with the transcription of the utterance from the keyword mapping engine 408. In addition, as mentioned above, the disambiguation engine 404 receives the environmental image data associated with the utterance. During operation (E), the disambiguation engine 404 provides the environmental image data and the particular content type to the content recognition engine 410.

For example, the disambiguation engine 404 transmits the environmental image data relating to the movie soundtrack (e.g., an image of the movie poster CD album cover) and the particular content type of the transcription of the utterance (e.g., the 'music' content type) to the content recognition engine 410.

The content recognition engine 410 receives the environmental image data and the particular content type from the disambiguation engine 404. During operation (F), the content recognition engine 410 then identifies content item data that is based on the environmental image data and that matches the particular content type, and provides the identified content item data to the disambiguation engine 404. Specifically, the content recognition engine 410 appropriately processes the environmental image data to identify content item data (e.g., a name of a content item). In addition, the content recognition engine 410 matches the identified content item with the particular content type (e.g., the content type of the transcription of the utterance). The content recognition engine 410 transmits the identified content item data to the disambiguation engine 404 (e.g., over a network).

For example, the content recognition engine 410 identifies data that is based on the environmental image data relating to the image of the movie poster CD album cover and that further matches the 'music' content type.

In some examples, when the environmental image data includes at least the movie poster associated with the CD album cover, the content recognition engine 410 identifies content item data that is based on the movie poster associated with the CD album cover and that also matches the 'music' content type. Thus, in some examples, the content recognition engine 410 identifies content item data relating to a name of the movie soundtrack. For example, the content recognition engine 410 can determine that a particular content item (e.g., a specific movie soundtrack) is associated with the movie poster, and that the particular content item (e.g., the specific movie soundtrack) matches the particular content type (e.g., the 'music' content type). Thus, the content recognition engine 410 can identify data (e.g., the name of the specific movie soundtrack) that relates to a particular content item (e.g., the specific movie soundtrack), that is based on the environmental image data (e.g., the image of the CD album cover), and that further matches the particular content type (e.g., the 'music' content type).
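The matching step can be illustrated with a short sketch. It assumes that image recognition has already produced a list of candidate content items, each labeled with a content type; the `Candidate` class, the function name, and the placeholder titles are hypothetical and only show how the content type narrows the candidates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical content item recognized from the environmental image data."""
    title: str
    content_type: str


def items_matching_type(candidates, content_type):
    """Keep only the candidates whose content type matches the requested one."""
    return [c.title for c in candidates if c.content_type == content_type]


# Placeholder candidates that an image of a movie poster CD album cover might yield.
candidates = [
    Candidate("Example Movie", "film"),
    Candidate("Example Movie: Original Soundtrack", "music"),
]
```

Calling `items_matching_type(candidates, "music")` keeps the soundtrack and drops the film, which mirrors how the 'music' content type selects the movie soundtrack rather than the movie itself.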

The disambiguation engine 404 receives the identified content item data from the content recognition engine 410. Then, at operation (G), the disambiguation engine 404 provides the identified content item data to the mobile computing device 402. For example, the disambiguation engine 404 transmits the identified content item data relating to the movie soundtrack (e.g., the name of the movie soundtrack) to the mobile computing device 402.

As mentioned above, FIGS. 1 to 4 illustrate several example processes in a computing environment that can identify media content (or other content) based on environmental information, such as ambient noise. Other processes for identifying content can also be used. In general, FIGS. 5 and 6 illustrate example processes in which a computing environment can augment a spoken natural language query with context derived from environmental information, such as data identifying media content, in order to provide a more satisfying answer to the spoken natural language query.

In more detail, FIG. 5 illustrates a system 500 for identifying one or more results based on environmental audio and an utterance. In some examples, the one or more results can represent one or more answers to a natural language query. The system 500 includes a mobile computing device 502, a coordination engine 504, a speech recognition engine 506, a content identification engine 508, and a natural language query processing engine 510. The mobile computing device 502 communicates with the coordination engine 504 over one or more networks. The mobile computing device 502 can include a microphone, a camera, or other detection mechanisms for detecting utterances from a user 512 and/or environmental data associated with the user 512.

Similar to the system 100 of FIG. 1, the user 512 is watching a television program. In the illustrated example, the user 512 would like to know who directed the television program (e.g., an entity) that is currently playing. In some examples, the user 512 may not know the name of the television program that is currently playing, and may therefore ask the question "Who directed this show?" The mobile computing device 502 detects this utterance, as well as environmental data associated with the environment of the user 512.

In some examples, the environmental data associated with the environment of the user 512 can include background noise of the environment of the user 512. For example, the environmental data includes the sounds of the television program (e.g., an entity). In some examples, the environmental data associated with the currently displayed television program can include audio of the currently displayed television program (e.g., dialogue of the currently displayed television program, soundtrack audio associated with the currently displayed television program, etc.). In some examples, the environmental data can include environmental audio data, environmental image data, or both. In some examples, the mobile computing device 502 detects the environmental audio data after detecting the utterance, detects the environmental audio data concurrently with detecting the utterance, or both. During operation (A), the mobile computing device 502 processes the detected utterance and the environmental data to generate waveform data 514 that represents the detected utterance and the detected environmental audio data (e.g., the sounds of the television program), and transmits the waveform data 514 to the coordination engine 504 (e.g., over a network).

The coordination engine 504 receives the waveform data 514 from the mobile computing device 502. During operation (B), the coordination engine 504 processes the waveform data 514, including separating (or extracting) the utterance from other portions of the waveform data 514, and transmits the portion of the waveform data 514 corresponding to the utterance to the speech recognition engine 506 (e.g., over a network). For example, the coordination engine 504 separates the utterance ("Who directed this show?") from the background noise of the environment of the user 512 (e.g., the audio of the currently displayed television program). In some examples, the coordination engine 504 can use a voice detector to facilitate separation of the utterance from the background noise by identifying a portion of the waveform data 514 that includes voice activity. In some examples, the utterance relates to a query (e.g., a query relating to the currently displayed television program).
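As a rough illustration of the separation step, the sketch below splits a waveform into louder and quieter frames using a fixed energy threshold. This is a deliberately crude stand-in for the voice detector, which the text does not specify: the function name, the frame size, and the threshold are all assumptions, and a practical voice detector would need more than frame energy to tell a user's speech apart from television dialogue.

```python
def split_by_energy(samples, frame_size, threshold):
    """Split samples into (likely utterance, likely background) by mean frame energy.

    Assumption: frames whose mean squared amplitude reaches the threshold are
    treated as the user's utterance; all other frames are treated as background.
    """
    utterance, background = [], []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        energy = sum(s * s for s in frame) / len(frame)
        if energy >= threshold:
            utterance.extend(frame)
        else:
            background.extend(frame)
    return utterance, background
```

The two returned lists correspond to the portion of the waveform data sent for transcription and the portion kept as environmental data.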

The speech recognition engine 506 receives the portion of the waveform data 514 corresponding to the utterance from the coordination engine 504. During operation (C), the speech recognition engine 506 obtains a transcription of the utterance and provides the transcription to the coordination engine 504. In particular, the speech recognition engine 506 appropriately processes the portion of the waveform data 514 corresponding to the utterance that is received from the coordination engine 504. In some examples, processing the portion of the waveform data 514 corresponding to the utterance by the speech recognition engine 506 includes generating a transcription of the utterance. Generating the transcription of the utterance can include transcribing the utterance into text or text-related data. In other words, the speech recognition engine 506 can provide a representation of the language of the utterance in written form.

For example, the speech recognition engine 506 transcribes the utterance to generate the transcription "Who directed this show?" In some embodiments, the speech recognition engine 506 provides two or more transcriptions of the utterance. For example, the speech recognition engine 506 can transcribe the utterance to generate the transcriptions "Who directed this show?" and "Who directed this shoe?"

The coordination engine 504 receives the transcription of the utterance from the speech recognition engine 506. Furthermore, as mentioned above, the coordination engine 504 receives the waveform data 514, which includes the environmental audio data associated with the utterance, from the mobile computing device 502. The coordination engine 504 then identifies an entity using the environmental data. In particular, the coordination engine 504 obtains data that identifies the entity from the content identification engine 508. To do so, during operation (D), the coordination engine 504 provides the environmental data and the portion of the waveform data 514 corresponding to the utterance to the content identification engine 508 (e.g., over a network).

For example, the coordination engine 504 transmits to the content identification engine 508 the environmental data relating to the currently displayed television program (e.g., the entity), which includes audio of the currently displayed television program (e.g., dialogue of the currently displayed television program, soundtrack audio associated with the currently displayed television program, etc.), and the portion of the waveform 514 corresponding to the utterance ("Who directed this show?").

In some embodiments, the coordination engine 504 provides a portion of the environmental data to the content identification engine 508. In some examples, the portion of the environmental data can include background noise detected by the mobile computing device 502 after detecting the utterance. In some examples, the portion of the environmental data can include background noise detected by the mobile computing device 502 concurrently with detecting the utterance.

The content identification engine 508 receives the environmental data and the portion of the waveform 514 corresponding to the utterance from the coordination engine 504. During operation (E), the content identification engine 508 identifies data that identifies the entity (e.g., content item data) based on the environmental data and the utterance, and provides the data identifying the entity to the coordination engine 504 (e.g., over a network). In particular, the content identification engine 508 appropriately processes the environmental data and the portion of the waveform 514 corresponding to the utterance to identify data that identifies the entity (e.g., content item data) associated with the environmental data (e.g., a name of a television show, a name of a song, etc.).

For example, the content identification engine 508 processes the environmental audio data to identify content item data associated with the currently displayed television program. In some embodiments, the content identification engine 508 is the system 100 of FIG. 1.

The coordination engine 504 receives the data identifying the entity (e.g., the content item data) from the content identification engine 508. Furthermore, as mentioned above, the coordination engine 504 receives the transcription from the speech recognition engine 506. During operation (F), the coordination engine 504 then provides a query that includes the transcription and the data identifying the entity to the natural language query processing engine 510 (e.g., over a network). For example, the coordination engine 504 provides a query that includes the transcription of the utterance ("Who directed this show?") and the content item data ('television show name') to the natural language query processing engine 510.

In some examples, the coordination engine 504 generates the query. In some examples, the coordination engine 504 obtains the query (e.g., from a third-party server). For example, the coordination engine 504 submits the transcription of the utterance and the data identifying the entity to the third-party server, and receives back a query based on the transcription and the data identifying the entity.

In some embodiments, generating the query by the coordination engine 504 can include associating the transcription of the utterance with the data identifying the entity (e.g., the content item data). In some examples, associating the transcription of the utterance with the content item data can include tagging the transcription with the data identifying the entity. For example, the coordination engine 504 can tag the transcription "Who directed this show?" with 'television show name' or with other identifying information associated with the content item data (e.g., an identification (ID) number). In some examples, associating the transcription of the utterance with the data identifying the entity includes substituting a portion of the transcription with the data identifying the entity. For example, the coordination engine 504 can substitute a portion of the transcription "Who directed this show?" with 'television show name' or with data identifying 'television show name'. In some examples, substituting a portion of the transcription with the data identifying the entity can include substituting one or more words of the transcription of the utterance with the data identifying the entity. For example, the coordination engine 504 can substitute 'television show name', or data identifying 'television show name', into the transcription "Who directed this show?" For example, the substitution can result in a transcription that includes "Who directed 'television show name'?" or "Who directed 'ID number'?"
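The two ways of associating the transcription with the entity, tagging and substitution, can be sketched as below. The function names, the dictionary layout of the tagged query, and the assumption that the referring phrase ("this show") has already been located are illustrative choices, not details given in the text.

```python
def tag_transcription(transcription, entity_id):
    """Tagging: keep the transcription intact and attach the entity alongside it."""
    return {"transcription": transcription, "entity": entity_id}


def substitute_entity(transcription, referring_phrase, entity_id):
    """Substitution: replace the first occurrence of the referring phrase with the entity.

    Assumption: the caller already knows which words refer to the entity.
    If the phrase is absent, the transcription is returned unchanged.
    """
    if referring_phrase not in transcription:
        return transcription
    return transcription.replace(referring_phrase, entity_id, 1)
```

For instance, substituting "this show" in "Who directed this show?" with 'television show name' gives "Who directed 'television show name'?", while tagging leaves the question as spoken and carries the entity as separate data.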

The natural language query processing engine 510 receives the query that includes the transcription and the data identifying the entity (e.g., the content item data) from the coordination engine 504. During operation (G), the natural language query processing engine 510 appropriately processes the query and, based on the processing, provides one or more results to the coordination engine 504 (e.g., over a network). In other words, the coordination engine 504 obtains one or more results for the query (e.g., from the natural language query processing engine 510).

In particular, the natural language query processing engine 510 obtains, from a collection of information resources, information resources that are relevant to the query (the transcription of the utterance and the content item data). In some examples, the natural language query processing engine 510 matches the query against database information (e.g., text documents, images, audio, video, etc.), and a score is calculated for how well each object in the database matches the query. The natural language query processing engine 510 identifies the one or more results based on the matched objects (e.g., objects having a score above a threshold score).
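The score-and-threshold step can be illustrated with a small sketch. The text does not say how the score is computed, so the word-overlap score below is an assumed placeholder, as are the function names and the sample documents; only the idea of keeping objects whose score exceeds a threshold comes from the description.

```python
def overlap_score(query, document):
    """Assumed scoring function: fraction of query words that appear in the document."""
    query_words = set(query.lower().split())
    document_words = set(document.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & document_words) / len(query_words)


def results_above_threshold(query, documents, threshold):
    """Return documents whose score is strictly above the threshold, best first."""
    scored = [(overlap_score(query, doc), doc) for doc in documents]
    kept = [pair for pair in scored if pair[0] > threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in kept]
```

With the query "director example show" and a threshold of 0.5, a document containing all three words is ranked first, a document containing two of them is kept, and an unrelated document is dropped.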

For example, the natural language query processing engine 510 receives the query that includes the transcription of the utterance "Who directed this show?" and the 'television show name' (or other identifying information). The natural language query processing engine 510 matches the query against the database information and provides one or more results that match the query. The natural language query processing engine 510 calculates a score for each of the matched objects.

The coordination engine 504 receives the one or more results from the natural language query processing engine 510. At operation (H), the coordination engine 504 then provides the one or more results to the mobile computing device 502 (e.g., over a network). For example, the coordination engine 504 transmits the one or more results (e.g., the name of the director of the television show) to the mobile computing device 502.

In some examples, one or more of the mobile computing device 502, the coordination engine 504, the speech recognition engine 506, the content identification engine 508, and the natural language query processing engine 510 can be in communication with a subset (or each) of the mobile computing device 502, the coordination engine 504, the speech recognition engine 506, the content identification engine 508, and the natural language query processing engine 510. In some embodiments, one or more of the coordination engine 504, the speech recognition engine 506, the content identification engine 508, and the natural language query processing engine 510 can be implemented by one or more computing devices, such as one or more computing servers, a distributed computing system, or a server farm or cluster.

FIG. 6 illustrates a flowchart of an example process 600 for identifying one or more results based on environmental data and an utterance. The example process 600 can be executed using one or more computing devices. For example, the mobile computing device 502, the coordination engine 504, the speech recognition engine 506, the content identification engine 508, and/or the natural language query processing engine 510 can be used to execute the example process 600.

Audio data that encodes an utterance and environmental data is received (602). For example, the coordination engine 504 receives the waveform data 514 from the mobile computing device 502. The waveform data 514 includes the utterance of the user (e.g., "Who directed this show?") and the environmental data (e.g., audio of the currently displayed television program). In some examples, receiving the environmental data can include receiving environmental audio data, environmental image data, or both. In some examples, receiving the environmental data includes receiving additional audio data that includes background noise.

A transcription of the utterance is obtained (604). For example, the coordination engine 504 obtains the transcription of the utterance using the speech recognition engine 506. The speech recognition engine 506 transcribes the utterance to generate the transcription of the utterance (e.g., "Who directed this show?").

An entity is identified using the environmental data (606). For example, the coordination engine 504 obtains data that identifies the entity using the content identification engine 508. The content identification engine 508 can appropriately process the environmental data (e.g., environmental audio data associated with the displayed television program) to identify data that identifies the entity (e.g., content item data) associated with the environmental data (e.g., a name of a television show, a title of a song, etc.). In some examples, the content identification engine 508 can further process the waveform 514 corresponding to the utterance (concurrently with or subsequent to processing the environmental data) to identify the entity.

In some examples, the coordination engine 504 generates a query. In some examples, generating the query by the coordination engine 504 can include associating the transcription of the utterance with the data identifying the entity. In some examples, associating the transcription of the utterance with the content item data can include substituting a portion of the transcription with the data identifying the entity. In some examples, substituting a portion of the transcription with the data identifying the entity can include substituting one or more words of the transcription of the utterance with the data identifying the entity.

The query is submitted to a natural language query processing engine (608). For example, the coordination engine 504 submits the query to the natural language query processing engine 510. The query can include at least a portion of the transcription and the data identifying the entity (e.g., the content item data). For example, the coordination engine 504 provides a query that includes the transcription of the utterance ("Who directed this show?") and the content item data ('television show name') to the natural language query processing engine 510.

One or more results for the query are obtained (610). For example, the coordination engine 504 obtains one or more results for the query (e.g., the name of the director of the television show) from the natural language query processing engine 510. In some examples, the coordination engine 504 then provides the one or more results to the mobile computing device 502.
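Steps (602) through (610) can be read as a single coordinating routine, sketched below. The function and parameter names are hypothetical, and each engine is passed in as a plain callable so that the sketch makes no assumption about how speech recognition, content identification, or query processing is implemented.

```python
def answer_question(waveform, separate, transcribe, identify_entity, build_query, run_query):
    """Hypothetical outline of process 600 as seen from a coordinating component.

    Each argument after `waveform` is a callable standing in for one engine.
    """
    # (602) Receive audio data that encodes the utterance and the environmental data.
    utterance_audio, environment_audio = separate(waveform)
    # (604) Obtain a transcription of the utterance.
    transcription = transcribe(utterance_audio)
    # (606) Identify an entity using the environmental data.
    entity = identify_entity(environment_audio)
    # (608) Build the query and submit it; (610) obtain one or more results.
    query = build_query(transcription, entity)
    return run_query(query)
```

Passing the engines as parameters keeps the ordering of the steps visible and lets any one engine be swapped or stubbed without touching the others.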

FIG. 7 shows an example of a generic computing device 700 and a generic mobile computing device 750 with which the techniques described here can be used. The computing device 700 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants (PDAs), servers, blade servers, mainframes, and other appropriate computers. The mobile computing device 750 is intended to represent various forms of mobile devices, such as PDAs, cellular telephones, smartphones, and other similar computing devices. The components shown here, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the technology described or claimed in this document.

The computing device 700 includes a processor 702, a memory 704, a storage device 706, a high-speed interface 708 connecting to the memory 704 and high-speed expansion ports 710, and a low-speed interface 712 connecting to a low-speed bus 714 and the storage device 706. Each of the components 702, 704, 706, 708, 710, and 712 is interconnected using various buses and can be mounted on a common motherboard or in other manners as appropriate. The processor 702 can process instructions for execution within the computing device 700, including instructions stored in the memory 704 or on the storage device 706, to display graphical information for a GUI on an external input/output device, such as a display 716 coupled to the high-speed interface 708. In other implementations, multiple processors and/or multiple buses can be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices 700 can be connected, with each device providing a portion of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).

The memory 704 stores information within the computing device 700. In one implementation, the memory 704 is a volatile memory unit or units. In another implementation, the memory 704 is a non-volatile memory unit or units. The memory 704 can also be another form of computer-readable medium, such as a magnetic or optical disk.

The storage device 706 can provide mass storage for the computing device 700. In one implementation, the storage device 706 can be a floppy disk device, a hard disk device, an optical disk device, a tape device, a flash memory or other similar solid-state memory device, or an array of devices, including devices in a storage area network or other configurations. A computer program product can be tangibly embodied in an information carrier. The computer program product can also contain instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 704, the storage device 706, or memory on the processor 702.

The high-speed controller 708 manages bandwidth-intensive operations for the computing device 700, while the low-speed controller 712 manages lower bandwidth-intensive operations. Such an allocation of functions is an example only. In one implementation, the high-speed controller 708 is coupled to the memory 704, to the display 716 (e.g., through a graphics processor or accelerator), and to the high-speed expansion ports 710, which can accept various expansion cards (not shown). In some implementations, the low-speed controller 712 is coupled to the storage device 706 and to the low-speed expansion port 714. The low-speed expansion port, which can include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), can be coupled to one or more input/output devices, such as a keyboard, a pointing device, or a scanner, or to a networking device such as a switch or router, e.g., through a network adapter.

The computing device 700 can be implemented in a number of different forms, as shown in the figure. For example, it can be implemented as a standard server 720, or multiple times in a group of such servers. It can also be implemented as part of a rack server system 724. In addition, it can be implemented in a personal computer such as a laptop computer 722. Alternatively, components from the computing device 700 can be combined with other components in a mobile device (not shown), such as the device 750. Each of such devices can contain one or more of the computing devices 700, 750, and an entire system can be made up of multiple computing devices 700, 750 communicating with each other.

The computing device 750 includes a processor 752, a memory 764, an input/output device such as a display 754, a communication interface 766, and a transceiver 768, among other components. The device 750 can also be provided with a storage device, such as a microdrive or other device, to provide additional storage. Each of the components 750, 752, 764, 754, 766, and 768 is interconnected using various buses, and several of the components can be mounted on a common motherboard or in other manners as appropriate.

The processor 752 can execute instructions within the computing device 750, including instructions stored in the memory 764. The processor can be implemented as a chipset of chips that include separate and multiple analog and digital processors. The processor can provide, for example, for coordination of the other components of the device 750, such as control of user interfaces, applications run by the device 750, and wireless communication by the device 750.

The processor 752 can communicate with a user through a control interface 758 and a display interface 756 coupled to the display 754. The display 754 can be, for example, a TFT LCD (Thin-Film-Transistor Liquid Crystal Display) or an OLED (Organic Light Emitting Diode) display, or other appropriate display technology. The display interface 756 can include appropriate circuitry for driving the display 754 to present graphical and other information to a user. The control interface 758 can receive commands from a user and convert them for submission to the processor 752. In addition, an expansion interface 762 can be provided in communication with the processor 752, so as to enable near-area communication of the device 750 with other devices. The expansion interface 762 can provide, for example, wired communication in some implementations or wireless communication in other implementations, and multiple interfaces can also be used.

The memory 764 stores information within the computing device 750. The memory 764 can be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units. An expansion memory 774 can also be provided and connected to the device 750 through an expansion interface 774, which can include, for example, a SIMM (Single In Line Memory Module) card interface. Such expansion memory 774 can provide extra storage space for the device 750, or can also store applications or other information for the device 750. Specifically, the expansion memory 774 can include instructions to carry out or supplement the processes described above, and can also include secure information. Thus, for example, the expansion memory 774 can be provided as a security module for the device 750, and can be programmed with instructions that permit secure use of the device 750. In addition, secure applications can be provided via the SIMM cards, along with additional information, such as identifying information placed on the SIMM card in a non-hackable manner.

The memory can include, for example, flash memory and/or NVRAM memory, as discussed below. In one implementation, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 764, the expansion memory 774, memory on the processor 752, or a propagated signal that can be received, for example, over the transceiver 768 or the expansion interface 762.

The device 750 can communicate wirelessly through the communication interface 766, which can include digital signal processing circuitry where necessary. The communication interface 766 can provide for communications under various modes or protocols, such as GSM voice calls, SMS, EMS, or MMS messaging, CDMA, TDMA, PDC, WCDMA, CDMA2000, or GPRS, among others. Such communication can occur, for example, through the radio-frequency transceiver 768. In addition, short-range communication can occur, such as by using a Bluetooth, WiFi, or other such transceiver (not shown). In addition, a GPS (Global Positioning System) receiver module 770 can provide additional navigation- and location-related wireless data to the device 750, which can be used as appropriate by applications running on the device 750.

The device 750 can also communicate audibly using an audio codec 760, which can receive spoken information from a user and convert it to usable digital information. The audio codec 760 can likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of the device 750. Such sound can include sound from voice telephone calls, can include recorded sound (e.g., voice messages, music files, etc.), and can also include sound generated by applications operating on the device 750.

The computing device 750 can be implemented in a number of different forms, as shown in the figure. For example, it can be implemented as a cellular telephone 780. It can also be implemented as part of a smartphone 782, a PDA, or other similar mobile device.

Various implementations of the systems and methods described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system. The programmable system includes at least one programmable processor, which can be special or general purpose and is coupled to a storage system to receive and transmit data and instructions, at least one input device, and at least one output device.

These computer programs (also known as programs, software, software applications, or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any apparatus and/or device (e.g., magnetic disks, optical disks, memory, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and methods described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well. For example, feedback provided to the user can be any form of sensory feedback (e.g., visual, auditory, or tactile feedback), and input from the user can be received in any form, including acoustic, speech, or tactile input.

The systems and methods described here can be implemented in a computing system that includes a back-end component (e.g., a data server), a middleware component (e.g., an application server), a front-end component (e.g., a client computer having a graphical user interface (GUI) or a web browser through which a user can interact with an implementation of the systems and methods described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), and the Internet.

The computing system can include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

While this specification contains many specifics, these should not be construed as limitations on the scope of the disclosure or of what may be claimed, but rather as descriptions of features of example implementations of the disclosure. Certain features that are described in this specification in the context of separate implementations can also be provided in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be provided in multiple implementations separately or in any suitable sub-combination. Moreover, although features may be described above as acting in certain combinations and even claimed as such, one or more features from a claimed combination can in some cases be removed from the combination, and the claimed combination may be directed to a sub-combination or a variation of a sub-combination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the implementations described above should not be understood as requiring such separation in all implementations, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Thus, particular implementations of the present disclosure have been described. Other implementations are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the disclosure. For example, the various forms of the flows shown above may be used with steps reordered, added, or removed. Accordingly, other implementations are within the scope of the following claims.

104: disambiguation engine
106: speech recognition engine
108: keyword mapping engine
110: content recognition engine

Claims

A computer-implemented method comprising:
receiving, by one or more processors, (i) a spoken natural language query and (ii) audio data encoding music;
receiving, by the one or more processors, environmental image data related to the music;
determining, by the one or more processors, that one or more keywords in a transcription of the spoken natural language query are associated with a movie content type; and
identifying, by the one or more processors and based on determining that the one or more keywords in the transcription of the spoken natural language query are associated with the movie content type, a movie content item that is recognized based on the music and the environmental image data related to the music.

The method of claim 1, wherein receiving the audio data comprises receiving the audio data from a mobile computing device.

The method of claim 2, wherein receiving the audio data comprises receiving environmental audio data associated with the mobile computing device.

The method of claim 1, wherein the audio data encoding the music is generated within a predetermined period of time prior to receiving audio data encoding the spoken natural language query.

The method of claim 1, wherein determining that the one or more keywords in the transcription of the spoken natural language query are associated with the movie content type comprises identifying the one or more keywords using one or more databases that map at least one of the keywords to the movie content type.

A computer-readable medium storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising:
receiving, by one or more processors, (i) an image or video and (ii) audio data encoding a spoken natural language query;
receiving, by the one or more processors, environmental image data related to music;
determining, by the one or more processors, that one or more keywords in a transcription of the spoken natural language query are associated with a music content type; and
identifying, by the one or more processors and based on determining that the one or more keywords in the transcription of the spoken natural language query are associated with the music content type, a music content item that is recognized based on the image or video and the environmental image data related to the music.

The computer-readable medium of claim 6, wherein receiving (i) the image or video and (ii) the audio data encoding the spoken natural language query comprises receiving (i) the image or video and (ii) the audio data encoding the spoken natural language query from a mobile computing device.

The computer-readable medium of claim 6, wherein the image or video is generated within a predetermined period of time prior to receiving the audio data encoding the spoken natural language query.

The computer-readable medium of claim 6, wherein determining that the one or more keywords in the transcription of the spoken natural language query are associated with the music content type comprises identifying the one or more keywords using one or more databases that map at least one of the keywords to the music content type.

The computer-readable medium of claim 7, wherein receiving the image or video comprises receiving an environmental image or video associated with the mobile computing device.

A system comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
receiving, by one or more processors, (i) a spoken natural language query and (ii) audio data encoding music;
receiving, by the one or more processors, environmental image data related to the music;
determining, by the one or more processors, that one or more keywords in a transcription of the spoken natural language query are associated with a movie content type; and
identifying, by the one or more processors and based on determining that the one or more keywords in the transcription of the spoken natural language query are associated with the movie content type, a movie content item that is recognized based on the music and the environmental image data related to the music.

The system of claim 11, wherein receiving the audio data comprises receiving the audio data from a mobile computing device.

The system of claim 11, wherein the audio data encoding the music is generated within a predetermined period of time prior to receiving audio data encoding the spoken natural language query.

The system of claim 11, wherein determining that the one or more keywords in the transcription of the spoken natural language query are associated with the movie content type comprises identifying the one or more keywords using one or more databases that map at least one of the keywords to the movie content type.

The system of claim 12, wherein receiving the audio data comprises receiving environmental audio data associated with the mobile computing device.