JP2017532684A

JP2017532684A - System and method for language detection

Info

Publication number: JP2017532684A
Application number: JP2017520499A
Authority: JP
Inventors: ボッジャ，ニキル; ワン，ピドン; リンダー，フレドリック; プゾン，バートロミエジ
Original assignee: Machine Zone Inc
Current assignee: Machine Zone Inc
Priority date: 2014-10-17
Filing date: 2014-10-17
Publication date: 2017-11-02
Anticipated expiration: 2034-10-17
Also published as: EP3207465A1; CN107111607A; WO2016060687A1; JP6553180B2; AU2014409115A1; CA2964331A1; CN107111607B

Abstract

Implementations of the present disclosure relate to methods, systems, and computer program storage devices for detecting the language of a text message. A plurality of different language detection tests is performed on a message associated with a user. Each language detection test determines a set of scores representing the likelihood that the message is in one of a plurality of different languages. One or more combinations of the score sets are provided as input to one or more distinct classifiers. The output from each classifier includes an indication that the message is in one of the different languages. The language of the message may be identified as the language indicated by one of the classifiers, based on a confidence score and/or an identified linguistic domain.

Description

Background
This disclosure relates to language detection and, in particular, to systems and methods for detecting the language of short text messages.

In general, language detection or identification is the process of automatically detecting the language or languages present in a body of text, based on the content of the text. Language detection is useful for automatic language translation: the language of a text message generally must be known before the message can be accurately translated into another language.

Conventional language detection is usually performed on a large collection of words and sentences (i.e., at the document level). A particularly challenging domain is chat text, where messages often contain only a few words (e.g., four or fewer), some or all of which may be informal or misspelled. Given the scarcity of information in such messages and their informal vocabulary, existing language detection approaches prove inaccurate and/or slow in the chat text domain.

Overview
Embodiments of the systems and methods described in this disclosure detect the language of a message based on, for example, the content of the text message, information about the keyboard used to compose it, and/or information about the language preferences of the user who composed it. Compared to conventional language detection techniques, the systems and methods described herein are generally faster and more accurate, especially for short text messages (e.g., four words or fewer).

In various examples, the systems and methods use a plurality of language detection tests and classifiers to determine probabilities associated with the possible languages in a text message. Each language detection test can output a set or vector of probabilities associated with the possible languages. The classifiers can combine the outputs of the language detection tests to determine the most likely language of the message. The particular language detection tests and classifiers chosen for a message can depend on the predicted accuracy, the confidence score, and/or the linguistic domain of the message.

In one aspect, the invention relates to a computer-implemented method of identifying the language of a message. The method includes performing a plurality of different language detection tests on a message associated with a user, wherein each language detection test determines a respective set of scores, and each score in the set represents the likelihood that the message is in one of a plurality of different languages. The method further includes providing one or more combinations of the score sets as input to one or more distinct classifiers, and obtaining, as output from each of the one or more classifiers, a respective indication that the message is in one of the plurality of different languages, the indication including a confidence score. The method further includes identifying the language of the message as the language indicated by one of the one or more classifiers, based on at least one of the confidence score and an identified linguistic domain.

In certain examples, a particular classifier is a supervised learning model, a partially supervised learning model, an unsupervised learning model, or an interpolation. Identifying the language of the message can include selecting the indicated language based on the confidence score. Identifying the language of the message can include selecting a classifier based on the identified linguistic domain. In some examples, the linguistic domain is or includes video games, sports, news, parliamentary proceedings, politics, health, and/or travel.

In some examples, the message includes two or more of the following: letters, numbers, symbols, and emoticons. The plurality of different language detection tests can include at least two methods selected from the group consisting of a byte n-gram method, a dictionary-based method, an alphabet-based method, a script-based method, and a user language profile method. The plurality of different language detection tests may be performed simultaneously (e.g., with parallel computing). The one or more combinations can include score sets from the byte n-gram method and the dictionary-based method. The one or more combinations can further include score sets from the user language profile method and/or the alphabet-based method.

In another aspect, the invention relates to a system for identifying the language of a message. The system includes a computer storage device that stores instructions. The system also includes a data processing apparatus configured to execute the instructions to perform the following operations. The operations include performing a plurality of different language detection tests on a message associated with a user, wherein each language detection test determines a respective set of scores, and each score in the set represents the likelihood that the message is in one of a plurality of different languages. The operations further include providing one or more combinations of the score sets as input to one or more distinct classifiers, and obtaining, as output from each of the one or more classifiers, a respective indication that the message is in one of the plurality of different languages, the indication including a confidence score. The operations further include identifying the language of the message as the language indicated by one of the one or more classifiers, based on at least one of the confidence score and an identified linguistic domain.

In certain examples, a particular classifier is a supervised learning model, a partially supervised learning model, an unsupervised learning model, or an interpolation. Identifying the language of the message can include selecting the indicated language based on the confidence score. Identifying the language of the message can include selecting a classifier based on the identified linguistic domain. In some examples, the linguistic domain is or includes video games, sports, news, parliamentary proceedings, politics, health, and/or travel.

In another aspect, the invention relates to a computer program product stored in one or more storage devices for controlling a processing mode of a data processing apparatus. When executed by the data processing apparatus, the computer program product causes the data processing apparatus to perform the following operations. The operations include performing a plurality of different language detection tests on a message associated with a user, wherein each language detection test determines a respective set of scores, and each score in the set represents the likelihood that the message is in one of a plurality of different languages. The operations further include providing one or more combinations of the score sets as input to one or more distinct classifiers, and obtaining, as output from each of the one or more classifiers, a respective indication that the message is in one of the plurality of different languages, the indication including a confidence score. The operations further include identifying the language of the message as the language indicated by one of the one or more classifiers, based on at least one of the confidence score and an identified linguistic domain.

Elements of embodiments described with respect to a given aspect of the invention can be used in various embodiments of another aspect of the invention. For example, it is contemplated that features of dependent claims depending from one independent claim can be used in the apparatus and/or methods of any of the other independent claims.

FIG. 1A is a diagram of an example system for performing language detection.
FIG. 1B is a flowchart of an example method of detecting the language of a text message.
FIG. 2 is a flowchart of an example n-gram method of detecting the language of a text message.
FIG. 3 is a flowchart of an example dictionary-based method of detecting the language of a text message.
FIG. 4 is a flowchart of an example alphabet-based method of detecting the language of a text message.
FIG. 5 is a flowchart of an example script-based method of detecting the language of a text message.
FIG. 6 is a flowchart of an example user language profile method of detecting the language of a text message.
FIG. 7 is a schematic diagram of an example language detection method module.
FIG. 8 is a schematic diagram of an example classifier module.
FIG. 9 is a flowchart of an example method of detecting the language of a text message using the language detection method module of FIG. 7 and the classifier module of FIG. 8.
FIG. 10 is a flowchart of an example method of detecting the language of a text message.
FIG. 11 is a flowchart of an example method of detecting the language of a text message.
FIG. 12 is a flowchart of an example method of detecting the language of a text message.

Detailed description
In general, the language detection systems and methods described herein can be used to identify the language of a text message when the language information for the message (e.g., keyboard information from the client device) is missing, malformed, or unreliable. The systems and methods improve the accuracy of language translation methods used to translate text messages from one language to another. Language translation generally requires that the source language be identified accurately; otherwise, the resulting translation may be inaccurate.

FIG. 1A illustrates an example system 10 for detecting the language of a message, such as a text message or an audio message. A server system 12 provides message analysis and language detection functionality. The server system 12 includes software components and databases that can be deployed at one or more data centers 14 in one or more geographic locations, for example. The software components of the server system 12 include a detection method module 16, a classifier module 18, and a management module 20. The software components can comprise subcomponents that execute on the same data processing apparatus or on different individual data processing apparatus. The databases of the server system 12 include training data 22, dictionaries 24, alphabets 26, scripts 28, and user profile information 30. The databases can reside in one or more physical storage systems. The software components and databases are described further below.

An application, such as a web-based application, can be provided as an end-user application that allows users to submit messages to the server system 12. The end-user application can be accessed over a network 32 by users of client devices such as a personal computer 34, a smartphone 36, a tablet computer 38, and a laptop computer 40. Other client devices are possible. A user's message may be accompanied by information about the device used to compose it, such as information about the keyboard, the client device, and/or the operating system.

Although FIG. 1A shows the classifier module 18 and the management module 20 connected to the databases (i.e., training data 22, dictionaries 24, alphabets 26, scripts 28, and user profile information 30), the classifier module 18 and/or the management module 20 need not be connected to some or all of the databases. In general, the classifier module 18 receives its input from the detection method module 16, and the management module 20 receives its input from the classifier module 18. The classifier module 18 and/or the management module 20 need not receive any other input.

FIG. 1B illustrates an example method 100 that uses the system 10 to detect the language of a message. The method 100 begins by receiving or obtaining a text message generated by a user (step 102). The text message is analyzed using one or more language detection methods (e.g., from the detection method module 16) (step 104). Each language detection method provides an indication of the language or languages present in the message. The outputs of the language detection methods are then combined using one or more classifiers (e.g., from the classifier module 18) (step 106), which provide a further indication of the language present in the message. The one or more classifiers can include, for example, a supervised learning model, a partially supervised learning model, an unsupervised learning model, and/or an interpolation. The language of the message is then determined from the output of the one or more classifiers (e.g., using the management module 20) (step 108).
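As a minimal sketch of the combining step (step 106), the following Python function treats a classifier as a weighted linear interpolation of the score sets produced by the detection methods. The function name, the weights, and the use of the winning combined score as the confidence score are all illustrative assumptions; the disclosure does not specify them.

```python
def interpolate(score_sets, weights):
    """Combine score sets by weighted linear interpolation.

    score_sets: {method_name: {language: score}}
    weights:    {method_name: weight} (hypothetical, e.g. tuned on held-out data)
    Returns (most_likely_language, combined_score).
    """
    languages = set().union(*score_sets.values())
    total_weight = sum(weights[name] for name in score_sets)
    combined = {
        lang: sum(weights[name] * scores.get(lang, 0.0)
                  for name, scores in score_sets.items()) / total_weight
        for lang in languages
    }
    best = max(combined, key=combined.get)
    # Assumption: the winning combined score doubles as the confidence score.
    return best, combined[best]


score_sets = {
    "ngram": {"en": 0.6, "fr": 0.4},
    "dictionary": {"en": 0.2, "fr": 0.8},
}
language, confidence = interpolate(score_sets, {"ngram": 1.0, "dictionary": 3.0})
```

A learned classifier (supervised or otherwise) could take the same score sets as a feature vector instead of applying fixed weights.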

In some implementations, the management module 20 selects among the language indications from the one or more classifiers according to a computed confidence score and/or the linguistic domain. For example, a classifier can compute a confidence score that indicates how confident it is in its language prediction. Additionally or alternatively, the output of a particular classifier may be selected according to the linguistic domain associated with the user or the message. For example, if the message originated in a computer gaming environment, the output of the classifier that provides the most accurate language predictions for that environment can be selected. Likewise, if the message originated in the context of sports (e.g., a sporting event), the output of a different classifier that is better suited to the sports domain can be selected. Other possible linguistic domains include, for example, news, parliamentary proceedings, politics, health, travel, web pages, newspaper articles, and microblog messages. In general, certain language detection methods or combinations of language detection methods (e.g., from a classifier) may be more accurate for some linguistic domains than for others. In some implementations, the linguistic domain may be determined from the terminology present in the message. For example, the terminology of computer gaming can include slang commonly used by gamers.
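The selection logic described above can be sketched as follows. This is an illustrative reading, not the disclosed implementation: the `Indication` type, the classifier names, and the `preferred_by_domain` mapping are hypothetical, and the rule "use the domain's preferred classifier if one is configured, otherwise take the highest confidence score" is only one plausible policy.

```python
from dataclasses import dataclass


@dataclass
class Indication:
    classifier: str    # hypothetical classifier name
    language: str      # language indicated by that classifier
    confidence: float  # confidence score reported by that classifier


def select_language(indications, domain=None, preferred_by_domain=None):
    """Choose the message language from the classifier indications."""
    preferred = (preferred_by_domain or {}).get(domain)
    # If a classifier is configured for the identified linguistic domain, use it.
    for indication in indications:
        if preferred is not None and indication.classifier == preferred:
            return indication.language
    # Otherwise fall back to the indication with the highest confidence score.
    return max(indications, key=lambda ind: ind.confidence).language


indications = [
    Indication("games_classifier", "fr", 0.6),
    Indication("general_classifier", "es", 0.9),
]
by_domain = {"video_games": "games_classifier"}
```

With these toy inputs, a message identified as belonging to the `video_games` domain is labeled with the gaming classifier's output, while a message with no known domain is labeled by confidence alone.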

The language detection methods used by the detection method module 16 can include, for example, an n-gram method (e.g., a byte n-gram method), a dictionary-based method, an alphabet-based method, a script-based method, and a user language profile method. Other language detection methods are possible. Any of these methods can be used to detect the language present in a message. The output of each method may be, for example, a set or vector of probabilities associated with each possible language in the message. In some examples, two or more of the language detection methods can be run simultaneously using parallel computing, which can reduce computation time considerably.
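One way to run several detection methods at once is sketched below with Python's standard `concurrent.futures` module. The two toy tests are placeholders invented for this example; real tests would be the n-gram, dictionary, alphabet, script, and user-profile methods. Threads are used here for brevity; for CPU-bound tests in CPython, a `ProcessPoolExecutor` is likely the better fit.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tests_in_parallel(message, tests):
    """Run every detection test on the message concurrently.

    tests: {test_name: callable(message) -> {language: score}}
    Returns {test_name: score_set}.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(tests))) as pool:
        futures = {name: pool.submit(test, message) for name, test in tests.items()}
        return {name: future.result() for name, future in futures.items()}


def toy_ascii_test(message):
    """Placeholder test: share of alphabetic characters that are ASCII."""
    letters = [ch for ch in message if ch.isalpha()]
    if not letters:
        return {"en": 0.5, "ru": 0.5}
    ascii_share = sum(ch.isascii() for ch in letters) / len(letters)
    return {"en": ascii_share, "ru": 1.0 - ascii_share}


def toy_uniform_test(message):
    """Placeholder test: knows nothing, returns a uniform score set."""
    return {"en": 0.5, "ru": 0.5}


results = run_tests_in_parallel(
    "hello", {"ascii": toy_ascii_test, "uniform": toy_uniform_test}
)
```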

In one implementation, the byte n-gram method uses byte n-grams, rather than word or character n-grams, to detect language. The byte n-gram method is preferably trained on a mixture of byte n-grams (e.g., 1 ≤ n ≤ 4) using a naive Bayes classifier with a multinomial event model. The model preferably generalizes to data from different linguistic domains, so that its default configuration is accurate across a diverse set of domains, including newspaper articles, online games, web pages, and microblog messages. Information relevant to the language identification task can be aggregated from multiple domains.

Achieving high accuracy in language identification is relatively easy in the conventional text categorization setting, where in-domain training data is available. The task becomes harder when model parameters learned from one linguistic domain are used to classify data from a different domain. This problem can be addressed by focusing on the features that matter for the language identification task itself. The approach can be based, for example, on the concept of information gain, which was originally introduced as a splitting criterion for decision trees and was later found to be useful for feature selection in text categorization. In certain implementations, a detection score is calculated that represents the difference between the information gain with respect to language and the information gain with respect to domain. Features with a high detection score provide information about the language without providing information about the domain. For simplicity, the candidate feature set can be pruned by term-frequency-based feature selection before the information gain is calculated.
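The detection score can be illustrated with a small calculation. The sketch below assumes binary features (a given n-gram is present in a document or not) and computes the score as the information gain with respect to language minus the information gain with respect to domain; the function names and the toy data are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(has_feature, labels):
    """Reduction in label entropy from knowing whether the feature is present."""
    gain = entropy(labels)
    for value in (True, False):
        subset = [label for flag, label in zip(has_feature, labels) if flag == value]
        if subset:
            gain -= len(subset) / len(labels) * entropy(subset)
    return gain


def detection_score(has_feature, languages, domains):
    """Information gain w.r.t. language minus information gain w.r.t. domain."""
    return information_gain(has_feature, languages) - information_gain(has_feature, domains)


# Four toy documents: the feature appears only in English text, in both domains.
has_feature = [True, True, False, False]
languages = ["en", "en", "fr", "fr"]
domains = ["chat", "news", "chat", "news"]
score = detection_score(has_feature, languages, domains)
```

Here the feature separates the languages perfectly (gain of 1 bit) and says nothing about the domain (gain of 0 bits), so it receives the maximum score of 1 for this data and would be retained.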

Referring to FIG. 2, an example byte n-gram method 200 begins with training on the training data 22. For example, the method can be trained on a mixture of byte n-grams using a naive Bayes classifier with a multinomial event model. The training data 22 is preferably collected for a large number and variety of languages and adjusted so that the amount of data available per language is uniform (step 202). A portion of the training data 22 is set aside as a test set (step 204). Once the training data 22 is selected, a byte n-gram model is trained on the data 22 with appropriate smoothing and back-off techniques (step 206). Because the input features of the model are the byte streams of each input sentence and the source-language labels of those sentences are known, the model adjusts its parameters to learn the byte sequences typical of a given language. The test set that was set aside at the start is then used to predict language labels with the trained model (step 208). The accuracy of these predictions indicates how well the byte n-gram system identifies languages. In some cases, it is difficult to train a byte n-gram system for each linguistic domain by collecting data across many domains, because sufficient data is not available per domain. These byte n-gram systems are therefore typically trained to cover a generic domain rather than any specific domain. The trained model can be compiled into a program together with intermediate machine parameters (step 210). The program can serve as a general-purpose language identification system.
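The training and prediction steps can be sketched with a small, self-contained multinomial naive Bayes model over byte n-grams (1 ≤ n ≤ 4). This is a simplified illustration under stated assumptions: it uses add-one (Laplace) smoothing only, omits back-off, the test-set split, and feature selection, and the class name `ByteNgramNB` and the four training sentences are invented for the example.

```python
import math
from collections import Counter, defaultdict


def byte_ngrams(text, n_min=1, n_max=4):
    """All byte n-grams of the UTF-8 encoding of text, for n_min <= n <= n_max."""
    data = text.encode("utf-8")
    return [data[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(data) - n + 1)]


class ByteNgramNB:
    """Multinomial naive Bayes over byte n-grams with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.docs = Counter()               # language -> number of training texts
        self.vocab = set()

    def fit(self, texts, languages):
        for text, lang in zip(texts, languages):
            grams = byte_ngrams(text)
            self.counts[lang].update(grams)
            self.docs[lang] += 1
            self.vocab.update(grams)
        return self

    def predict_proba(self, text):
        grams = byte_ngrams(text)
        total_docs = sum(self.docs.values())
        log_scores = {}
        for lang, counts in self.counts.items():
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(self.docs[lang] / total_docs)  # log prior
            for gram in grams:
                score += math.log((counts[gram] + self.alpha) / denom)
            log_scores[lang] = score
        # Normalize in log space to obtain a probability vector.
        top = max(log_scores.values())
        norm = sum(math.exp(s - top) for s in log_scores.values())
        return {lang: math.exp(s - top) / norm for lang, s in log_scores.items()}


model = ByteNgramNB().fit(
    ["hello how are you", "good morning friend",
     "bonjour comment allez vous", "merci beaucoup mon ami"],
    ["en", "en", "fr", "fr"],
)
probabilities = model.predict_proba("hello friend")
```

Working on bytes rather than characters means the same code handles any script without tokenization, which is one reason byte n-grams suit short, noisy chat text.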

In general, the dictionary-based language detection method counts the number of tokens or words that belong to each language by looking the words up in a dictionary or word list associated with that language. The language with the most words in the message is selected as the most likely language. When several languages tie for most likely, the more frequently or commonly used of those languages can be chosen. The language dictionaries can be stored in the dictionaries database 24.

FIG. 3 is a flowchart of an example dictionary-based language detection method 300. A text message is provided (step 302), and a set of possible languages for the text message is identified (step 304). A first possible language is then selected from the set (step 306). The words in the text message that appear in the dictionary for that language are counted (step 308). If the set contains additional possible languages that have not yet been considered (step 310), a new possible language is selected (step 312) and step 308 is repeated. Once all possible languages in the set have been considered, the language with the most words in the text message may be identified as the language of the message (step 314). Alternatively or additionally, the method can be used to compute the probability that each language in the set is present in the message. For example, the output of the dictionary-based method may be a vector of probabilities, one for each language in the set.
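The loop of steps 306 through 314, together with the tie-breaking rule mentioned earlier, can be sketched as follows. The whitespace tokenizer, the toy word lists, and the `usage_rank` list (languages ordered from most to least commonly used) are assumptions for illustration only.

```python
def detect_by_dictionary(message, dictionaries, usage_rank):
    """Return (most_likely_language, {language: word_count}).

    dictionaries: {language: set of known words, including informal ones}
    usage_rank:   all languages, ordered from most to least commonly used;
                  used only to break ties (hypothetical input).
    """
    tokens = message.lower().split()
    counts = {}
    for lang, words in dictionaries.items():                     # steps 306, 310, 312
        counts[lang] = sum(1 for token in tokens if token in words)  # step 308
    best = max(counts.values())
    tied = [lang for lang, count in counts.items() if count == best]
    tied.sort(key=usage_rank.index)                              # prefer the more common language
    return tied[0], counts                                       # step 314


DICTIONARIES = {
    "en": {"hello", "lol", "no", "you"},
    "es": {"hola", "lol", "no", "tu"},
}
USAGE_RANK = ["en", "es"]
language, counts = detect_by_dictionary("hola no", DICTIONARIES, USAGE_RANK)
```

Dividing each count by the sum of all counts would turn the same result into a probability vector for use by a downstream classifier.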

To ensure the accuracy of the dictionary-based language detection method, particularly for short sentences, it is preferable to use dictionaries that include informal or chat words (e.g., abbreviations, acronyms, slang, and profanity) in addition to formal words. Informal words are common in short text communications and chat rooms. The dictionaries are preferably extended on an ongoing basis to include new informal words as they are coined and come into use.

The alphabet-based method is generally based on character counts for the alphabet of each language and relies on the observation that many languages have a distinctive alphabet or character set. For example, Russian, English, Korean, and Japanese each use a different alphabet. Although the alphabet-based method cannot accurately distinguish between some languages (e.g., languages that share a similar alphabet, such as the Latin alphabet), it can generally detect certain languages quickly. In some cases, it is preferable to use the alphabet-based method in combination with one or more other language detection methods (e.g., using a classifier), as described herein. The alphabets of the languages are stored in the alphabets database 26.

FIG. 4 is a flowchart of an example alphabet-based language detection method 400. A text message is provided (step 402), and a set of possible languages for the text message is identified (step 404). A first possible language is then selected from the set (step 406). The characters in the text message that appear in the alphabet of that language are counted (step 408). If the set contains additional possible languages that have not yet been considered (step 410), a new possible language is selected (step 412) and step 408 is repeated. Once all possible languages in the set have been considered, the language with the most characters in the text message may be identified as the language of the message (step 414). Alternatively or additionally, the alphabet-based method can be used to compute the probability that each language in the set is present in the message. For example, the output of the alphabet-based method may be a vector of probabilities, one for each language in the set.
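A sketch of the probability-vector variant of this method is shown below. Normalizing the per-alphabet character counts so that they sum to one is an assumption (the text only says that a probability vector may be output), and the two alphabets are a toy stand-in for the alphabets database.

```python
def alphabet_scores(message, alphabets):
    """Return {language: probability} from per-alphabet character counts."""
    text = message.lower()
    counts = {
        lang: sum(1 for ch in text if ch in letters)  # step 408
        for lang, letters in alphabets.items()        # steps 406, 410, 412
    }
    total = sum(counts.values())
    if total == 0:
        # No character matched any alphabet: fall back to a uniform vector.
        return {lang: 1.0 / len(counts) for lang in counts}
    return {lang: count / total for lang, count in counts.items()}


ALPHABETS = {
    "en": set("abcdefghijklmnopqrstuvwxyz"),
    "ru": set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя"),
}
scores = alphabet_scores("Привет ok", ALPHABETS)
```

For this message, six characters match the Russian alphabet and two match the English alphabet, so Russian receives 0.75 and English 0.25.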

Script-based language detection methods generally determine the character count for each script (e.g., Latin script, CJK script, etc.) that may be present in a message. Script-based methods rely on the observation that different languages may use different scripts (e.g., Chinese and English). The method preferably uses a map that associates each script with a list of the languages that use it. For example, the map may consider the Unicode values of the characters or symbols present in the message, and these Unicode values may be mapped to a corresponding language or set of possible languages for the message. Language scripts and Unicode values or ranges may be stored in the script database 28.

Referring to FIG. 5, in an exemplary script-based method 500, a text message is provided (step 502) and the scripts present in the message are identified (step 504). The number of characters in each script is then counted (step 506). The script with the highest character count is considered the most likely script (step 508), and the languages corresponding to the most likely script are identified (step 510). If the most likely script corresponds to only one language, that language is considered the most likely language. If the most likely script corresponds to multiple languages, additional language detection methods can be used for further detection. In some implementations, the output of the script-based method is a set of probabilities (e.g., in vector form), one for each language that may be present in the message.
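A minimal sketch of method 500 is shown below. The code-point ranges and the script-to-language map are hypothetical and approximate stand-ins for the contents of the script database 28; a production system would rely on the full Unicode script property instead.

```python
# Hypothetical, approximate code-point ranges per script (illustration only).
SCRIPT_RANGES = {
    "Latin": [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x024F)],
    "Cyrillic": [(0x0400, 0x04FF)],
    "Hangul": [(0xAC00, 0xD7AF)],
    "Kana": [(0x3040, 0x30FF)],
    "Han": [(0x4E00, 0x9FFF)],
}

# Hypothetical map from a script to the languages that use it.
SCRIPT_LANGUAGES = {
    "Latin": ["en", "fr", "pl"],
    "Cyrillic": ["ru", "uk", "bg"],
    "Hangul": ["ko"],
    "Kana": ["ja"],
    "Han": ["zh", "ja"],
}


def script_of(ch):
    """Return the script a character falls into, or None if unmapped."""
    code_point = ord(ch)
    for script, ranges in SCRIPT_RANGES.items():
        if any(lo <= code_point <= hi for lo, hi in ranges):
            return script
    return None


def detect_by_script(message):
    """Steps 504-510: count characters per script, pick the script with the
    most characters, and return it with its candidate languages."""
    counts = {}
    for ch in message:
        script = script_of(ch)
        if script is not None:
            counts[script] = counts.get(script, 0) + 1
    if not counts:
        return None, []
    best = max(counts, key=counts.get)
    return best, SCRIPT_LANGUAGES[best]
```

When the returned list holds a single language, that language can be taken as the result; when it holds several, the list narrows the candidates passed to the other detection methods.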

Methods based on user language profiles use a user profile database 30 that stores past messages sent by various users. The languages of these stored messages are detected, for example, using one or more of the other language detection methods described herein (e.g., the byte n-gram method), thereby identifying the languages used by each user. For example, if all of a user's past messages are in Spanish, the user's language profile can indicate that the user's preferred language is Spanish. Similarly, if a user's past messages are in a mixture of languages, the user's language profile can indicate the probability associated with each language (e.g., 80% English, 15% French, 5% Spanish). In general, methods based on user language profiles address the language detection problems associated with very short messages, which often do not contain enough information for an accurate language determination. In such cases, the user's language preference can be used to predict the language of the user's message, on the assumption that the user will continue to use the languages the user has used previously.

Referring to FIG. 6, an exemplary user language profile detection method 600 stores a user's past messages (step 602) and detects the languages present in the stored messages (step 604). The frequency with which the different languages appear in the user's messages is determined (step 606) and output (step 608).
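The frequency computation of method 600 can be sketched as follows. The `detect_language` argument is a hypothetical callable (message in, language code out) standing in for whichever detection method is applied to the stored messages.

```python
from collections import Counter


def build_language_profile(past_messages, detect_language):
    """Steps 604-608: detect the language of each stored message and return
    the relative frequency of each language as the user's profile.

    detect_language is a hypothetical callable: message -> language code.
    """
    counts = Counter(detect_language(message) for message in past_messages)
    total = sum(counts.values())
    if total == 0:
        return {}  # no history: the profile carries no information
    return {lang: n / total for lang, n in counts.items()}
```

A profile such as `{"en": 0.8, "fr": 0.15, "es": 0.05}` can then be used like any other probability vector when outputs are combined.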

Referring to FIG. 7, various language detection methods can be incorporated into the detection method module 16. A text message can be input to the detection method module 16, and one or more language detection methods can identify the language of the message. For example, each language detection method can provide a vector of probabilities, in which each probability is associated with a possible language of the message and represents the likelihood that the message is in that language. Because the methods differ and because the information available for a message varies, the probabilities from the individual language detection methods may not agree. The detection method module 16 can include or utilize, for example, an n-gram module 702 for executing an n-gram detection method (e.g., the byte n-gram detection method 200), a dictionary-based module 704 for executing the dictionary-based method 300, an alphabet-based module 706 for executing the alphabet-based method 400, a script-based module 708 for executing the script-based method 500, and a user language profile module 710 for executing the user language profile method 600. Additional language detection methods can be incorporated into the detection method module 16 as needed. Some known methods include the use of word-level n-grams, Markov models, and predictive modeling techniques.

The outputs of the various language detection methods in the detection method module 16 can be combined using the classifier module 18. Referring to FIG. 8, the classifier module 18 can include an interpolation module 802, a support vector machine (SVM) module 804, and a linear SVM module 806.

The interpolation module 802 is used to perform a linear interpolation of the results from two or more language detection methods. For example, the language of a text message can be determined by interpolating the results of the byte n-gram method and the dictionary-based method. For the chat message "lol gtg", the byte n-gram method may determine that the likelihood of English is 0.3, the likelihood of French is 0.4, and the likelihood of Polish is 0.3 (i.e., the output of the byte n-gram method is {en:0.3, fr:0.4, pl:0.3}). The dictionary-based method may determine that the likelihood of English is 0.1, the likelihood of French is 0.2, and the likelihood of Polish is 0.7 (i.e., the output of the dictionary-based method is {en:0.1, fr:0.2, pl:0.7}). To interpolate the results of these two methods, the output of the byte n-gram method is multiplied by a first weight and the output of the dictionary-based method is multiplied by a second weight, where the first weight and the second weight sum to 1. The weighted outputs of the two methods are then added together. For example, if the byte n-gram result is given a weight of 0.6, the dictionary-based result is given a weight of 0.4. The interpolation of the two methods is then {en:0.3, fr:0.4, pl:0.3}*0.6 + {en:0.1, fr:0.2, pl:0.7}*0.4 = {en:0.22, fr:0.32, pl:0.46}.
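The "lol gtg" calculation above can be reproduced with a few lines of code. This is a sketch only; the function name `interpolate` and the dictionary representation of the probability vectors are assumptions made for illustration.

```python
def interpolate(dist_a, dist_b, weight_a):
    """Linearly interpolate two language distributions.

    weight_a is applied to dist_a and (1 - weight_a) to dist_b,
    so the two weights always sum to 1.
    """
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be between 0 and 1")
    weight_b = 1.0 - weight_a
    languages = sorted(set(dist_a) | set(dist_b))
    return {
        lang: weight_a * dist_a.get(lang, 0.0) + weight_b * dist_b.get(lang, 0.0)
        for lang in languages
    }


# Values taken from the "lol gtg" example in the text.
ngram = {"en": 0.3, "fr": 0.4, "pl": 0.3}
dictionary = {"en": 0.1, "fr": 0.2, "pl": 0.7}

combined = interpolate(ngram, dictionary, 0.6)
best = max(combined, key=combined.get)
```

Up to floating-point rounding, `combined` holds 0.22 for English, 0.32 for French, and 0.46 for Polish, so Polish is selected even though the byte n-gram method alone favored French.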

In general, the optimal weights for interpolating two or more values can be determined by trial and error. By trying different weights on a given set of messages, the best set of weights can be identified. In some cases, the weights may be a function of the number of words or characters in the message. Alternatively or additionally, the weights may depend on the language domain of the message. For example, the optimal weights for a gaming environment may differ from the optimal weights for a sports environment. For the combination of the byte n-gram method and the dictionary-based method, good results can be obtained using a weight of 0.1 for the byte n-gram method and a weight of 0.9 for the dictionary-based method.
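One way to carry out the trial-and-error search is a simple grid search over candidate weights on messages whose language is known. The sketch below assumes a hypothetical sample format of (byte n-gram output, dictionary output, true language) and scores each weight by accuracy; the source does not prescribe a particular search procedure.

```python
def best_weight(samples, steps=10):
    """Try weights 0, 1/steps, ..., 1 for the first method and return the
    first weight with the highest accuracy, together with that accuracy.

    samples: list of (ngram_dist, dict_dist, true_language) tuples
    (a hypothetical format chosen for this illustration).
    """
    best_w, best_correct = 0.0, -1
    for i in range(steps + 1):
        w = i / steps
        correct = 0
        for ngram_dist, dict_dist, truth in samples:
            languages = sorted(set(ngram_dist) | set(dict_dist))
            combined = {
                lang: w * ngram_dist.get(lang, 0.0)
                + (1.0 - w) * dict_dist.get(lang, 0.0)
                for lang in languages
            }
            if max(combined, key=combined.get) == truth:
                correct += 1
        if correct > best_correct:
            best_w, best_correct = w, correct
    return best_w, best_correct / len(samples)
```

Running the search separately per language domain or per message-length bucket would yield the domain-dependent or length-dependent weights mentioned above.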

The SVM module 804 may be or include a supervised learning model that analyzes language data and recognizes language patterns. The SVM module 804 may be, for example, a multi-class SVM classifier. For an English SVM classifier, the feature vector may be the concatenation of the two distributions above (i.e., {en:0.3, fr:0.4, pl:0.3, en:0.1, fr:0.2, pl:0.7}). The SVM classifier is preferably trained on labeled training data, and the trained model acts as a predictor for an input. The features selected for language detection may be, for example, sequences of bytes, words, or phrases. The input training vectors can be mapped into a multi-dimensional space. The SVM algorithm can then use a kernel to identify the optimal separating hyperplane between these dimensions, which gives the algorithm considerable power to predict, in this case, the language. The kernel may be, for example, a linear kernel, a polynomial kernel, or a radial basis function (RBF) kernel. The preferred kernel for the SVM classifier is the RBF kernel. After the SVM classifier has been trained on the training data, it can be used to output the most likely language among all possible languages.

The training data may be or include, for example, the output vectors from the different language detection methods together with an indication of the correct language, for a large number of messages having different message lengths, language domains, and/or languages. The training data can include many messages for which the language of each message is known.
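The feature vector and training data described above can be assembled as in the following sketch. The fixed language ordering `LANGUAGES` and both function names are assumptions introduced for illustration; the example covers only data preparation and does not train a classifier.

```python
LANGUAGES = ["en", "fr", "pl"]  # hypothetical fixed ordering of candidates


def feature_vector(method_outputs, languages=LANGUAGES):
    """Concatenate each method's per-language probabilities in a fixed
    order, e.g. [ngram en, fr, pl, dictionary en, fr, pl]."""
    return [dist.get(lang, 0.0) for dist in method_outputs for lang in languages]


def training_rows(labeled_examples, languages=LANGUAGES):
    """Build (features, labels) from messages whose language is known.

    labeled_examples: iterable of (list of method output dicts, language).
    """
    features, labels = [], []
    for outputs, label in labeled_examples:
        features.append(feature_vector(outputs, languages))
        labels.append(label)
    return features, labels
```

The resulting `features` and `labels` lists have the shape that general-purpose SVM libraries typically expect; for instance, they could be passed to scikit-learn's `SVC(kernel="rbf")`, which is mentioned here only as one possible choice and is not named in the source.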

The linear SVM module 806 may be or include a large-scale linear classifier. An SVM classifier with a linear kernel can perform better than other linear classifiers, such as linear regression. The linear SVM module 806 differs from the SVM module 804 at the kernel level. In some cases a polynomial model performs better than a linear model, and in other cases the reverse is true. The optimal kernel may depend on the language domain of the message data and/or the nature of the data.

Other classifiers that can be used with the systems and methods described herein include, for example, decision tree learning, association rule learning, artificial neural networks, inductive logic programming, random forests, clustering, Bayesian networks, reinforcement learning, representation learning, similarity and metric learning, and sparse dictionary learning. One or more of these or other classifiers can be incorporated into the classifier module 18 and/or form part of the classifier module 18.

Referring to FIG. 9, an exemplary method 900 uses the detection method module 16, the classifier module 18, and the management module 20 to detect the language of a message. The message is provided or supplied to the detection method module 16 (step 902). The message may be accompanied by information about the message and/or information about the user who created it. The information may include, for example, a user identification number, information about the keyboard the user used to create the message, and/or information about the operating system controlling the software the user used to create the message. For example, the message may include data indicating that the user created the message with a French keyboard and that the user's operating system is in English.

One or more language detection methods in the detection method module 16 are used to detect the language of the message (step 904). Each method used by the detection method module 16 can output a prediction regarding the language present in the message. The prediction may be a vector containing a probability for each language that may be present in the message.

The output of the detection method module 16 is then supplied to the classifier module 18, which can combine the results from two or more language detection methods (step 906). Various combinations of the language detection results can be obtained in this way. In one example, the results from the byte n-gram method and the dictionary-based method are combined in the classifier module 18 by interpolation. In another example, an SVM combination or classification is performed on the results from the byte n-gram method, the dictionary-based method, the alphabet-based method, and the user language profile method. Alternatively or additionally, the combination may include or take into account the results of the script-based method. A further example includes a large linear combination of the byte n-gram method, the user language profile method, and the dictionary-based method. In general, however, the results of any two or more language detection methods can be combined in the classifier module 18.

The method 900 uses the management module 20 to select the output of a particular classifier (step 908). The output may be selected based on, for example, a confidence score calculated by the classifier, the expected language detection accuracy, and/or the language domain of the message. The most likely language is then determined from the output of the selected classifier (step 910).
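Steps 908 and 910 can be sketched as a selection by confidence score, which is only one of the selection criteria named above. The input format, a mapping from a classifier name to a (distribution, confidence) pair, is a hypothetical choice made for this illustration.

```python
def select_output(classifier_outputs):
    """Step 908: pick the classifier with the highest confidence score.
    Step 910: read the most likely language from its output.

    classifier_outputs: dict of name -> (language distribution, confidence),
    a hypothetical format chosen for this sketch.
    """
    name = max(classifier_outputs, key=lambda n: classifier_outputs[n][1])
    distribution, confidence = classifier_outputs[name]
    language = max(distribution, key=distribution.get)
    return name, language, confidence
```

A selection keyed on expected accuracy or on the language domain would replace the `max` over confidence with a lookup in a table of per-domain preferences.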

In some examples, the systems and methods described herein select the language detection method according to the length of the message. For example, referring to FIG. 10, a method 1000 includes receiving or providing a message that includes information about the keyboard language used to create the message (step 1002). If the message is longer than a threshold length (e.g., 25 bytes or 25 characters) (step 1004), the language can be detected using the byte n-gram method (or another method or combination of methods) (step 1006). The language of the message can then be selected based on the result from the byte n-gram method (step 1008). If, on the other hand, the message is no longer than the threshold length, the system can determine whether the keyboard language is available (step 1010). If the keyboard language is available, the language of the message may be selected to be the same as the keyboard language (step 1012). Alternatively, if the keyboard language is not available, the method 1000 may again consider the length of the message. For example, if the message length is less than a second threshold (e.g., 4 bytes or 4 characters) (step 1014), the language can be detected and selected using the dictionary-based method (step 1016). If the message length is greater than the second threshold, the byte n-gram method (or another method or combination of methods) can be used to detect the language of the message (step 1018). The results from the byte n-gram method and the dictionary-based method may be combined (e.g., using an interpolator or other classifier), and the language of the message may be determined based on the combination (step 1020).
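The branching of method 1000 can be sketched as follows. The detector and combiner arguments are hypothetical callables; treating a message whose length equals the second threshold like a longer one, and always combining the n-gram and dictionary results in the final branch, are assumptions about details the text leaves open.

```python
def most_likely(distribution):
    """Return the language with the highest probability."""
    return max(distribution, key=distribution.get)


def detect_by_length(message, keyboard_language, ngram_detect, dictionary_detect,
                     combine, long_threshold=25, short_threshold=4):
    """Choose a detection strategy from the message length (FIG. 10).

    ngram_detect / dictionary_detect: hypothetical callables returning a
    language distribution; combine merges two such distributions.
    """
    length = len(message)
    if length > long_threshold:                      # steps 1004-1008
        return most_likely(ngram_detect(message))
    if keyboard_language is not None:                # steps 1010-1012
        return keyboard_language
    if length < short_threshold:                     # steps 1014-1016
        return most_likely(dictionary_detect(message))
    combined = combine(ngram_detect(message),        # steps 1018-1020
                       dictionary_detect(message))
    return most_likely(combined)
```

The thresholds default to the example values of 25 and 4 characters given in the text; a byte-based variant would measure `len(message.encode("utf-8"))` instead.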

FIG. 11 shows an exemplary method 1100 for identifying the language of a text message. A text message created on a user's client device is received or provided (step 1102). An alphabet-based method and/or a script-based method is used to determine the alphabet and/or script associated with the text message (step 1104). A candidate language associated with the alphabet and/or script is identified. If the candidate language is a language with a unique alphabet and/or script (e.g., Russian, Arabic, Hebrew, Greek, Chinese, Taiwanese, Japanese, or Korean) (step 1106), the candidate language is determined to be the language of the text message (step 1108).

If, on the other hand, the candidate language is not a language with a unique alphabet and/or script, the length of the text message is evaluated. If the message length is less than a threshold length (e.g., 4 bytes or 4 characters) and the text message includes or is accompanied by the keyboard language used by the client device (step 1110), the keyboard language is selected as the language of the message (step 1112).

Alternatively, if the message length is greater than the threshold length or the keyboard language is not available, the message is processed with an n-gram method (e.g., the byte n-gram method) to identify a first set of possible languages for the text message (step 1114). The message is then processed with the dictionary-based method to identify a second set of possible languages for the text message (step 1116). If a user language profile exists (step 1118), the first set of possible languages, the second set of possible languages, and the user language profile (1120) are combined (e.g., using an SVM classifier or a large linear classifier) to obtain a first combination of possible languages (step 1122). The language of the text message is then selected based on the first combination of possible languages (step 1124). If, on the other hand, no user language profile is available, the first set of possible languages and the second set of possible languages are combined (e.g., using a linear interpolator or other classifier) to obtain a second combination of possible languages (step 1126). Finally, the language of the text message is selected based on the second combination of possible languages (step 1128).

In some examples, language detection is performed by combining the outputs of multiple language detection methods in two or more steps. For example, a first step may use an alphabet- and script-based method to detect special languages that use a unique alphabet or script, such as Chinese (cn), Japanese (ja), Korean (ko), Russian (ru), Hebrew (he), Greek (el), and Arabic (ar). If necessary, a second step may use a combination (e.g., from a classifier) of multiple detection methods (e.g., the byte n-gram method, the user language profile method, and the dictionary-based method) to detect other languages present in the message (e.g., Latin-alphabet languages).
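The two-step approach can be sketched as a fast script check followed by a fallback. The range table below is hypothetical and simplified: it treats each listed Unicode block as belonging to a single language, which follows the text's examples but does not hold for every language written in those scripts. The `fallback_detect` callable stands in for the classifier-based second step.

```python
# Hypothetical, simplified table: one Unicode block per "unique script"
# language, following the examples in the text.
UNIQUE_SCRIPT_RANGES = {
    "ru": (0x0400, 0x04FF),  # Cyrillic
    "el": (0x0370, 0x03FF),  # Greek
    "he": (0x0590, 0x05FF),  # Hebrew
    "ar": (0x0600, 0x06FF),  # Arabic
    "ko": (0xAC00, 0xD7AF),  # Hangul syllables
}


def detect_two_step(message, fallback_detect):
    """Step 1: return a unique-script language if its characters appear.
    Step 2: otherwise defer to fallback_detect (a hypothetical callable
    wrapping the combined n-gram / dictionary / profile classifier)."""
    counts = {}
    for ch in message:
        code_point = ord(ch)
        for lang, (lo, hi) in UNIQUE_SCRIPT_RANGES.items():
            if lo <= code_point <= hi:
                counts[lang] = counts.get(lang, 0) + 1
    if counts:
        return max(counts, key=counts.get)
    return fallback_detect(message)
```

Because the first step only scans code points, the more expensive second step runs only for messages that the script check cannot resolve.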

In some examples, a message provided or received for language detection contains certain digits, characters, or images (e.g., emoticons or emoji) that are not specific to any particular language and/or that any user can recognize regardless of language preference. When performing language detection, the systems and methods described herein can ignore such characters or images, or messages that contain only such characters or images.
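A simple filter for such messages is sketched below. It relies on an assumption not stated in the source: a message is treated as language-neutral when it contains no character in a Unicode letter category, so digits, punctuation-only emoticons, and emoji are all skipped.

```python
import unicodedata


def has_linguistic_content(message):
    """Return True if the message contains at least one letter.

    Unicode general categories starting with "L" are letters; digits
    ("Nd"), punctuation ("P*"), and emoji ("So") do not qualify.
    """
    return any(unicodedata.category(ch).startswith("L") for ch in message)


def strip_neutral_characters(message):
    """Keep only letters and whitespace, dropping digits, symbols, and emoji."""
    return "".join(
        ch for ch in message
        if ch.isspace() or unicodedata.category(ch).startswith("L")
    ).strip()
```

Text emoticons that contain letters, such as ":D", pass this check, so a deployed filter would likely also need a list of known emoticons.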

FIG. 12 is a flowchart illustrating an exemplary method 1200 for detecting the language of a message. The method uses the detection method module 16, the classifier module 18, and the management module 20 to identify the most likely or best language 1202 for a given input message 1204. The input message 1204 can be accompanied by information about the user or about the system used to create the message. For example, the input message 1204 may include a user identification number (or other user identifier), information about the keyboard used to create the message (e.g., the keyboard language), and/or information about the operating system used to create the message (e.g., the operating system language).

In the illustrated exemplary method 1200, the detection method module 16 includes ten different language detection methods. Three of the language detection methods in the detection method module 16 are byte n-gram A 1206, byte n-gram B 1208, and byte n-gram C 1210. These are all byte n-gram methods and can be configured to detect different sets or numbers of languages. For example, byte n-gram A 1206 may be configured to detect 97 languages, byte n-gram B 1208 may be configured to detect 27 languages, and byte n-gram C 1210 may be configured to detect 20 languages. Two of the language detection methods in the detection method module 16 are dictionary-based methods, dictionary A 1212 and dictionary B 1214, which can be configured to detect different sets or numbers of languages. For example, dictionary A 1212 may be configured to detect 9 languages, and dictionary B 1214 may be configured to detect 10 languages. Two of the language detection methods in the detection method module 16 are user language profile methods, language profile A 1216 and language profile B 1218, which can be configured to detect different sets or numbers of languages. For example, language profile A 1216 may be configured to detect 20 languages, and language profile B 1218 may be configured to detect 27 languages. Two of the language detection methods in the detection method module 16 are alphabet-based methods, alphabet A 1220 and alphabet B 1222, which can be configured to detect different sets or numbers of languages. For example, alphabet A 1220 may be configured to detect 20 languages, and alphabet B 1222 may be configured to detect 27 languages. The detection method module 16 further includes a script-based language detection method 1224.

The outputs of the different language detection methods in the detection method module 16 are combined and processed by the classifier module 18. For example, an interpolation classifier 1226 combines the outputs of byte n-gram B 1208 and dictionary B 1214. The interpolation weight may be, for example, 0.1 for byte n-gram B 1208 and 0.9 for dictionary B 1214. The classifier module 18 can also use an SVM classifier 1228 that combines the outputs of byte n-gram C 1210, dictionary B 1214, language profile B 1218, and alphabet B 1222. The classifier module 18 can also use a first combination 1230 of the script-based method 1224 with an SVM classifier combination of byte n-gram C 1210, dictionary A 1212, language profile A 1216, and alphabet A 1220. In addition, the classifier module 18 can use a second combination 1232 of the script-based method 1224 with a linear SVM classifier combination of byte n-gram C 1210, dictionary A 1212, and language profile A 1216. Although FIG. 12 shows particular language detection tests, classifiers, and combinations of detection test outputs used in the classifier module 18, other language detection tests, classifiers, and/or combinations can also be used.

In both the first combination 1230 and the second combination 1232, the script-based method 1224 and the classifier can be used in a staged approach. For example, the script-based method 1224 can be used first to quickly identify languages that have a unique script. If the language of the message 1204 is identified in this way, there is no need to use the SVM classifier of the first combination 1230 or the linear SVM classifier of the second combination 1232.

In general, the management module 20 can identify the language of the message 1204 by selecting a particular language detection method, classifier, and/or combination of detection method outputs. The management module 20 can make this selection according to the language domain or according to the predicted language of the message. For example, the management module 20 can select a particular classifier according to the confidence score determined by the classifier, such as by selecting the output of the classifier with the highest predicted confidence score.

In certain implementations, the systems and methods described herein are suitable for providing language detection as a service to multiple users. Such a service is made possible by the speed with which the systems and methods identify languages, and/or is enhanced by their ability to handle multiple identification techniques at runtime based on service requests from diverse clients.

Embodiments of the subject matter and the operations described in this disclosure can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this disclosure and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this disclosure can be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a computer storage medium for execution by, or to control the operation of, a data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, for example a machine-generated electrical signal, that is generated to encode information for transmission to a suitable receiver for execution by a data processing apparatus. A computer storage medium can be, or be included in, a computer-readable storage device, a computer-readable storage substrate, a random or serial access memory array or device, or a combination of one or more of them. Moreover, while a computer storage medium is not a propagated signal, it can be a source or destination of computer program instructions encoded in an artificially generated propagated signal. The computer storage medium can also be, or be included in, one or more separate physical components or media (for example, multiple CDs, disks, or other storage devices).

The operations described in this disclosure can be implemented as operations performed by a data processing apparatus on data stored on one or more computer-readable storage devices or received from other sources.

The term “data processing apparatus” encompasses all kinds of apparatus, devices, and machines for processing data, including, by way of example, a programmable processor, a computer, a system on a chip, or multiple ones or combinations of the foregoing. The apparatus can include special-purpose logic circuitry, for example an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit). In addition to hardware, the apparatus can include code that creates an execution environment for the computer program in question, for example code that constitutes processor firmware, a protocol stack, a database management system, an operating system, a cross-platform runtime environment, a virtual machine, or a combination of one or more of them. The apparatus and execution environment can realize a variety of computing model infrastructures, such as web services, distributed computing infrastructures, and grid computing infrastructures.

A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages and declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, object, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (for example, one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (for example, files that store one or more modules, subprograms, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

The processes and logic flows described in this disclosure can be performed by one or more programmable processors executing one or more computer programs to perform actions by operating on input data and generating output. The processes and logic flows can also be performed by, and the apparatus can also be implemented as, special-purpose logic circuitry, for example an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit).

Processors suitable for the execution of a computer program include, by way of example, both general-purpose and special-purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor receives instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for performing actions in accordance with instructions and one or more memory devices for storing instructions and data. Generally, a computer also includes, or is operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, for example magnetic disks, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, for example a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device (for example, a universal serial bus (USB) flash drive). Devices suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, by way of example, semiconductor memory devices such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special-purpose logic circuitry.

To provide for interaction with a user, embodiments of the subject matter described in this disclosure can be implemented on a computer having a display device, for example a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user, and a keyboard and a pointing device, for example a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well. For example, feedback provided to the user can be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback, and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user, for example by sending web pages to a web browser on the user's client device in response to requests received from the web browser.

Embodiments of the subject matter described in this disclosure can be implemented in a computing system that includes a back-end component, for example a data server; a middleware component, for example an application server; a front-end component, for example a client computer having a graphical user interface or a web browser through which a user can interact with an implementation of the subject matter described in this disclosure; or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, for example a communication network. Examples of communication networks include a local area network (LAN), a wide area network (WAN), an inter-network (for example, the Internet), and peer-to-peer networks (for example, ad hoc peer-to-peer networks).

The computing system can include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data (for example, an HTML page) to a client device (for example, for purposes of displaying data to, and receiving user input from, a user interacting with the client device). Data generated at the client device (for example, a result of the user interaction) can be received from the client device at the server.

Although this disclosure contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this disclosure in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations, one or more features of a described combination can be removed from that combination, and the described combination may be modified into a subcombination.

Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. For example, parallel processing can be used to run multiple language detection methods simultaneously. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
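
Running several language detection methods at the same time, as mentioned above, could look like the following sketch. It is an illustration only: `run_tests_in_parallel`, the `ScoreSet` alias, and the two toy tests are hypothetical stand-ins, and a thread pool is just one of several possible parallel-processing choices.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

ScoreSet = Dict[str, float]  # language code -> likelihood score

def run_tests_in_parallel(message: str,
                          tests: Dict[str, Callable[[str], ScoreSet]]
                          ) -> Dict[str, ScoreSet]:
    """Run every detection test on the same message concurrently."""
    with ThreadPoolExecutor(max_workers=max(1, len(tests))) as pool:
        # Submit all tests first so they can overlap, then collect results.
        futures = {name: pool.submit(test, message)
                   for name, test in tests.items()}
        return {name: future.result() for name, future in futures.items()}

def toy_alphabet_test(message: str) -> ScoreSet:
    # Toy stand-in: the letter "ñ" is taken as a hint for Spanish.
    if "ñ" in message:
        return {"es": 0.9, "en": 0.1}
    return {"es": 0.3, "en": 0.7}

def toy_dictionary_test(message: str) -> ScoreSet:
    # Toy stand-in: a single known English word.
    if "hello" in message.lower():
        return {"en": 0.8, "es": 0.2}
    return {"en": 0.5, "es": 0.5}
```

The returned dictionary holds one score set per test, ready to be combined and passed to a classifier. In CPython, threads mainly help when the tests wait on I/O; for CPU-bound tests a process pool could be substituted with the same submit-and-collect structure.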

Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. In addition, the processes depicted in the accompanying drawings do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.

Claims

A computer-implemented method for identifying the language of a message, the method comprising:
performing a plurality of different language detection tests on a message associated with a user, each language detection test determining a score set, each score in the score set representing a likelihood that the message is in one of a plurality of different languages;
providing one or more combinations of the score sets as input to one or more different classifiers;
obtaining, as output from each of the one or more classifiers, an indication that the message is in one of the plurality of different languages, the indication comprising a confidence score; and
identifying the language of the message as the language indicated by one of the one or more classifiers, based on at least one of the confidence score and an identified language domain.

The method of claim 1, wherein a particular classifier is a supervised learning model, a partially supervised learning model, an unsupervised learning model, or an interpolation method.

The method of claim 1, wherein identifying the language of the message comprises selecting the presented language based on the confidence score.

The method of claim 1, wherein identifying the language of the message comprises selecting the classifier based on the identified language domain.

The method of claim 1, wherein the language domain is selected from the group consisting of video games, sports, news, agenda, politics, health, and travel.

The method of claim 1, wherein the message includes two or more of letters, numbers, symbols, and emoticons.

The method of claim 1, wherein the plurality of different language detection tests includes at least two methods selected from the group consisting of byte n-gram methods, dictionary-based methods, alphabet-based methods, script-based methods, and user language profile methods.

The method of claim 1, wherein the plurality of different language detection tests are performed simultaneously.

The method of claim 1, wherein the one or more combinations include a score set obtained from a byte n-gram method and a dictionary based method.

The method of claim 8, wherein the one or more combinations further include a score set obtained from at least one of a user language profile method and an alphabet-based method.

A system for identifying the language of a message, the system comprising:
a computer storage device that stores instructions; and
a data processing device configured to execute the instructions to perform operations comprising:
performing a plurality of different language detection tests on a message associated with a user, each language detection test determining a score set, each score in the score set representing a likelihood that the message is in one of a plurality of different languages;
providing one or more combinations of the score sets as input to one or more different classifiers;
obtaining, as output from each of the one or more classifiers, an indication that the message is in one of the plurality of different languages, the indication comprising a confidence score; and
identifying the language of the message as the language indicated by one of the one or more classifiers, based on at least one of the confidence score and an identified language domain.

The system of claim 11, wherein a particular classifier is a supervised learning model, a partially supervised learning model, an unsupervised learning model, or an interpolation method.

The system of claim 11, wherein identifying the language of the message includes selecting the presented language based on the confidence score.

The system of claim 11, wherein identifying the language of the message includes selecting the classifier based on the identified language domain.

The system of claim 11, wherein the language domain is selected from the group consisting of video games, sports, news, agenda, politics, health, and travel.

The system of claim 11, wherein the message includes two or more of letters, numbers, symbols, and emoticons.

The system of claim 11, wherein the plurality of different language detection tests includes at least two methods selected from the group consisting of byte n-gram methods, dictionary-based methods, alphabet-based methods, script-based methods, and user language profile methods.

The system of claim 11, wherein the plurality of different language detection tests are performed simultaneously.

The system of claim 11, wherein the one or more combinations include a score set obtained from a byte n-gram method and a dictionary based method.

The system of claim 18, wherein the one or more combinations further include a score set obtained from at least one of a user language profile method and an alphabet based method.

A computer program product stored in one or more storage devices for controlling a processing mode of a data processing device, the computer program product being executable by the data processing device to cause the data processing device to perform operations comprising:
performing a plurality of different language detection tests on a message associated with a user, each language detection test determining a score set, each score in the score set representing a likelihood that the message is in one of a plurality of different languages;
providing one or more combinations of the score sets as input to one or more different classifiers;
obtaining, as output from each of the one or more classifiers, an indication that the message is in one of the plurality of different languages, the indication comprising a confidence score; and
identifying the language of the message as the language indicated by one of the one or more classifiers, based on at least one of the confidence score and an identified language domain.

The computer program product of claim 21, wherein a particular classifier is a supervised learning model, a partially supervised learning model, an unsupervised learning model, or an interpolation method.

The computer program product of claim 21, wherein identifying the language of the message includes selecting the presented language based on the confidence score.

The computer program product of claim 21, wherein identifying the language of the message comprises selecting the classifier based on the identified language domain.

The computer program product of claim 21, wherein the language domain is selected from the group consisting of video games, sports, news, agenda, politics, health, and travel.

The computer program product of claim 21, wherein the message includes two or more of letters, numbers, symbols, and emoticons.

The computer program product of claim 21, wherein the plurality of different language detection tests includes at least two methods selected from the group consisting of byte n-gram methods, dictionary-based methods, alphabet-based methods, script-based methods, and user language profile methods.

The computer program product of claim 21, wherein the plurality of different language detection tests are performed simultaneously.

The computer program product of claim 21, wherein the one or more combinations include a score set obtained from a byte n-gram method and a dictionary based method.

The computer program product of claim 28, wherein the one or more combinations further include a score set obtained from at least one of a user language profile method and an alphabet-based method.