JP4618083B2

JP4618083B2 - Document processing apparatus and document processing method

Info

Publication number: JP4618083B2
Application number: JP2005284585A
Authority: JP
Inventors: 幸治奥村
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2005-09-29
Filing date: 2005-09-29
Publication date: 2011-01-26
Anticipated expiration: 2025-09-29
Also published as: JP2007094838A

Description

The present invention relates to a document processing apparatus and a document processing method for automatically extracting important parts from a document, and in particular to an apparatus and a method that do so using a hash function.

In recent years, with the rapid growth in the number of network users and remarkable advances in communication technology, there are more and more opportunities to use electronic devices such as mobile terminals to view documents distributed from other devices over a network. In this situation, users of mobile terminals very often want to view, on their own terminals, documents (text) that were not originally created for mobile terminals, such as information-rich Web pages and e-mails.

However, because a mobile terminal is carried by its user at all times, it is required to be small and lightweight. Mobile terminals, which have become markedly smaller and lighter in recent years in response to this requirement, are limited by their specifications in how much information they can store and process, so a document such as an e-mail sometimes cannot be displayed all at once.

To address this problem, research has explored using natural language processing to automatically generate a summary from an input document, thereby reducing the amount of information, and then sending the summary to the mobile terminal for display. This research aims to have a computer generate document summaries automatically using techniques such as semantic understanding and context understanding. However, the technology has not yet been established at a practical level, and at present it is difficult to put into practical use.

Therefore, as a method that is easier to put into practice, a technique has been proposed that extracts important sentences from a document by keyword matching (see, for example, Patent Document 1). In this technique, specific words and phrases that statistically occur often in important sentences or in unnecessary sentences are first registered in a computer by hand, and each sentence of the text is then checked for whether it contains any of the registered words or phrases. The importance of each sentence is determined from the number of such words and phrases it contains, and the one or more sentences determined to have high importance are extracted as the summary.

Patent Document 1: JP-A-6-259423

However, the above technique has the following three problems. First, the specific words and phrases that serve as clues for calculating importance must be registered in the computer in advance, in sufficient quantity for practical use, and this work takes a great deal of time and labor. Moreover, to keep up with buzzwords that change with the times, the registration work must be continued so that the registered information is always up to date.

Second, to determine whether a registered word or phrase is contained in each sentence, each sentence must be compared with the registered words and phrases, which places a very high processing load on the computer. Specifically, to determine whether a registered word or phrase appears as part of a sentence, the character string of the sentence must be compared with the registered word while shifting the comparison position one character at a time. In other words, the character-string comparison between a sentence and a registered word or phrase must be repeated as many times as there are characters in the sentence. The amount of processing needed for string matching therefore becomes very large.
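The cost described above can be illustrated with a minimal Python sketch. The function name `naive_contains` and the comparison counter are illustrative assumptions and are not taken from the cited technique; the sketch only shows why the number of window comparisons grows with the length of the sentence.

```python
def naive_contains(sentence: str, keyword: str) -> tuple[bool, int]:
    """Slide the keyword across the sentence one character at a time.

    Returns whether the keyword was found and how many window
    comparisons were needed (hypothetical counter, for illustration).
    """
    comparisons = 0
    width = len(keyword)
    for start in range(len(sentence) - width + 1):
        comparisons += 1
        # Each step compares one window of the sentence with the keyword.
        if sentence[start:start + width] == keyword:
            return True, comparisons
    return False, comparisons


found, steps = naive_contains("please reply by friday", "friday")
print(found, steps)
```

This loop has to be repeated for every registered word or phrase and for every sentence, which is the load the passage refers to.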

Third, it is difficult for the above technique to support multiple languages. The words and phrases to be registered must be selected separately for each language, so the developer or registrant must be well versed in the target language. In particular, for less widely used languages it is difficult to find people who are familiar with the language, and even if such a person is found, employing that person is costly.

The present invention has been made in view of the above problems, and its object is to provide a new and improved document processing apparatus and document processing method that automatically extract important parts from a document at high speed by reducing the processing load.

To solve the above problems, according to one aspect of the present invention, there is provided a document processing apparatus comprising: a storage unit that stores calculated values, each calculated according to a predetermined rule, in association with an appearance frequency representing how often that calculated value has been calculated; a dividing unit that divides a document into a plurality of character strings according to a predetermined condition; a calculation unit that obtains, according to the predetermined rule, a calculated value unique to each of the divided character strings; a search unit that detects the appearance frequency corresponding to each obtained calculated value by comparing it with the calculated values stored in the storage unit; and an extraction unit that selects one or more calculated values on the basis of the detected appearance frequencies and extracts the character strings corresponding to the selected calculated values as a summary of the document.

Conventionally, when extracting a summary from a document, matching was performed by checking whether a character string in the document contains a pre-registered word while shifting the comparison position one character at a time. This required as many character-string comparisons as there are characters.

According to the present invention, by contrast, a single piece of data (a calculated value) unique to each sentence is calculated from the character string of that divided sentence, and this single calculated value is matched against the calculated values stored in the storage unit. That is, the present invention requires only a single numerical comparison rather than character-string comparisons. The matching process can therefore be completed very quickly when extracting a summary from a document, and the processing load is dramatically lower than in the conventional approach, which checks whether each pre-registered word is contained in the character string of each sentence while shifting one character at a time.

As a result of this high-speed matching, the present invention can quickly extract the important sentences that form the summary of a document, based on the appearance frequencies stored for the matched calculated values. Consequently, even information that until now could not be displayed on devices with relatively modest specifications, such as mobile phones, because the document was too large can be quickly summarized and displayed. The user can thus quickly grasp the important parts of a document from its summary, even for a document that previously could not be viewed at all.

In addition, every time a document is summarized, the calculated values and information on their appearance frequencies are accumulated "automatically" in the storage unit. This eliminates the work of registering in advance, to a degree sufficient for practical use, the specific words and phrases that serve as clues for judging whether a sentence is important or unnecessary.

Furthermore, the character string of each sentence is converted into plain data that does not depend on the language of the document, so documents can be summarized regardless of their language. Accordingly, when a system is built or operated using this document processing apparatus, developers and registrants need not be familiar with each language, and even unknown languages can be handled.

The calculation unit may use a hash function as the predetermined rule and obtain a hash value from each character string as the calculated value.

The extraction unit may compare the appearance frequencies corresponding to the retrieved calculated values and select one or more calculated values in order, starting from the calculated value stored in association with the lowest appearance frequency.

When the comparison shows that a calculated value obtained by the calculation unit is stored in the storage unit, the search unit may increase the appearance frequency stored in association with that calculated value; when the calculated value is not stored in the storage unit, the search unit may newly store it together with an appearance frequency having a given value.

The dividing unit can divide the document into a plurality of character strings each constituting a phrase, a sentence, or a paragraph.

The document processing apparatus may further include a normalization unit that regularizes the format of the character strings contained in either the document or the divided sentences.

With this configuration, the format of the character strings is made uniform, for example by unifying full-width and half-width characters. This eliminates calculation discrepancies caused by differences in notation, so that more accurate calculated values are obtained from the regularized character strings and a summary can be extracted from the document more accurately.

The document processing apparatus may further include a text classification unit that determines an attribute of the document. In this case, the storage unit accumulates a plurality of appearance frequencies for each obtained calculated value, one for each document attribute. The extraction unit weights each of these per-attribute appearance frequencies with a correlation value determined from the correlation between the attribute determined for the document and the document attribute stored in the storage unit, and selects one or more calculated values on the basis of the weighted appearance frequencies.

The correlation value may be set so that it is larger the weaker the relation between the document attribute determined by the text classification unit and the document attribute stored in the storage unit.

The extraction unit may also use the correlation values to weight the plurality of appearance frequencies stored for each calculated value by document attribute, calculate the sum of the weighted appearance frequencies as the importance corresponding to that calculated value, and select the corresponding one or more calculated values in descending order of the calculated importance.

With this configuration, correlation values for document attributes are determined in advance and each appearance frequency is weighted with them. The correlation value is set higher the lower the degree of correlation. For example, the correlation value for a word or sentence that appears frequently only in a specific field can be set small in advance, because it is closely related to that field. Sentences that appear frequently only in a specific field are thereby weighted so that their importance is higher than that of sentences that also appear frequently in other fields. As a result, the importance m of each sentence is calculated more appropriately, and a richer summary can be extracted on the basis of the calculated importance.

The calculation unit may also obtain a calculated value unique to part or all of the text as an overall calculated value, and the search unit may search for whether the overall calculated value is stored in the storage unit. When it is not stored, the search unit stores the overall calculated value in association with the character strings extracted by the extraction unit as the summary of the document. When it is stored, the extraction unit extracts, as the summary of the document, the character strings corresponding to the calculated values stored in the storage unit in association with the overall calculated value, without the dividing unit, the calculation unit, the search unit, and the extraction unit performing their respective operations.

With this configuration, even when an e-mail with the same content is input multiple times because of an operating error or mis-sending, the appearance frequencies keep appropriate values that reflect the actual situation. This prevents the importance of each sentence from being lowered more than necessary.

Moreover, when text identical to previously input text is input again, for example when an e-mail with the same content is input multiple times, there is no need to repeat the computationally expensive calculation of calculated values and the matching process. The summary text can therefore be provided to the user quickly by reusing the previously extracted important sentences, while reducing the processing load.
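The duplicate-text shortcut described above can be sketched as follows. This is a minimal illustration under stated assumptions: SHA-256 stands in for the unspecified rule that produces the overall calculated value, and `text_hash`, `summarize_with_cache`, the `cache` dictionary, and the `summarize` callback are hypothetical names introduced here, not elements named in the patent.

```python
import hashlib
from typing import Callable


def text_hash(text: str) -> str:
    """Hash the whole text (SHA-256 is an assumed choice of hash function)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def summarize_with_cache(
    text: str,
    cache: dict[str, list[str]],
    summarize: Callable[[str], list[str]],
) -> list[str]:
    """Return the stored summary for text seen before; otherwise compute it."""
    key = text_hash(text)
    if key in cache:
        # Identical text was input earlier: skip division, hashing of
        # sentences, matching and extraction, and reuse the stored summary.
        return cache[key]
    summary = summarize(text)  # full pipeline, supplied by the caller
    cache[key] = summary
    return summary
```

Because the full pipeline is skipped for a repeated text, the per-sentence appearance frequencies are not incremented a second time, which matches the stated aim of keeping the importance of each sentence from dropping more than necessary.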

To solve the above problems, according to another aspect of the present invention, there is provided a document processing method comprising: storing, in a storage unit, calculated values calculated according to a predetermined rule in association with an appearance frequency representing how often each calculated value has been calculated; dividing a document into a plurality of character strings according to a predetermined condition; obtaining, according to the predetermined rule, a calculated value unique to each of the divided character strings; retrieving the appearance frequency corresponding to each obtained calculated value by comparing it with the calculated values stored in the storage unit; selecting one or more calculated values on the basis of the retrieved appearance frequencies; and extracting the character strings corresponding to the selected calculated values as a summary of the document.

With this method, the single calculated value corresponding to each sentence is matched at high speed against the calculated values stored in the storage unit. The importance of each sentence can thus be determined quickly, and a summary of the input document can be extracted quickly on the basis of that importance, so that the document can be summarized and provided to the user without delay.

As described above, the present invention provides a new and improved document processing apparatus and document processing method that automatically extract important parts from a document at high speed by reducing the processing load.

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. In the following description and the accompanying drawings, components having the same configuration and function are denoted by the same reference numerals, and redundant description is omitted. In each of the following embodiments, e-mail is used as an example of text (a document), and a document processing apparatus and method that automatically generate its summary are described.

(First embodiment)
(Hardware configuration of document processing apparatus 100)
First, the hardware configuration of the document processing apparatus according to the first embodiment will be described with reference to FIG. 1. The document processing apparatus 100 includes an HDD (Hard Disk Drive) 105, a ROM (Read Only Memory) 110, a RAM (Random Access Memory) 115, a processor 120, an interface 125, and a bus 130.

The HDD 105 stores various data and programs, including the information needed to extract important sentences from text. The HDD 105 is one example of a storage device; a storage device such as an optical disk or a magneto-optical disk may be used instead.

The ROM 110 stores the basic program for operating the processor 120, a program that is started when the processor 120 malfunctions, and the like. The RAM 115 temporarily stores data such as text input from outside, the divided sentences described later, and classification codes. The processor 120 executes programs stored in the HDD 105, the ROM 110, and so on in order to generate a summary from the input text.

The interface 125 receives text from input devices such as a keyboard 200, an OCR (Optical Character Reader) 205, a network card 210, and a voice input device 215. The interface 125 also outputs the summary sentences extracted from the text to output devices such as a CRT (Cathode Ray Tube) 300, a printer 305, a network card 310, and an audio output device 315.

The bus 130 is the path over which information is exchanged among the HDD 105, the ROM 110, the RAM 115, the processor 120, and the interface 125.

(Functional configuration of document processing apparatus 100)
Next, the functional configuration of the document processing apparatus 100 will be described with reference to FIG. 2. The document processing apparatus 100 has the functions represented by the following blocks: an input unit 150, a dividing unit 155, a normalization unit 160, a sentence hash calculation unit 165, a storage unit 170, a sentence hash search unit 175, an extraction unit 180, and an output unit 185.

The input unit 150 receives and digitizes text such as text typed on the keyboard 200 shown in FIG. 1, text captured from the OCR 205 or the network card 210, text entered by voice through the voice input device 215, and Web pages and e-mails transmitted from outside.

The dividing unit 155 divides the input document into a plurality of character strings according to a predetermined condition. In the following description, the dividing unit 155 divides the target e-mail into sentences. For example, the dividing unit 155 divides the document wherever a sentence-ending period appears (to separate sentences delimited by periods) or wherever a line break appears (to separate the lines of an itemized list).
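The splitting condition described above (divide at a sentence-ending period or at a line break) can be sketched in Python as follows. The function name `split_sentences` and the `terminators` parameter are assumptions made for this illustration; the source does not specify how the dividing unit 155 is implemented or how it treats whitespace and empty lines.

```python
def split_sentences(text: str, terminators: str = "。") -> list[str]:
    """Split text after each terminator character and at each line break."""
    sentences: list[str] = []
    current: list[str] = []

    def flush() -> None:
        # Store the accumulated characters as one sentence, skipping blanks.
        sentence = "".join(current).strip()
        if sentence:
            sentences.append(sentence)
        current.clear()

    for ch in text:
        if ch == "\n":
            flush()              # a line break ends the current string
        else:
            current.append(ch)
            if ch in terminators:
                flush()          # a period ends the current sentence
    flush()                      # whatever remains after the last delimiter
    return sentences


print(split_sentences("Hello. See you at 3.\n- bring slides", terminators="."))
```

Treating every ASCII period as a sentence end would also split strings such as "3.5", so a practical splitter for Western-language text would need a more careful condition than this sketch uses.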

The normalization unit 160 unifies differences in notation. One example is converting full-width and half-width characters to a single form. The normalization unit 160 may unify the format of each sentence after division, or of the document before division. The normalization unit 160 is not an essential function of the document processing apparatus 100 according to the present embodiment. However, it eliminates discrepancies in the calculated hash values that arise from differences in how a character string is written, so that more accurate calculated values are obtained from the regularized character strings. As a result, a summary can be extracted from the input document more accurately.
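One way to unify full-width and half-width characters is Unicode compatibility normalization. The sketch below uses NFKC from Python's standard `unicodedata` module as an assumed stand-in for the normalization unit 160; the source does not name a normalization scheme, and NFKC also folds other compatibility characters beyond width variants. The function name `normalize_text` is hypothetical.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Fold width variants to a single form using NFKC (assumed scheme)."""
    return unicodedata.normalize("NFKC", text)


# Full-width Latin letters and digits become their ordinary ASCII forms,
# and half-width katakana becomes ordinary katakana.
print(normalize_text("ＡＢＣ１２３"))
print(normalize_text("ﾃｽﾄ"))
```

After this step, two sentences that differ only in character width produce the same string and therefore the same hash value.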

The sentence hash calculation unit 165 (corresponding to the calculation unit) obtains, according to a predetermined rule, a calculated value unique to each of the divided character strings. For example, the sentence hash calculation unit 165 uses a hash function to calculate a hash value for each sentence produced by the dividing unit 155. A hash function is a function that converts a sequence of characters, such as text or digits, into data of fixed length (a hash value).
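The fixed-length property can be seen in a short Python sketch. SHA-256 from the standard `hashlib` module is an assumption made for illustration, since the source does not say which hash function the sentence hash calculation unit 165 uses; `sentence_hash` is a hypothetical name.

```python
import hashlib


def sentence_hash(sentence: str) -> str:
    """Map a sentence of any length to a fixed-length hexadecimal digest."""
    return hashlib.sha256(sentence.encode("utf-8")).hexdigest()


short = sentence_hash("OK.")
long_ = sentence_hash("The meeting has been moved to 3 pm in room 204. " * 20)
print(len(short), len(long_))  # both digests have the same length
```

Strictly speaking, different strings can map to the same hash value, so "unique" holds only to the extent that collisions are negligible for the chosen function.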

The storage unit 170 has a divided sentence hash table 170a. As shown in FIG. 3, the divided sentence hash table 170a accumulates the hash values 170a1 calculated from each sentence of the text input so far, together with an appearance frequency 170a2 indicating how often each hash value has appeared. The appearance frequency 170a2 is one example of a value representing how often a hash value has been calculated by the sentence hash calculation unit 165, that is, how often the character string of a sentence appears. Any value that represents how often a character string appears may be used; it need not be the number of appearances and may be, for example, the probability that the character string appears.

The sentence hash search unit 175 (corresponding to the search unit) performs matching to determine whether the hash value of each sentence just calculated by the sentence hash calculation unit 165 is stored in the divided sentence hash table 170a. When a newly calculated hash value matches a hash value 170a1 stored in the divided sentence hash table 170a, the sentence hash search unit 175 detects the appearance frequency 170a2 stored for the matching hash value.

When the hash value obtained by the sentence hash calculation unit 165 is stored in the divided sentence hash table 170a in this way, the sentence hash search unit 175 increases the appearance frequency 170a2 stored in association with that hash value 170a1, for example by 1.

When the obtained hash value is not stored in the divided sentence hash table 170a, on the other hand, the sentence hash search unit 175 stores in the divided sentence hash table 170a the hash value obtained by the sentence hash calculation unit 165 together with an appearance frequency having a predetermined value (for example, "1", indicating that the hash value has appeared once).
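The two cases handled by the sentence hash search unit 175 (increment an existing entry, or create a new entry with a predetermined value) can be sketched with a Python dictionary standing in for the divided sentence hash table 170a. The function name `record_occurrence` and its `initial` and `step` parameters are hypothetical, and returning the updated frequency is a choice made for this sketch; the source does not state whether the frequency is read before or after the update.

```python
def record_occurrence(
    table: dict[str, int],
    hash_value: str,
    initial: int = 1,
    step: int = 1,
) -> int:
    """Update the appearance frequency for one hash value and return it."""
    if hash_value in table:
        # Hash already stored: increase its appearance frequency.
        table[hash_value] += step
    else:
        # Hash not stored yet: register it with the predetermined value.
        table[hash_value] = initial
    return table[hash_value]


table: dict[str, int] = {}
record_occurrence(table, "h-greeting")
record_occurrence(table, "h-greeting")
record_occurrence(table, "h-schedule")
print(table)
```

Because the table is filled as documents are processed, no list of keywords has to be registered by hand beforehand.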

The extraction unit 180 selects one or more appearance frequencies on the basis of the appearance frequency values corresponding to the hash values detected by the sentence hash search unit 175, and extracts as the summary the one or more original sentences corresponding to the hash values stored in association with the selected appearance frequencies. When extracting important sentences, the extraction unit 180 applies the rule that, among the hash values corresponding to the sentences in the input text, a sentence whose hash value has a high appearance frequency is of low importance and a sentence whose hash value has a low appearance frequency is of high importance. Specifically, the extraction unit 180 compares the appearance frequencies 170a2 of the relevant hash values 170a1 stored in the divided sentence hash table 170a, selects the hash values 170a1 whose appearance frequencies 170a2 are relatively low, and extracts the sentences corresponding to the selected hash values as important sentences.

１または２以上の重要文を決定する具体的方法としては，たとえば，抽出部１８０は，入力テキストに含まれる各文のうち，一番出現度数が低いハッシュ値に対応する文を重要文として抽出する方法が挙げられる。また，抽出部１８０は，入力テキストに含まれる各文のうち，出現度数が低い順に重要文を数文抽出するようにしてもよい。出力部１８５は，このようにして抽出された重要文をテキストの要約文として，たとえば，携帯電話の画面などに出力する。 As a specific method for determining one or more important sentences, for example, the extraction unit 180 extracts, as important sentences, a sentence corresponding to the hash value having the lowest appearance frequency among the sentences included in the input text. The method of doing is mentioned. Further, the extraction unit 180 may extract several important sentences from the sentences included in the input text in ascending order of appearance frequency. The output unit 185 outputs the important sentence extracted in this way as a text summary sentence, for example, on a screen of a mobile phone.

なお，以上に説明した文書処理装置１００の各機能は，実際には，プロセッサ１２０がこれらの機能を実現する処理手順を記述したプログラムを実行することにより，または，いずれかの機能を実現するためのハードウエアやＩＣの制御により達成される。たとえば，入力部１５０および出力部１８５の機能は，図１のインターフェース１２５として機能するＩＣにより実現されるようにしてもよい。また，分割部１５５，正規化部１６０，文ハッシュ計算部１６５，文ハッシュ検索部１７５，抽出部１８０の機能は，これらの機能を実現する処理手順を記述したプログラムを図１のプロセッサ１２０が実行することにより達成されるようにしてもよい。また，記憶部１７０の機能は，ＨＤＤ１０５，ＲＯＭ１１０またはＲＡＭ１１５等の記憶領域を用いて達成されるようにしてもよい。 In practice, each function of the document processing apparatus 100 described above is achieved either by the processor 120 executing a program that describes the processing procedure for realizing the function, or by control of hardware or an IC that realizes the function. For example, the functions of the input unit 150 and the output unit 185 may be realized by an IC that functions as the interface 125 of FIG. 1. The functions of the dividing unit 155, the normalization unit 160, the sentence hash calculation unit 165, the sentence hash search unit 175, and the extraction unit 180 may be achieved by the processor 120 of FIG. 1 executing a program that describes the processing procedures for realizing these functions. Further, the function of the storage unit 170 may be achieved using a storage area such as the HDD 105, the ROM 110, or the RAM 115.

（文書処理装置１００の動作） (Operation of the document processing apparatus 100)
つぎに，本実施形態にかかる文書処理装置１００の具体的動作について，図４を参照しながら説明する。図４は，本実施形態にかかる文書処理装置１００が実行する重要文抽出処理を示したフローチャートである。なお，この重要文抽出処理が実行される前に，入力部１５０によりテキストが入力され，記憶部１７０により入力されたテキストがＲＡＭ１１５またはＨＤＤ１０５に記憶されているものとする。 Next, a specific operation of the document processing apparatus 100 according to the present embodiment will be described with reference to FIG. 4. FIG. 4 is a flowchart showing the important sentence extraction process executed by the document processing apparatus 100 according to the present embodiment. It is assumed that, before the important sentence extraction process is executed, text has been input through the input unit 150 and the storage unit 170 has stored the input text in the RAM 115 or the HDD 105.

ステップ４００から重要文抽出処理が開始され，ステップ４０５に進むと，分割部１５５は，入力されたテキストを文単位に分割する。ここでの文には，区点で区切られたものの他に箇条書きにされた各行も含まれている。 When the important sentence extraction process is started from step 400 and the process proceeds to step 405, the dividing unit 155 divides the input text into sentence units. The sentence here includes each line in a bulleted list in addition to what is delimited by the punctuation marks.

つぎに，ステップ４１０に進み，正規化部１６０が，句読点や，半角文字，全角文字などの文字列の形式の統一を行い，ステップ４１５に進んで，文ハッシュ計算部１６５が，分割された文毎のハッシュ値を計算する。具体的には，文ハッシュ計算部１６５は，ＲＦＣ１３２１に示されているＭＤ５や，ＲＦＣ３１７４に示されているＳＨＡ−１などのハッシュ関数を用いて，与えられた原文（分割文）から固定長の擬似乱数であるハッシュ値を算出する。 Next, in step 410, the normalization unit 160 unifies the format of the character strings, such as punctuation marks and half-width and full-width characters. In step 415, the sentence hash calculation unit 165 calculates a hash value for each divided sentence. Specifically, the sentence hash calculation unit 165 uses a hash function such as MD5, defined in RFC 1321, or SHA-1, defined in RFC 3174, to compute from a given original sentence (divided sentence) a hash value, which is a fixed-length pseudo-random number.

これにより，各分割文の文字列が，たとえば図３に示したように，固定長であって分割された各文に固有の計算値（ハッシュ値１７０ａ１）に変換される。そして，このように変換されたハッシュ値１７０ａ１は，つぎに説明する文ハッシュ検索部１７５の機能を用いて，ハッシュ値の出現度数１７０ａ２とともに分割文ハッシュテーブル１７０ａに記憶される。このようにして，文ハッシュ計算部１６５により毎回計算されるハッシュ値とそのハッシュ値の出現度数が分割文ハッシュテーブル１７０ａに蓄積される。 As a result, the character string of each divided sentence is converted into a calculated value (hash value 170a1) having a fixed length and unique to each divided sentence, as shown in FIG. The hash value 170a1 converted in this way is stored in the divided sentence hash table 170a together with the appearance frequency 170a2 of the hash value using the function of the sentence hash search unit 175 described below. In this way, the hash value calculated each time by the sentence hash calculation unit 165 and the appearance frequency of the hash value are accumulated in the divided sentence hash table 170a.
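As an illustrative sketch of steps 410 and 415 (not part of the original disclosure), the following Python fragment normalizes a sentence and hashes it with MD5. The use of Unicode NFKC normalization, UTF-8 encoding, and the function names `normalize_sentence` and `sentence_hash` are assumptions made for this example; the description only requires that character forms be unified before a fixed-length hash value is computed.

```python
import hashlib
import unicodedata


def normalize_sentence(sentence: str) -> str:
    """Unify full-width and half-width forms (assumed: NFKC) and trim whitespace."""
    return unicodedata.normalize("NFKC", sentence).strip()


def sentence_hash(sentence: str) -> str:
    """Return the fixed-length MD5 digest (RFC 1321) of the normalized sentence as hex."""
    normalized = normalize_sentence(sentence)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# Full-width and half-width spellings of the same sentence map to the same hash value.
same = sentence_hash("ＡＢＣ１２３") == sentence_hash("ABC123")
```

Because the digest has a fixed length regardless of sentence length, comparing two sentences reduces to comparing two short strings.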

つぎに，ステップ４２０に進むと，文ハッシュ検索部１７５は，各文に対応して算出された各ハッシュ値が，分割文ハッシュテーブル１７０ａに記憶されたハッシュ値１７０ａ１のいずれかに一致するか否かを検索する。
Next, when proceeding to step 420, the sentence hash search unit 175 determines whether each hash value calculated corresponding to each sentence matches one of the hash values 170a1 stored in the divided sentence hash table 170a. Search for.

検索の結果，文ハッシュ計算部１６５により求められたハッシュ値が分割文ハッシュテーブル１７０ａに記憶されていると判定された場合，文ハッシュ検索部１７５は，そのハッシュ値に関連付けて記憶された出現度数１７０ａ２を増加（たとえば，出現度数１７０ａ２を１つ増加）する。一方，文ハッシュ計算部１６５により求められたハッシュ値が分割文ハッシュテーブル１７０ａに記憶されていないと判定された場合には，求められたハッシュ値とともに出現度数として予め定められた所定値，たとえば「１」を記憶する。 If the search determines that the hash value obtained by the sentence hash calculation unit 165 is stored in the divided sentence hash table 170a, the sentence hash search unit 175 increases the appearance frequency 170a2 stored in association with that hash value (for example, by one). If, on the other hand, it determines that the hash value is not stored in the divided sentence hash table 170a, it stores the obtained hash value together with a predetermined value, for example “1”, as the appearance frequency.
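The search-and-update behaviour of step 420 can be sketched as follows. This is a minimal illustration, assuming that the divided sentence hash table is held as an in-memory Python dictionary mapping hash value to appearance frequency; the function name `record_hashes` and the placeholder hash strings are hypothetical.

```python
def record_hashes(table: dict[str, int], hashes: list[str]) -> None:
    """Update the hash -> appearance-frequency table in place."""
    for value in hashes:
        if value in table:
            table[value] += 1  # already stored: increase the frequency by one
        else:
            table[value] = 1   # not stored yet: register with the predetermined value 1


table: dict[str, int] = {}
record_hashes(table, ["h-greeting", "h-news"])   # sentences of a first text
record_hashes(table, ["h-greeting", "h-other"])  # sentences of a second text
```

After both texts are processed, the greeting that occurs in each text has a frequency of 2, while the other sentences remain at 1.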

つぎに，ステップ４２５に進み，抽出部１８０が，元の文（原文）の文書（テキスト）から重要文を決定し，出力部１８５が，抽出した重要文を要約テキストとして出力する。具体的には，抽出部１８０は，分割文ハッシュテーブル１７０ａに記憶された出現度数１７０ａ２を用いて以下のように重要文を抽出する。 In step 425, the extraction unit 180 determines an important sentence from the original sentence (original sentence) document (text), and the output unit 185 outputs the extracted important sentence as a summary text. Specifically, the extraction unit 180 extracts an important sentence as follows using the appearance frequency 170a2 stored in the divided sentence hash table 170a.

分割文ハッシュテーブル１７０ａに記憶された出現度数１７０ａ２は，前述したように，今までに計算されたすべてのテキストから分割された各文のハッシュ値が出現した回数の累積であり，各文がこれまでにどれだけ出現したかを表す指標となる。よって，出現度数１７０ａ２が大きい値をもつということは，電子メールなどの多くのテキストに出現する文であると判定することができる。そして，このように多く出現する文は，挨拶（たとえば，「おはようございます」や「お世話になります」）などの可能性が高く，一般に，重要度が低いと推定される。このような原理から，各文に対応するハッシュ値の出願度数ｎと各文の重要度ｍとの関係は，関数ｆを用いてつぎのように表される。 As described above, the appearance frequency 170a2 stored in the divided sentence hash table 170a is the cumulative number of times the hash value of each sentence, divided from all the texts processed so far, has appeared, and thus serves as an index of how often each sentence has appeared to date. A large appearance frequency 170a2 therefore indicates a sentence that appears in many texts, such as e-mails. A sentence that appears this often is likely to be a greeting (for example, “Good morning” or “Thank you for your help”) and is generally presumed to be of low importance. On this principle, the relationship between the appearance frequency n of the hash value corresponding to each sentence and the importance m of each sentence is expressed using a function f as follows.

ｍ＝ｆ（ｎ）＋α m = f(n) + α
ここで，α：他の要因によって決定される重要度（任意） Here, α is an importance determined by other factors (optional).
このとき，ｍ１＝ｆ（ｎ１），ｍ２＝ｆ（ｎ２）において， Then, for m1 = f(n1) and m2 = f(n2),
ｎ１＞ｎ２ならば，ｍ１≦ｍ２となる。 if n1 > n2, then m1 ≤ m2.

このようにして算出された各ハッシュ値に対する重要度ｍに基づいて，抽出部１８０は，重要度ｍが大きいハッシュ値１７０ａ１に対応した文を重要文として抽出する。 Based on the importance m for each hash value calculated in this way, the extraction unit 180 extracts a sentence corresponding to the hash value 170a1 having a large importance m as an important sentence.

このとき，抽出部１８０は，たとえば，入力されたテキストに含まれる各文に対応するハッシュ値１７０ａ１のうち，一番出現度数１７０ａ２が低いハッシュ値１７０ａ１に対応する文を重要文として抽出するようにしてもよい。また，抽出部１８０は，入力されたテキストに含まれる各文に対応するハッシュ値１７０ａ１のうち，出現度数１７０ａ２が低い順に重要文を数文抽出するようにしてもよい。 At this time, for example, the extraction unit 180 extracts, as an important sentence, a sentence corresponding to the hash value 170a1 having the lowest appearance frequency 170a2 among the hash values 170a1 corresponding to the sentences included in the input text. May be. The extraction unit 180 may extract several important sentences from the hash value 170a1 corresponding to each sentence included in the input text in ascending order of appearance frequency 170a2.

なお，入力テキストに含まれる各文とこの各文から求められたハッシュ値とは，重要文抽出処理が終了するまで，記憶部１７０のいずれかの記憶領域に関連付けて記憶されている。よって，抽出部１８０は，この記憶領域に記憶されたデータに基づいて，出現度数１７０ａ２が低いハッシュ値に対応する文を重要文として抽出する。その後，抽出した重要文が携帯電話等に表示され，ステップ４９５に進んで本処理は終了となる。 Note that each sentence included in the input text and the hash value obtained from each sentence are stored in association with one of the storage areas of the storage unit 170 until the important sentence extraction process is completed. Therefore, the extraction unit 180 extracts a sentence corresponding to a hash value having a low appearance frequency 170a2 as an important sentence based on the data stored in the storage area. Thereafter, the extracted important sentence is displayed on a mobile phone or the like, and the process proceeds to step 495 to end the present process.
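The selection in step 425 can be sketched as below. The description leaves the function f open and only requires that it be non-increasing in n; this example assumes f(n) = -n, and the names `importance` and `extract_important`, as well as the placeholder hash values, are hypothetical.

```python
def importance(count: int, alpha: float = 0.0) -> float:
    """m = f(n) + alpha with the assumed choice f(n) = -n (non-increasing in n)."""
    return -count + alpha


def extract_important(sentences, hashes, table, k=1):
    """Return the k sentences with the highest importance, in ranked order."""
    scored = [(importance(table.get(h, 0)), s) for s, h in zip(sentences, hashes)]
    # sorted() is stable, so sentences with equal importance keep their text order
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [s for _, s in ranked[:k]]


sentences = ["Good morning.", "The meeting has moved to Friday.", "Thank you."]
hashes = ["h1", "h2", "h3"]  # placeholder hash values for the three sentences
table = {"h1": 1020, "h2": 1, "h3": 800}
summary = extract_important(sentences, hashes, table, k=1)
```

With these counts the two greetings score far lower than the rarely seen sentence, so the single-sentence summary is the sentence about the meeting.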

従来においては，登録された単語や言い回しが各文の一部に含まれているか否かを判定する場合，各文に含まれる文字列を一文字ずつずらしながら比較しなければならなかった。この結果，登録された単語や言い回しが，各文に含まれているか否かを判定するために，非常に多くの処理が必要であった。 Conventionally, when it is determined whether or not a registered word or phrase is included in a part of each sentence, the character strings included in each sentence must be compared while shifting one character at a time. As a result, a large amount of processing is required to determine whether or not a registered word or phrase is included in each sentence.

しかし，以上に説明したように，本実施形態にかかる文書処理装置１００によれば，各文をその文に固有な値，すなわち，ハッシュ値として認識し，各文に対応した１つのデータ（ハッシュ値）と分割文ハッシュテーブル１７０ａに記憶されたハッシュ値１７０ａ１とをマッチング処理し，マッチングした結果検出される出願度数により，各文の重要度が判定される。このため，従来に比べて処理の負荷を劇的に少なくすることができる。これにより，各文の重要度を高速に判定し，その重要度に基づいて，入力テキストから重要文を高速に抽出することができる。この結果，情報量が多いテキストであって，携帯電話等，比較的スペックに乏しい機器に今まで表示できなかった情報であってもこれをすばやく要約して表示することができる。このため，ユーザは，表示された要約文により，テキストの内容を知ることができるばかりでなく，そのテキストの重要部分をすばやく把握することができる。 However, as described above, the document processing apparatus 100 according to the present embodiment treats each sentence as a value unique to that sentence, namely a hash value, matches the single datum (hash value) corresponding to each sentence against the hash values 170a1 stored in the divided sentence hash table 170a, and determines the importance of each sentence from the appearance frequency detected as a result of the matching. The processing load can therefore be reduced drastically compared with the prior art. As a result, the importance of each sentence can be determined at high speed, and important sentences can be extracted from the input text at high speed on the basis of that importance. Consequently, even text that carries a large amount of information and could not previously be displayed on a device with relatively limited specifications, such as a mobile phone, can be quickly summarized and displayed. The user can thus not only learn the content of the text from the displayed summary but also quickly grasp its important parts.

また，本実施形態にかかる文書処理装置１００によれば，この装置を利用する度に，各文に対応するハッシュ値１７０ａ１とそのハッシュ値の出現度数１７０ａ２のデータとが，自動的に分割文ハッシュテーブル１７０ａに蓄積される。このため，重要文であるか，または，不要文であるかを判定する手がかりとなる特定の単語や言い回しを，実用に充分耐えうる程度まで計算機に予め登録しておくという作業が不要になる。 Further, according to the document processing apparatus 100 according to the present embodiment, each time this apparatus is used, the hash value 170a1 corresponding to each sentence and the data of the appearance frequency 170a2 of the hash value are automatically converted into the divided sentence hash. Accumulated in the table 170a. For this reason, it is not necessary to pre-register a specific word or phrase that serves as a clue to determine whether the sentence is an important sentence or an unnecessary sentence in the computer to a level that can sufficiently withstand practical use.

さらに，ハッシュ値の計算は，各文の文字列を各文字の種類（言語）に関係しない単なるデータとして計算するため，テキストの言語に依存せずに，本実施形態にかかる文書処理装置１００を使用してシステムを構築または運用することができる。このため，開発者や登録者は，それぞれの言語に精通している必要がなく，未知の言語であってもこれに対応することができる。 Further, since the hash value is calculated as simple data not related to the type (language) of each character, the document processing apparatus 100 according to the present embodiment is not dependent on the language of the text. Can be used to build or operate the system. For this reason, developers and registrants do not need to be familiar with each language, and can handle unknown languages.

（第２実施形態） (Second Embodiment)
つぎに，第２実施形態にかかる文書処理装置１００について説明する。本実施形態にかかる文書処理装置１００は，図５に示したように，分類係数テーブル１７０ｂとテキスト分類部１９０とが新たに追加された点で図２に示した第１実施形態にかかる文書処理装置１００と機能構成上相異する。 Next, the document processing apparatus 100 according to the second embodiment will be described. As shown in FIG. 5, the document processing apparatus 100 according to the present embodiment differs in functional configuration from the document processing apparatus 100 according to the first embodiment shown in FIG. 2 in that a classification coefficient table 170b and a text classification unit 190 are newly added.

また，本実施形態にかかる文書処理装置１００は，分類係数テーブル１７０ｂに予め登録された分類情報に基づいて入力テキストがどの分類に属するかを決定し，前述した出現度数と重要度との相関関係に加え，決定されたテキストの分類と重要度との相関関係をも考慮して入力テキストから重要文を抽出する点で第１実施形態にかかる文書処理装置１００と動作上相異する。したがって，これらの相異点を中心に本実施形態にかかる文書処理装置１００について説明する。 In addition, the document processing apparatus 100 according to the present embodiment determines which classification the input text belongs to based on the classification information registered in the classification coefficient table 170b in advance, and the correlation between the appearance frequency and importance described above. In addition, it is different in operation from the document processing apparatus 100 according to the first embodiment in that an important sentence is extracted from the input text in consideration of the correlation between the determined text classification and importance. Therefore, the document processing apparatus 100 according to the present embodiment will be described focusing on these differences.

本実施形態にかかる文書処理装置１００は，入力部１５０，分割部１５５，正規化部１６０，文ハッシュ計算部１６５，記憶部１７０内の分割文ハッシュテーブル１７０ａ，文ハッシュ検索部１７５，抽出部１８０，出力部１８５に加え，テキスト分類係数テーブル１７０ｂ（記憶部１７０内）およびテキスト分類部１９０の各ブロックにて示される機能を有している。 The document processing apparatus 100 according to this embodiment includes an input unit 150, a dividing unit 155, a normalizing unit 160, a sentence hash calculating unit 165, a divided sentence hash table 170a in the storage unit 170, a sentence hash searching unit 175, and an extracting unit 180. In addition to the output unit 185, the text classification coefficient table 170b (in the storage unit 170) and the text classification unit 190 have the functions shown in the blocks.

分割文ハッシュテーブル１７０ａは，図６に示したように，ハッシュ値１７０ａ１および出現度数１７０ａ２の項目に加え，分類コード１７０ａ３の項目が新たに記憶されている。たとえば，図３のハッシュ値「２７５３・・・ａ７５９」の出現度数は「１０２０」であったが，本実施形態では，図６に示したように，ハッシュ値「２７５３・・・ａ７５９」によって表される文が含まれるテキストの分類コード１７０ａ３から，その出現度数を二つに分けて分類している。具体的には，図３のハッシュ値「２７５３・・・ａ７５９」の出現度数「１０２０」は，分類コード１７０ａ３が「２０」の場合の出現度数「６２１」と分類コード１７０ａ３が「２４」の場合の出現度数「３９９」とに分けてカウントされている。 As shown in FIG. 6, the divided sentence hash table 170a stores a classification code 170a3 item in addition to the hash value 170a1 and appearance frequency 170a2 items. For example, the appearance frequency of the hash value “2753...a759” in FIG. 3 was “1020”; in this embodiment, as shown in FIG. 6, that appearance frequency is divided in two according to the classification code 170a3 of the text containing the sentence represented by the hash value “2753...a759”. Specifically, the appearance frequency “1020” of the hash value “2753...a759” in FIG. 3 is counted separately as an appearance frequency of “621” for classification code 170a3 “20” and an appearance frequency of “399” for classification code 170a3 “24”.

分類コードは，ハッシュ値１７０ａ１を算出した元の文が含まれているテキストの属性を示した一例である。分類コードは，たとえば，図７に示したように，野球（２０），サッカー（２４），経済（０６）というようにテキストを分類するために使用される。また，たとえば，図８に示したように，電子メールの受取人によって，本人（０１），本人が属する部（０２），本人が属さない部（０３）というようにテキストを分類してもよい。 The classification code is an example showing an attribute of a text including the original sentence for which the hash value 170a1 is calculated. For example, as shown in FIG. 7, the classification code is used to classify text such as baseball (20), soccer (24), and economy (06). Also, for example, as shown in FIG. 8, texts may be classified according to the recipient of the e-mail, such as the principal (01), the part to which the principal belongs (02), and the part to which the principal does not belong (03). .

分類コードは，テキストの属性を表す一例であり，テキストの属性を表すことができれば，どんな情報であってもよい。たとえば，テキストの属性は，電子メールまたはＷｅｂコンテンツというような文書の種類やテキストを送信した送信元の情報などにより表されてもよい。 The classification code is an example representing text attributes, and may be any information as long as it can represent text attributes. For example, the text attribute may be represented by a document type such as e-mail or Web content, information on a transmission source that transmitted the text, and the like.

分類係数テーブル１７０ｂには，分割文ハッシュテーブル１７０ａに記憶された分類コード（図７の横軸）と，入力テキストが属する分類コード（図７の縦軸）と，の２つの分類コードの相関関係により決定される数値（相関値）が予め記憶されている。この数値は，各分類コードの相関度が低いほど高い値をもっている。たとえば，入力テキストが属する分類コードが野球の場合，分割文ハッシュテーブル１７０ａに記憶された各分類コードが野球ならば，相関値は「１」となり，サッカーならば「１．５」となり，経済ならば「４」となる。よって，野球と経済との相関関係が一番低く，サッカー，野球の順に相関関係が高くなることがわかる。 In the classification coefficient table 170b, the correlation between two classification codes, the classification code stored in the divided sentence hash table 170a (horizontal axis in FIG. 7) and the classification code to which the input text belongs (vertical axis in FIG. 7). The numerical value (correlation value) determined by is stored in advance. This numerical value has a higher value as the correlation degree of each classification code is lower. For example, when the classification code to which the input text belongs is baseball, if each classification code stored in the divided sentence hash table 170a is baseball, the correlation value is “1”, if it is soccer, “1.5”, and if economic, Would be “4”. Therefore, it can be seen that the correlation between baseball and economy is the lowest, and the correlation increases in the order of soccer and baseball.

図８には，分類係数テーブル１７０ｂに記憶された他の情報の例が示されている。具体的には，分類係数テーブル１７０ｂには，電子メールの受取人により分類コードを本人（０１），本人が属する部（０２），本人が属さない部（０３）のいずれかに設定し，それらの分類コードに対する相関値が予め記憶されている。 FIG. 8 shows an example of other information stored in the classification coefficient table 170b. Specifically, in the classification coefficient table 170b, the classification code is set to one of the person (01), the part to which the person belongs (02), or the part to which the person does not belong (03) by the recipient of the e-mail. Correlation values for the classification codes are stored in advance.

テキスト分類部１９０は，入力されたテキストの分類を示す分類コードを特定する。 The text classification unit 190 specifies a classification code indicating the classification of the input text.
たとえば，テキストに含まれる単語の出現回数を用いて，図７に示したように，野球（２０），サッカー（２４），経済（０６）といったように入力テキストの分類コードを特定する方法や，図８に示したように，電子メールの受取人によって分類コードを特定する。 For example, the classification code of the input text may be specified from the number of occurrences of words contained in the text, as baseball (20), soccer (24), or economy (06) as shown in FIG. 7, or it may be specified according to the recipient of the e-mail as shown in FIG. 8.

（文書処理装置１００の動作） (Operation of the document processing apparatus 100)
つぎに，本実施形態にかかる文書処理装置１００の具体的動作について，図９を参照しながら説明する。図９は，本実施形態にかかる文書処理装置１００が実行する重要文抽出処理を示したフローチャートである。 Next, a specific operation of the document processing apparatus 100 according to the present embodiment will be described with reference to FIG. 9. FIG. 9 is a flowchart showing the important sentence extraction process executed by the document processing apparatus 100 according to the present embodiment.

ステップ９００から重要文抽出処理が開始され，ステップ９０５に進むと，テキスト分類部１９０は，入力されたテキストの分類を示す分類コードを特定する。つぎに，ステップ４０５〜ステップ４１５にて，各部が第１実施形態と同様の処理を実行する。すなわち，ステップ４０５にて，分割部１５５が，入力されたテキストを文単位に分割し，ステップ４１０にて，正規化部１６０が，文字等の正規化を行い，ステップ４１５にて，文ハッシュ計算部１６５が，各文のハッシュを計算する。 When the important sentence extraction process is started from step 900 and the process proceeds to step 905, the text classification unit 190 identifies a classification code indicating the classification of the input text. Next, in steps 405 to 415, each unit executes the same processing as in the first embodiment. That is, in step 405, the dividing unit 155 divides the input text into sentence units, in step 410, the normalizing unit 160 normalizes characters and the like, and in step 415, sentence hash calculation is performed. A unit 165 calculates a hash of each sentence.

つぎに，ステップ４２０に進むと，文ハッシュ検索部１７５は，各文から算出された各ハッシュ値が，分割文ハッシュテーブル１７０ａに記憶されたいずれかのハッシュ値１７０ａ１に一致するか否かを検索する。ここで，本実施形態の分割文ハッシュテーブル１７０ａには，ハッシュ値１７０ａ１が同じであっても，分類コード１７０ａ３が異なる複数の出現度数１７０ａ２が記憶されている。したがって，本実施形態では，文ハッシュ検索部１７５は，各文に対応する各ハッシュ値に一致する複数の出現度数１７０ａ２を検出する。 Next, in step 420, the sentence hash search unit 175 searches whether each hash value calculated from each sentence matches one of the hash values 170a1 stored in the divided sentence hash table 170a. To do. Here, even if the hash value 170a1 is the same, a plurality of appearance frequencies 170a2 with different classification codes 170a3 are stored in the divided sentence hash table 170a of the present embodiment. Therefore, in this embodiment, the sentence hash search unit 175 detects a plurality of appearance frequencies 170a2 that match each hash value corresponding to each sentence.

検索の結果，各文から求められたハッシュ値が分割文ハッシュテーブル１７０ａに記憶されている場合，文ハッシュ検索部１７５は，そのハッシュ値に関連付けて記憶された出現度数１７０ａ２のうち，テキストの分類コードに対応する出現度数１７０ａ２を１つ増加する。一方，各文から求められたハッシュ値が分割文ハッシュテーブル１７０ａに記憶されていない場合には，求められたハッシュ値およびテキストの分類コードとともに出現度数として「１」を記憶する。 When the hash value obtained from each sentence is stored in the divided sentence hash table 170a as a result of the search, the sentence hash search unit 175 classifies the text among the appearance frequencies 170a2 stored in association with the hash value. The appearance frequency 170a2 corresponding to the code is increased by one. On the other hand, if the hash value obtained from each sentence is not stored in the divided sentence hash table 170a, “1” is stored as the appearance frequency together with the obtained hash value and text classification code.
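A minimal sketch of this per-classification update, assuming the table is keyed by the pair (hash value, classification code). The dictionary layout, the function name `record_classified`, and the key "h-new" are hypothetical; the counts 621 and 399 and the elided hash value are those quoted for FIG. 6.

```python
def record_classified(table: dict[tuple[str, str], int], hashes: list[str], code: str) -> None:
    """Increase the frequency stored for each (hash value, classification code) pair."""
    for value in hashes:
        key = (value, code)
        # Missing pairs start from 0, so a new pair is registered with frequency 1.
        table[key] = table.get(key, 0) + 1


table = {("2753...a759", "20"): 621, ("2753...a759", "24"): 399}
# A baseball text (code "20") containing one known sentence and one new sentence.
record_classified(table, ["2753...a759", "h-new"], "20")
```

Only the frequency for the classification code of the input text changes; the count held for code "24" is left untouched.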

つぎに，ステップ４２５に進み，抽出部１８０が，テキストから重要文を決定し，出力部１８５が，抽出した重要文を要約テキストとして出力する。本実施形態では，抽出部１８０は，分割文ハッシュテーブル１７０ａに記憶された出現度数１７０ａ２と分類係数テーブル１７０ｂに記憶された相関値とを用いて重要文を抽出する。 Next, proceeding to step 425, the extraction unit 180 determines an important sentence from the text, and the output unit 185 outputs the extracted important sentence as a summary text. In the present embodiment, the extraction unit 180 extracts an important sentence using the appearance frequency 170a2 stored in the divided sentence hash table 170a and the correlation value stored in the classification coefficient table 170b.

具体的には，抽出部１８０は，各文に対応するハッシュ値の出願度数ｎおよび分類コードから求められる相関値ｋを変数とする関数ｆを用いて各文の重要度ｍを算出する。その関数ｆを以下に示す。 Specifically, the extraction unit 180 calculates the importance m of each sentence using a function f whose variables are the appearance frequency n of the hash value corresponding to each sentence and the correlation value k obtained from the classification codes. The function f is shown below.

ｍ＝Σｆ（ｋｉ・ｎｉ）＋α m = Σ f(ki · ni) + α
ここで，α：他の要因によって決定される重要度（任意） Here, α is an importance determined by other factors (optional).
ｎ＝Σｎｉ（ｉ＝分類係数テーブルのインデックス） n = Σ ni (i: index of the classification coefficient table)

このとき，ｍ１＝ｆ（ｎ１），ｍ２＝ｆ（ｎ２）において， Then, for m1 = f(n1) and m2 = f(n2),
ｎ１＞ｎ２ならば，ｍ１≦ｍ２となる。 if n1 > n2, then m1 ≤ m2.

たとえば，テキストの分類コードが野球（２０）である場合，抽出部１８０は，分割文ハッシュテーブル１７０ａに記憶された各項目の値と，分類係数テーブル１７０ｂに記憶された各相関値と，を用いて以下のように重要度ｍを算出する。 For example, when the classification code of the text is baseball (20), the extraction unit 180 calculates the importance m as follows, using the value of each item stored in the divided sentence hash table 170a and each correlation value stored in the classification coefficient table 170b.
ｍ＝ｆ（１・６２１）＋ｆ（１．５・３９９）＋α m = f(1 · 621) + f(1.5 · 399) + α
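The weighted calculation can be sketched numerically as follows. The description does not fix the function f; this example assumes f(x) = -x, which is non-increasing and keeps the sum non-increasing in every ni. The dictionary layout and the names `COEFFICIENTS` and `weighted_importance` are hypothetical, and only the baseball row of correlation values quoted in the text (1, 1.5 and 4) is filled in.

```python
# Outer key: classification code of the input text.
# Inner key: classification code stored in the divided sentence hash table.
COEFFICIENTS = {"20": {"20": 1.0, "24": 1.5, "06": 4.0}}


def f(x: float) -> float:
    """Assumed non-increasing function; the description leaves f unspecified."""
    return -x


def weighted_importance(counts: dict[str, int], text_code: str, alpha: float = 0.0) -> float:
    """m = sum over i of f(ki * ni) + alpha, for the codes stored for one hash value."""
    row = COEFFICIENTS[text_code]
    return sum(f(row[code] * n) for code, n in counts.items()) + alpha


# Baseball text: f(1 * 621) + f(1.5 * 399) with alpha = 0
m = weighted_importance({"20": 621, "24": 399}, text_code="20")
```

Under this assumed f the result is -1219.5, so occurrences counted in a weakly correlated classification lower the importance more than the same number of occurrences in the text's own classification.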

このようにして算出された各ハッシュ値に対する重要度ｍに基づいて，抽出部１８０は，重要度ｍが大きいハッシュ値１７０ａ１に対応した文を重要文として抽出する。抽出した重要文が携帯電話等に表示された後，ステップ９９５に進み本処理は終了となる。 Based on the importance m for each hash value calculated in this way, the extraction unit 180 extracts a sentence corresponding to the hash value 170a1 having a large importance m as an important sentence. After the extracted important sentence is displayed on the mobile phone or the like, the process proceeds to step 995 and the process is terminated.

以上に説明したように，本実施形態にかかる文書処理装置１００によれば，テキストの分類から相関値ｋを求め，相関値ｋを用いて出現度数に重み付けをすることにより，重要度ｍが求められる。ここで，特定の分野でのみ頻出する語や文に対する相関値ｋは，分類係数テーブル１７０ｂにて，予め，小さく設定されている。よって，本実施形態の場合，特定の分野でのみ頻出する文が，その他の分野でも頻出する文より重要度が高くなるように関数ｆに重み付けがなされる。このようにして，各文に対する重要度ｍが適切に算出され，算出された各重要度ｍに基づいてより適切な要約テキストを抽出することができる。 As described above, according to the document processing apparatus 100 according to the present embodiment, the importance value m is obtained by obtaining the correlation value k from the text classification and weighting the appearance frequency using the correlation value k. It is done. Here, the correlation value k for words and sentences that appear frequently only in a specific field is set to be small in advance in the classification coefficient table 170b. Therefore, in the case of this embodiment, the function f is weighted so that a sentence that appears frequently only in a specific field is more important than a sentence that frequently appears in other fields. In this way, the importance m for each sentence is appropriately calculated, and a more appropriate summary text can be extracted based on each calculated importance m.

（第３実施形態） (Third Embodiment)
つぎに，第３実施形態にかかる文書処理装置１００について説明する。本実施形態にかかる文書処理装置１００は，図１０に示したように，全文ハッシュテーブル１７０ｃと全ハッシュ計算部１９５と全ハッシュ検索部１９９とが新たに追加された点で図２に示した第１実施形態にかかる文書処理装置１００と機能構成上相異する。 Next, the document processing apparatus 100 according to the third embodiment will be described. As shown in FIG. 10, the document processing apparatus 100 according to the present embodiment differs in functional configuration from the document processing apparatus 100 according to the first embodiment shown in FIG. 2 in that a full-text hash table 170c, a full-hash calculation unit 195, and a full-hash search unit 199 are newly added.

また，本実施形態にかかる文書処理装置１００では，入力テキスト全体の文字列に対するハッシュ値（以下，全ハッシュ値と称呼する。）を求め，求められた全ハッシュ値が，全文ハッシュテーブル１７０ｃに予め登録されたハッシュ値に一致する場合には，図４の第１実施形態にかかる重要文抽出処理を実行せずに，該当全ハッシュ値に対応して全文ハッシュテーブル１７０ｃに予め登録された文を重要文とする点で第１実施形態にかかる文書処理装置１００と動作上相異する。したがって，これらの相異点を中心に本実施形態にかかる文書処理装置１００について説明する。 The document processing apparatus 100 according to the present embodiment also differs in operation from the document processing apparatus 100 according to the first embodiment: it obtains a hash value for the character string of the entire input text (hereinafter referred to as a total hash value) and, when the obtained total hash value matches a hash value registered in advance in the full-text hash table 170c, it does not execute the important sentence extraction process of the first embodiment shown in FIG. 4 but instead uses, as the important sentences, the sentences registered in advance in the full-text hash table 170c for that total hash value. The document processing apparatus 100 according to the present embodiment will therefore be described focusing on these differences.

本実施形態にかかる文書処理装置１００は，入力部１５０，分割部１５５，正規化部１６０，文ハッシュ計算部１６５，記憶部１７０内の分割文ハッシュテーブル１７０ａ，文ハッシュ検索部１７５，抽出部１８０，出力部１８５に加え，全文ハッシュテーブル１７０ｃ（記憶部１７０内），全ハッシュ計算部１９５および全ハッシュ検索部１９９の各ブロックにて示される機能を有している。 The document processing apparatus 100 according to this embodiment includes an input unit 150, a dividing unit 155, a normalizing unit 160, a sentence hash calculating unit 165, a divided sentence hash table 170a in the storage unit 170, a sentence hash searching unit 175, and an extracting unit 180. In addition to the output unit 185, the full-text hash table 170c (in the storage unit 170), the full-hash calculation unit 195, and the full-hash search unit 199 have functions shown in the respective blocks.

全文ハッシュテーブル１７０ｃには，図示されていないが，後述する全ハッシュ計算部１９５によりいままで計算された，各入力テキストの全文字列に対する全ハッシュ値（全計算値に相当）が，その入力テキスト対して以前に抽出された重要文に関連付けて蓄積されている。なお，全ハッシュ計算部１９５は，入力テキスト中の宛名情報や送信元情報を除いた本文を特定部分とし，その特定部分の全文字列に対する全ハッシュ値を計算してもよい。 Although not shown in the full-text hash table 170c, all hash values (corresponding to all calculated values) for all character strings of each input text calculated so far by the all-hash calculation unit 195 described later are the input text. On the other hand, it is accumulated in association with important sentences previously extracted. Note that the all hash calculation unit 195 may calculate the entire hash value for all the character strings of the specific part, with the text excluding the address information and the transmission source information in the input text as the specific part.

全ハッシュ計算部１９５は，入力テキストの全文字列に対するハッシュ値（全ハッシュ値）を計算する。全ハッシュ検索部１９９は，全ハッシュ計算部１９５により求められた各全ハッシュ値が，全文ハッシュテーブル１７０ｃに記憶されたいずれかのハッシュ値と一致するか否かを検索する。 The full-hash calculation unit 195 calculates a hash value (the total hash value) for the entire character string of the input text. The full-hash search unit 199 searches whether each total hash value obtained by the full-hash calculation unit 195 matches any of the hash values stored in the full-text hash table 170c.

（文書処理装置１００の動作） (Operation of the document processing apparatus 100)
つぎに，本実施形態にかかる文書処理装置１００の具体的動作について，図１１を参照しながら説明する。図１１は，本実施形態にかかる文書処理装置１００が実行する重要文抽出処理を示したフローチャートである。 Next, a specific operation of the document processing apparatus 100 according to the present embodiment will be described with reference to FIG. 11. FIG. 11 is a flowchart showing the important sentence extraction process executed by the document processing apparatus 100 according to the present embodiment.

ステップ１１００から重要文抽出処理が開始され，ステップ１１０５に進むと，全ハッシュ計算部１９５は，入力されたテキスト全体に対するハッシュ値（全ハッシュ値）を計算する。つぎに，ステップ１１１０に進んで，全ハッシュ検索部１９９は，全ハッシュ計算部１９５により求められた全ハッシュ値が，全文ハッシュテーブル１７０ｃに記憶されたいずれかのハッシュ値と一致するか否かのマッチング処理を行う。 The important sentence extraction process starts at step 1100. In step 1105, the full-hash calculation unit 195 calculates a hash value (the total hash value) for the entire input text. Next, in step 1110, the full-hash search unit 199 performs matching to determine whether the total hash value obtained by the full-hash calculation unit 195 matches any of the hash values stored in the full-text hash table 170c.

全ハッシュ検索部１９９によるマッチング処理の結果，全ハッシュ計算部１９５により求められた全ハッシュ値が全文ハッシュテーブル１７０ｃに記憶されていると判定された場合には，全ハッシュ検索部１９９は，ステップ１１１５にて「Ｙｅｓ」と判定し，直ちにステップ１１２０に進む。抽出部１８０は，ステップ１１２０にて，全ハッシュ値に対応して全文ハッシュテーブル１７０ｃに記憶されている重要文を要約テキストとして抽出する。この要約テキストは，出力部１８５により携帯電話を用いてユーザに表示された後，ステップ１１９５に進んで本処理は終了となる。 If, as a result of the matching by the full-hash search unit 199, it is determined that the total hash value obtained by the full-hash calculation unit 195 is stored in the full-text hash table 170c, the full-hash search unit 199 determines “Yes” in step 1115 and the process proceeds directly to step 1120. In step 1120, the extraction unit 180 extracts, as the summary text, the important sentences stored in the full-text hash table 170c in association with that total hash value. After the output unit 185 displays this summary text to the user on the mobile phone, the process proceeds to step 1195 and ends.

一方，全ハッシュ検索部１９９によるマッチング処理の結果，全ハッシュ計算部１９５により求められた全ハッシュ値が全文ハッシュテーブル１７０ｃに記憶されていないと判定された場合には，全ハッシュ検索部１９９は，ステップ１１１５にて「Ｎｏ」と判定し，ステップ４０５に進んで，ステップ４０５〜ステップ４２０にて，第１実施形態と同様の処理を実行することにより，入力テキストに対する各文のハッシュ値の出現度数１７０ａ２が検出される。 On the other hand, as a result of the matching process by the all-hash search unit 199, if it is determined that the all-hash value obtained by the all-hash calculation unit 195 is not stored in the full-text hash table 170c, the all-hash search unit 199 In step 1115, “No” is determined, the process proceeds to step 405, and in steps 405 to 420, the same processing as that of the first embodiment is executed, so that the frequency of appearance of the hash value of each sentence with respect to the input text is obtained. 170a2 is detected.

Next, the process proceeds to step 1120, where the extraction unit 180 extracts the important sentences based on the appearance count 170a2 by executing the same processing as in the first embodiment. The extraction unit 180 stores the extracted important sentences in the full-text hash table 170c in association with the all hash value obtained by the all-hash calculation unit 195. Also in step 1120, the output unit 185 outputs the extracted important sentences as the summary text, after which the process proceeds to step 1195 and ends.
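
The flow of steps 1105 to 1120 can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the implementation disclosed in the patent: SHA-256 stands in for the unspecified hash function, two in-memory Python containers stand in for the divided-sentence hash table 170a and the full-text hash table 170c, and the names `digest`, `split_sentences`, `summarize`, and `SUMMARY_SIZE` are hypothetical.

```python
import hashlib
import re
from collections import Counter

# Hypothetical stand-ins for the tables held in the storage unit 170.
sentence_counts = Counter()   # plays the role of divided-sentence hash table 170a
full_text_cache = {}          # plays the role of full-text hash table 170c

SUMMARY_SIZE = 1              # assumed number of sentences kept in a summary


def digest(text):
    """Hash a string; SHA-256 is an assumption, the patent names no function."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def split_sentences(text):
    """Naive sentence split after '.', '!', '?' or the Japanese full stop."""
    parts = re.split(r"(?<=[.!?。])\s*", text)
    return [p.strip() for p in parts if p.strip()]


def summarize(text):
    whole = digest(text)                      # step 1105: hash of the entire text
    if whole in full_text_cache:              # steps 1110 and 1115: lookup in 170c
        return full_text_cache[whole]         # step 1120: reuse the stored summary

    # "No" branch: per-sentence hashing and counting (steps 405 to 420).
    sentences = split_sentences(text)
    for s in sentences:
        sentence_counts[digest(s)] += 1

    # Sentences whose hash has been seen least often are treated as important.
    # sorted() is stable, so ties keep their order of appearance in the text.
    ranked = sorted(sentences, key=lambda s: sentence_counts[digest(s)])
    chosen = set(ranked[:SUMMARY_SIZE])
    summary = [s for s in sentences if s in chosen]

    full_text_cache[whole] = summary          # store alongside the all hash value
    return summary
```

In this sketch, calling `summarize` a second time with identical text returns the stored summary without touching `sentence_counts`, which mirrors the point that repeated input does not inflate the appearance counts.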

As described above, according to the document processing apparatus 100 of this embodiment, even when mail with the same content is input multiple times because of an operation error or erroneous transmission, the appearance frequency values do not become higher than necessary and instead take appropriate values that match the actual situation. This prevents the importance of each sentence from being lowered more than necessary.

Furthermore, according to the document processing apparatus 100 of this embodiment, when text identical to previously input text is input, for example when mail with the same content is input multiple times, the processing shown in steps 405 to 420 need not be executed. That is, the summary text can be provided to the user quickly by reusing the previously extracted important sentences, while the processing load is reduced.

In all the embodiments described above, e-mail was used as the example of input text. However, the input is not limited to e-mail; the document processing apparatus 100 can handle multiple types of text, such as text the user has created.

Likewise, in all the embodiments described above, the document processing apparatus 100 was described using the example of summarizing received e-mail. However, the apparatus is not limited to this example; it may be used, for example, to summarize a document entered with a keyboard or the like, or a document stored in a storage area. The document processing apparatus 100 may also be used to summarize text the user has written before that text is sent.

The appearance count 170a2 of each hash value described in the above embodiments is one example of the appearance frequency of each hash value; the appearance frequency may be any value that indicates the rate at which each hash value appears. Other examples of the appearance frequency include the appearance rate of a given hash value relative to all hash values stored in the divided-sentence hash table 170a, and the deviation of a given hash value from the average appearance frequency of all those hash values.
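
The two alternative measures mentioned above might be computed from raw appearance counts as in the following sketch. The function names and the dictionary layout (hash value mapped to count) are assumptions made for illustration, and "deviation" is read here simply as the difference from the mean count; the patent does not fix a formula.

```python
from statistics import mean


def appearance_rates(counts):
    """Share of each hash value among all recorded appearances."""
    if not counts:
        return {}
    total = sum(counts.values())
    return {h: c / total for h, c in counts.items()}


def deviations_from_mean(counts):
    """Difference between each count and the average count over all hashes."""
    if not counts:
        return {}
    average = mean(counts.values())
    return {h: c - average for h, c in counts.items()}


# Hypothetical table contents: hash value -> appearance count.
example_counts = {"a": 6, "b": 3, "c": 3}
rates = appearance_rates(example_counts)
deviations = deviations_from_mean(example_counts)
```

Either measure preserves the ordering of the raw counts, so a selection rule that picks the least frequent hash values gives the same ranking; the choice mainly affects how values compare across tables of different sizes.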

In the above description, the dividing unit 155 divides the text into sentences. However, the dividing unit 155 only needs to be able to divide the text into a plurality of character strings based on a predetermined condition, and does not necessarily have to divide it into sentences. For example, the dividing unit 155 may divide the text into clauses or paragraphs. More specifically, the dividing unit 155 may divide the text into paragraphs based on the condition that the text is split wherever a line break occurs. Alternatively, based on the condition that the text is split wherever a comma or a period appears, the text "Hello, this is ○○." may be divided into the two character strings "Hello" and "this is ○○". In this case, given the text "Hello, this is ○○." "Hello, this is △△.", the document processing apparatus 100 can delete the frequently appearing character string "Hello", or the lines containing "Hello", and thereby extract the important sentences from the text (that is, a summary consisting of "this is ○○." and "this is △△.").
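
The comma-or-period splitting example above can be illustrated with a short sketch. It is a simplification under stated assumptions: fragments are counted directly rather than through hash values, the threshold `max_count` is hypothetical, and the helper names are not taken from the patent.

```python
import re
from collections import Counter


def split_on_punctuation(text):
    """Split wherever a comma or period (ASCII or Japanese) appears."""
    parts = re.split(r"[、，,。．.]", text)
    return [p.strip() for p in parts if p.strip()]


def summarize_fragments(lines, max_count=1):
    """Keep only fragments that appear at most max_count times overall."""
    fragments = [f for line in lines for f in split_on_punctuation(line)]
    counts = Counter(fragments)
    return [f for f in fragments if counts[f] <= max_count]


result = summarize_fragments(["Hello, this is ○○.", "Hello, this is △△."])
# result == ["this is ○○", "this is △△"]
```

Because "Hello" appears in both inputs, its count exceeds the threshold and it is dropped, leaving the two fragments that differ between the messages.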

In the above description, the output unit 185 displays the summary text on the display of the mobile phone. However, the output is not limited to this. For example, as shown in FIG. 1, the output unit 185 may output the summary text to the CRT 300 or the printer 305 of another device, store the summary text on the network card 310 or the like, or output the summary text as audio information to the audio output device 315.

In the above embodiments, the operations of the units are related to one another and, taking those relationships into account, can be restated as a series of operations. Restated in this way, the embodiments of the document processing apparatus become embodiments of a document processing method.

Similarly, by replacing the operation of each unit with the processing of each unit, a program embodiment is obtained. By storing the program on a computer-readable recording medium, the program embodiment becomes an embodiment of a computer-readable recording medium on which the program is recorded.

Accordingly, the embodiment of the document processing method can be made an embodiment of a document processing program that causes a computer to execute: a process of storing, in a storage unit, a calculated value calculated based on a predetermined rule in association with an appearance frequency representing how often that calculated value has been calculated; a process of dividing a document into a plurality of character strings based on a predetermined condition; a process of obtaining, based on the predetermined rule, a calculated value unique to each character string from each of the divided character strings; a process of searching for the appearance frequency corresponding to each obtained calculated value by comparing each obtained calculated value with the calculated values stored in the storage unit; and a process of selecting one or more calculated values based on the appearance frequencies corresponding to the retrieved calculated values and extracting the character strings corresponding to the selected one or more calculated values as a summary of the document.

The embodiment of the document processing method can also be made an embodiment of a computer-readable recording medium storing a document processing program that causes a computer to execute: a process of storing, in a storage unit, a calculated value calculated based on a predetermined rule in association with an appearance frequency representing how often that calculated value has been calculated; a process of dividing a document into a plurality of character strings based on a predetermined condition; a process of obtaining, based on the predetermined rule, a calculated value unique to each character string from each of the divided character strings; a process of searching for the appearance frequency corresponding to each obtained calculated value by comparing each obtained calculated value with the calculated values stored in the storage unit; and a process of selecting one or more calculated values based on the appearance frequencies corresponding to the retrieved calculated values and extracting the character strings corresponding to the selected one or more calculated values as a summary of the document.

Preferred embodiments of the present invention have been described above with reference to the accompanying drawings, but it goes without saying that the present invention is not limited to these examples. It is clear that a person skilled in the art could conceive of various changes or modifications within the scope of the claims, and these are naturally understood to fall within the technical scope of the present invention.

For example, in the document processing apparatus 100 according to each of the above embodiments, all of the units were described as residing within the document processing apparatus 100. However, the present invention is not limited to this; some of the functions of the units may be included in separate control means connected via a network, and multiple instances of the means and functions of each unit may exist for load distribution or to ensure safety.

The present invention is applicable to a document processing apparatus and a document processing method that automatically extract important parts from a document at high speed.

A hardware configuration diagram of the document processing apparatus according to the first embodiment of the present invention.
A functional configuration diagram of the document processing apparatus according to the same embodiment.
A diagram showing an example of the data structure of the divided-sentence hash table.
A flowchart showing the important sentence extraction routine executed by the document processing apparatus in the same embodiment.
A functional configuration diagram of the document processing apparatus according to the second embodiment of the present invention.
A diagram showing another example of the data structure of the divided-sentence hash table.
A configuration diagram showing an example of the data structure of the classification coefficient table.
A configuration diagram showing another example of the data structure of the classification coefficient table.
A flowchart showing the important sentence extraction routine executed by the document processing apparatus in the same embodiment.
A functional configuration diagram of the document processing apparatus according to the third embodiment of the present invention.
A flowchart showing the important sentence extraction routine executed by the document processing apparatus in the same embodiment.

Explanation of symbols

100 Document processing apparatus
120 Processor
150 Input unit
155 Dividing unit
160 Normalization unit
165 Sentence hash calculation unit
170 Storage unit
170a Divided-sentence hash table
170b Classification coefficient table
170c Full-text hash table
175 Sentence hash search unit
180 Extraction unit
185 Output unit
190 Text classification unit
195 All-hash calculation unit
199 All-hash search unit

Claims

1. A document processing apparatus comprising:
a storage unit that stores a calculated value calculated based on a predetermined rule in association with an appearance frequency representing how often that calculated value has been calculated;
a dividing unit that divides a document into a plurality of character strings based on a predetermined condition;
a calculation unit that obtains, based on the predetermined rule, a calculated value unique to each character string from each of the divided character strings;
a search unit that detects the appearance frequency corresponding to each calculated value by comparing each calculated value obtained by the calculation unit with the calculated values stored in the storage unit; and
an extraction unit that selects one or more calculated values based on the appearance frequency corresponding to each detected calculated value and extracts the character strings corresponding to the selected one or more calculated values as a summary of the document.

2. The document processing apparatus according to claim 1, wherein the calculation unit uses a hash function as the predetermined rule to obtain, from each character string, a hash value as the calculated value.

3. The document processing apparatus according to claim 1 or 2, wherein the extraction unit compares the appearance frequencies corresponding to the detected calculated values and selects the one or more calculated values in order, starting from the calculated value stored in association with the lowest appearance frequency.

4. The document processing apparatus described above, wherein the search unit, when the comparison determines that a calculated value obtained by the calculation unit is stored in the storage unit, increases the appearance frequency stored in association with that calculated value, and, when the comparison determines that the calculated value is not stored in the storage unit, newly stores an appearance frequency having a given value together with the calculated value.

5. The document processing apparatus according to claim 1, wherein the dividing unit divides the document into a plurality of character strings each constituting a clause, a sentence, or a paragraph.

6. The document processing apparatus according to claim 1, further comprising a normalization unit that adjusts the format of the character strings included in either the document or each divided sentence.

7. The document processing apparatus according to claim 1, further comprising a text classification unit that determines an attribute of the document, wherein
the storage unit stores a plurality of appearance frequencies of each calculated value, one for each document attribute, and
the extraction unit weights each of the plurality of appearance frequencies stored in the storage unit for each document attribute, using a correlation value determined from the correlation between the determined attribute of the document and the document attribute stored in the storage unit, and selects one or more calculated values based on the weighted appearance frequencies.

8. The document processing apparatus according to claim 7, wherein the extraction unit uses the correlation values to weight, for each calculated value, the plurality of appearance frequencies stored for each document attribute, calculates the sum of the weighted appearance frequencies as the importance corresponding to that calculated value, and selects one or more corresponding calculated values in descending order of the calculated importance.

9. The document processing apparatus according to claim 7, wherein the correlation value is set to a larger value as the association between the attribute of the document determined by the text classification unit and the document attribute stored in the storage unit becomes smaller.

10. The document processing apparatus according to any one of claims 1 to 9, wherein
the calculation unit obtains, as a total calculated value, a calculated value unique to a specific part or the whole of the document,
the search unit searches whether the total calculated value is stored in the storage unit,
when the total calculated value is not stored in the storage unit, after the processing of the dividing unit that divides the document into a plurality of character strings based on the predetermined condition, the processing of the calculation unit that obtains, based on the predetermined rule, a calculated value unique to each character string from each of the divided character strings, and the processing of the search unit that detects the appearance frequency corresponding to each calculated value by comparing each obtained calculated value with the calculated values stored in the storage unit, the extraction unit extracts, as a summary of the document, the character strings corresponding to the one or more calculated values selected based on the appearance frequency corresponding to each detected calculated value, and
when the total calculated value is stored in the storage unit, the extraction unit extracts, as a summary of the document, the character string stored in the storage unit in association with the total calculated value.

11. A document processing method comprising:
storing, in a storage unit, a calculated value calculated based on a predetermined rule in association with an appearance frequency representing how often that calculated value has been calculated;
dividing a document into a plurality of character strings based on a predetermined condition;
obtaining, based on the predetermined rule, a calculated value unique to each character string from each of the divided character strings;
searching for the appearance frequency corresponding to each obtained calculated value by comparing each obtained calculated value with the calculated values stored in the storage unit; and
selecting one or more calculated values based on the appearance frequency corresponding to each retrieved calculated value and extracting the character strings corresponding to the selected one or more calculated values as a summary of the document.