JP2008197952A

JP2008197952A - Text segmentation method, its device, its program and computer readable recording medium

Info

Publication number: JP2008197952A
Application number: JP2007033077A
Authority: JP
Inventors: Naoto Abe; 直人阿部; Katsuyoshi Tanabe; 勝義田邊; Hidenori Okuda; 英範奥田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2007-02-14
Filing date: 2007-02-14
Publication date: 2008-08-28

Abstract

PROBLEM TO BE SOLVED: To achieve text segmentation that handles a variety of fields, without requiring a learning database, when dividing an input text into blocks of semantically coherent sentences.

SOLUTION: To divide a text into blocks of coherent content, the input text is split into sentences, and the nouns of each sentence are used as search words (S1 to S3). A web search is then performed with the search words. Nouns with a high frequency of appearance are extracted from the text returned by the web search and used as related words (S4). The search words plus the related words form the keyword set of the sentence. When adjacent sentences share at least a predetermined number of keywords, they are gathered into one block; otherwise a text boundary is placed between them. The segmentation result is then output (S5, S6).

COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to a method, in fields where text is handled by a computer such as a personal computer, for dividing the sentences of a text into single sentences or groups of sentences according to the content the text describes. In particular, it relates to a text segmentation method that uses web search and therefore does not use a learning database.

In recent years, rapid improvements in computer performance have made it possible to build databases that accumulate enormous amounts of text (here, sets of sentences consisting only of character strings). However, organizing and managing the stored text by hand has generally become difficult. The technique of dividing a given text according to its content is called text segmentation, and it is increasingly being applied to the automatic classification and organization of text databases by computer. For example, Patent Document 1 below (topic boundary determination method, apparatus, and program) proposes a technique that performs text segmentation using information called a concept base.

In this technique, a plurality of concept vectors, each a numerical vector representing a word and the patterns that co-occur with it, are created from a learning database accumulated in advance. Text segmentation is then performed using a concept base, which is a collection of concept vectors. The learning database holds many texts related to a single field (for example, texts related only to the field of "politics").

Patent Document 1: JP 2004-234512 A

However, to increase the accuracy of the conventional text segmentation method, a large-scale learning database must be prepared, and preparing it requires enormous effort. If the learning database is small, the concept base cannot be created properly, and the accuracy of text segmentation falls. In addition, because a learning database prepared in advance covers only a specific field, text segmentation cannot be performed on texts from other fields.

An object of the present invention is to provide a text segmentation technique that does not require a learning database and can handle texts in a variety of fields.

In general, it is desirable to be able to perform text segmentation without preparing a learning database. The present invention solves the above problem with a new text segmentation technique built on the idea of searching the web.

In the present invention, to divide a single text into coherent pieces of content, for example, the text of the document to be processed is first split into sentences, and the nouns of each sentence are used as search words. When nouns are used as search words, unnecessary nouns are removed from the candidates beforehand as needed. Next, a web search is performed with the search words. Nouns with a high frequency of appearance are extracted from the text returned by the web search and used as related words. The search words plus the related words form the keyword set of the sentence, and where adjacent sentences do not share at least a predetermined number of keywords, a text boundary is placed between them.

Here, the web means the collection of texts written in structured languages such as HTML and XML that can be accessed via a network such as the Internet. A huge amount of information is now accumulated on the web, and the latest topics are constantly being added. In other words, the web can be regarded as a dictionary holding all kinds of information. Indeed, when we look into something, we enter search words on a search site and search the web to find out what a word means or what something involves. From that point of view, even without a learning database, appropriate use of the information on the web makes it possible to obtain concepts such as "sports" and "ball" as corresponding to "soccer" and "baseball". As a result, words that match the content of a text can be obtained from the varied information on the web, so that the relatedness of sentences can be compared broadly and the content of the sentences can be tracked. There is therefore no need to prepare a learning database in advance or to maintain and manage one, and text segmentation that handles a variety of fields can be realized.

As described above, the present invention focuses on the fact that a web search with search words retrieves multiple texts, written in structured languages, that relate to the content of the input text, and that multiple words highly related to that content can be extracted from them. As a result, the object of the invention is achieved: text segmentation that requires no learning database and can handle a variety of fields.

Specifically, the present invention is a text segmentation method in which a computer divides a text, that is, a set of sentences in electronic form, into blocks each consisting of one or more sentences. The method receives the text to be divided; splits the input text into sentences; performs morphological analysis on each sentence and extracts search words for each sentence; searches the web using the search words extracted for each sentence and acquires related words from the search results; creates, for each sentence, a keyword set combining its search words and related words; compares the keyword sets of adjacent sentences and decides, from the number of keywords they share, whether to merge the compared sentences into one block, thereby generating blocks; and outputs the generated blocks as the result of dividing the text.

As a result, the input text can be divided accurately, without a learning database, into blocks of sentences that are semantically coherent or that refer to the same content.

When related words are acquired from the results of the web search, nouns are extracted from the multiple texts, written in structured languages, that make up the search results; the frequency of appearance of each extracted word is calculated; and a predetermined number of words are selected as related words in descending order of frequency. The number of related words may be chosen, for example, so that the total number of search words and related words for one sentence is a constant value. This makes it possible to select, as related words, words that are likely to be closely related to the search words, and also to guarantee to some degree that the blocks into which the text is divided are uniform in the strength of their semantic coherence.

When search words are extracted from each sentence, or related words are acquired from the results of the web search, the nouns contained in each sentence or in the search results, excluding words registered in advance in an unnecessary word list, are used as the search words or related words. This avoids cases in which a text cannot be divided because of words with little semantic content.

According to the present invention, using the idea of searching the web realizes a text segmentation technique that does not require a learning database to be prepared in advance. In addition, because it draws on the varied information accumulated on the web, it has the advantage of placing no restriction on the field of the text to be segmented.

The present invention can be applied, in fields that handle enormous amounts of text data and fields that distribute news articles, as a supporting technology for automatically organizing and updating text data.

FIG. 1 shows an outline of the processing procedure of the present invention. In FIG. 1, step S1 inputs a text. Step S2 splits the input text into sentences. Step S3 extracts search words from each sentence. Step S4 searches the web using the search words and acquires related words from the search results. Step S5 divides the text using keyword sets, each combining search words and related words. Step S6 outputs the text segmentation result for the text divided in step S5.

FIG. 2 shows the configuration of a text segmentation processing apparatus according to an embodiment of the present invention, which uses web search and does not use a learning database. In FIG. 2, a computer 1 has the following components, implemented by software programs, storage devices, and the like: a text decomposition processing unit 11, a search word extraction processing unit 12, a related word acquisition processing unit 13, a text division processing unit 14, a control unit 15, an input unit 16, an output unit 17, a decomposed sentence storage unit 20, a search word storage unit 30, a related word storage unit 40, and a divided block storage unit 50.

The computer 1 is connected to a network 3 and can access the web 4. The web 4 holds a plurality of texts 5 written in structured languages such as HTML and XML. Text 6 is the text input to the input unit 16 of the computer 1. The display unit 2 is a device for displaying the results output from the control unit 15 through the output unit 17.

FIG. 3 shows an example of the text 6 in the embodiment of the present invention. The text 6 shown in FIG. 3 is an example of text received by the input unit 16 as the target of segmentation, and is used to describe the embodiment.

FIG. 4 shows an example of the sentences stored in the decomposed sentence storage unit 20 in the embodiment of the present invention. In FIG. 4, reference numerals 21 to 29 denote the first to ninth sentences of the text 6, respectively.

FIG. 5 shows an example of the unnecessary word list in the embodiment of the present invention. In FIG. 5, reference numeral 60 denotes the unnecessary word list, in which words to be ignored during segmentation are registered in advance in a storage unit (not shown).

FIG. 6 shows an example of the search words stored in the search word storage unit 30 in the embodiment of the present invention. In FIG. 6, reference numerals 31 to 39 denote the search words corresponding to sentences 21 to 29, respectively.

FIG. 7 shows an example of the related words stored in the related word storage unit 40 in the embodiment of the present invention. In FIG. 7, reference numerals 41 to 49 denote the related words corresponding to search words 31 to 39, respectively.

FIG. 8 shows an example of the keyword sets created by the text division processing unit 14 in the embodiment of the present invention. In FIG. 8, reference numerals 71 to 79 denote the keyword sets generated from the pairs of search words 31 to 39 and related words 41 to 49, respectively; for example, keyword set 71 is generated from search word 31 and related word 41.

FIG. 9 shows an example of the sentence numbers belonging to each block, as stored in the divided block storage unit 50 in the embodiment of the present invention. In FIG. 9, reference numeral 51 denotes the sentence numbers belonging to the first block, and 52 denotes the sentence numbers belonging to the second block.

The text segmentation procedure according to the embodiment of the present invention is now described in detail with a concrete example. First, when the text 6 is input through the input unit 16, the control unit 15 calls the text decomposition processing unit 11.

The text decomposition processing unit 11 reads the text 6 one character at a time and cuts it into sentences. It then stores the resulting sentences in the decomposed sentence storage unit 20 via the control unit 15. Here, a sentence is a unit delimited by the full stop "。". Full stops that appear inside quoted speech enclosed in bracket symbols such as "「" and "」" are ignored. The number of sentences generated varies with the input text 6. As an example, when the text decomposition processing unit 11 is run on the text 6 shown in FIG. 3, nine sentences, 21 to 29, are generated as shown in FIG. 4 and stored in the decomposed sentence storage unit 20 via the control unit 15.
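The splitting rule just described can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `split_sentences` is hypothetical, and tracking the nesting depth of 「」 brackets is an assumption about how "ignoring" quoted full stops might be realized.

```python
def split_sentences(text: str) -> list[str]:
    """Split text at the full stop "。", skipping full stops inside 「...」."""
    sentences = []
    current = []
    depth = 0  # nesting depth of 「...」 quotations (assumed handling)
    for ch in text:  # read the text one character at a time
        current.append(ch)
        if ch == "「":
            depth += 1
        elif ch == "」":
            depth = max(0, depth - 1)
        elif ch == "。" and depth == 0:
            # A full stop outside any quotation closes the sentence.
            sentences.append("".join(current).strip())
            current = []
    tail = "".join(current).strip()
    if tail:
        # Keep trailing text that has no closing full stop.
        sentences.append(tail)
    return sentences


print(split_sentences("今日は晴れ。彼は「行こう。早く」と言った。終わり"))
```

The full stop inside the quotation does not end the second sentence, so the sample input yields three sentences.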

Next, the control unit 15 runs the search word extraction processing unit 12 on each sentence stored in the decomposed sentence storage unit 20. Here, search words are the one or more words entered when searching the web. The search word extraction processing unit 12 first performs morphological analysis on each input sentence. It then takes the words determined to be nouns as search words and stores them in the search word storage unit 30 via the control unit 15.

However, simply taking the nouns also picks up commonly used words such as "month" (月) and "day" (日). An unnecessary word list 60 containing words such as "month" and "day" is therefore created in advance, and only nouns not registered in the unnecessary word list 60 are treated as search words. FIG. 5 shows an example of such an unnecessary word list 60.

On the other hand, when all the words extracted as nouns are unnecessary words, as in sentence 27, or when the sentence contains no nouns at all, as in sentence 23, the search word extraction processing unit 12 extracts no search words. In that case, the search word extraction processing unit 12 stores no words in the search word storage unit 30, as with the search word 33 in FIG. 6.

In addition, only a few search words may be extracted from some sentences. Such sentences can be treated as carrying no particular semantic content without causing problems. Therefore, when the number of extracted search words is less than or equal to a predetermined threshold S_T, the search word extraction processing unit 12 treats the sentence as having no search words and stores no words in the search word storage unit 30. For example, with S_T = 1, the only noun extracted from sentence 22 is "Kurihama" (久里浜), so the sentence is treated as having no search words, and nothing is stored, as with the search word 32 in the search word storage unit 30. FIG. 6 shows, as search words 31 to 39, an example of the search words for sentences 21 to 29 when S_T = 1.
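The extraction rules above (nouns only, unnecessary words removed, S_T or fewer treated as none) can be sketched as follows. The sketch assumes the morphological analyzer has already produced (surface form, part of speech) pairs; the function name, the "noun" tag string, and the removal of duplicate nouns are illustrative assumptions, and the sample inputs other than "久里浜" are invented.

```python
STOP_WORDS = {"月", "日"}  # stands in for the unnecessary word list 60


def extract_search_words(morphemes, stop_words=STOP_WORDS, s_t=1):
    """Return the search words of one sentence, or [] if there are too few.

    morphemes: (surface, part_of_speech) pairs from some morphological
    analyzer (hypothetical input format).
    """
    nouns = []
    for surface, pos in morphemes:
        # Keep nouns not in the unnecessary word list; skipping duplicates
        # is an assumption, not something the source specifies.
        if pos == "noun" and surface not in stop_words and surface not in nouns:
            nouns.append(surface)
    # S_T or fewer candidates: treat the sentence as having no search words.
    return nouns if len(nouns) > s_t else []


# Sentence 22 in the description yields only "久里浜", so nothing is kept.
print(extract_search_words([("久里浜", "noun"), ("へ", "particle")]))
print(extract_search_words(
    [("温泉", "noun"), ("旅行", "noun"), ("日", "noun"), ("に", "particle")]
))
```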

After search words 31 to 39 corresponding to sentences 21 to 29 have been created, the control unit 15 calls the related word acquisition processing unit 13. The related word acquisition processing unit 13 first retrieves the search words extracted by the search word extraction processing unit 12 from the search word storage unit 30 via the control unit 15 and takes them as input. It then uses them to search the web 4, to which it is connected through the network 3. Next, it acquires from the web 4, via the network 3, a predetermined number P of the texts 5 referenced in the search results, which are written in structured languages such as HTML and XML, and extracts the body content from the acquired texts.

In the web search for acquiring related words, multiple search words are, as a rule, combined with an AND condition. That is, the search basically looks for web pages in which all the search words appear. The reason is that when multiple search words are combined with an OR condition, web pages containing only some of the search words appear in the results, and words only weakly related to the search words as a whole are then highly likely to be extracted as related words.
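The difference between the two conditions can be seen on a toy, in-memory stand-in for the web. This is only an illustration: the page contents are invented, and a real search engine's query syntax and ranking are outside the scope of the sketch.

```python
def matching_pages(pages, terms, mode="and"):
    """Return the names of pages whose body contains all (AND) or any (OR) terms."""
    combine = all if mode == "and" else any
    return [name for name, body in pages.items()
            if combine(term in body for term in terms)]


# Invented sample pages standing in for web search targets.
pages = {
    "a": "温泉 旅行 の 案内",
    "b": "温泉 の 成分",
    "c": "野球 の 試合",
}

print(matching_pages(pages, ["温泉", "旅行"], mode="and"))  # only page "a"
print(matching_pages(pages, ["温泉", "旅行"], mode="or"))   # pages "a" and "b"
```

Under the OR condition, page "b" is returned even though it says nothing about travel, so nouns such as "成分" could end up among the related words.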

By parsing the tags in the texts written in structured languages, that is, the character strings enclosed between "<" and ">", P body texts containing the body content are obtained. After extracting the P body texts, the related word acquisition processing unit 13 performs morphological analysis on them and extracts the nouns. It then examines the frequency of appearance of the extracted nouns and stores a number of words, in descending order of frequency, in the related word storage unit 40 as related words.
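The tag parsing and frequency counting described above can be sketched with Python's standard library. The sketch makes several assumptions: `html.parser.HTMLParser` stands in for the tag analysis, the contents of `script` and `style` elements are skipped (the source does not mention them), and `extract_nouns` is a hypothetical callable representing the morphological analyzer.

```python
from collections import Counter
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect text outside tags, skipping script and style content."""

    SKIP = {"script", "style"}  # assumed non-body elements

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def body_text(html: str) -> str:
    """Return the text content of one page with the tags removed."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def noun_frequencies(texts, extract_nouns) -> Counter:
    """Count nouns over all body texts; extract_nouns is the analyzer stand-in."""
    counts = Counter()
    for text in texts:
        counts.update(extract_nouns(text))
    return counts


page = ("<html><head><style>p{}</style></head><body><p>温泉 旅行</p>"
        "<script>x=1</script><p>温泉</p></body></html>")
text = body_text(page)
# str.split is used here only as a placeholder for noun extraction.
print(noun_frequencies([text], str.split).most_common())
```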

However, if the nouns are used directly as related words, universally used words such as "month" and "day" may be treated as related words, just as in the search word extraction processing unit 12. The related word acquisition processing unit 13 therefore also refers to the unnecessary word list 60 shown in FIG. 5, as the search word extraction processing unit 12 does, and stores only words not registered in the unnecessary word list 60 in the related word storage unit 40 as related words.

As one example of related words, FIG. 7 shows the related word 41 obtained when the search word 31 is input. In the related word acquisition processing unit 13, the number of related words obtained differs depending on the search words that are input. To adjust the number of related words acquired, the number to collect is therefore set using a threshold T on the total number of search words and related words. Specifically, if the search word extraction processing unit 12 extracted S search words for a sentence, the related word acquisition processing unit 13 extracts only T − S related words from the body texts obtained by the web search. If the number of search words exceeds the total T, no related words are acquired, and only T randomly selected search words are kept.
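The T − S rule can be sketched as follows, assuming the noun frequencies are already available as a `collections.Counter`. The function name is hypothetical, and skipping candidates that are already search words is an assumption made so that the keyword total stays at T; the sample words and counts are invented.

```python
import random
from collections import Counter


def select_keywords(search_words, noun_counts, t, stop_words=frozenset(), rng=random):
    """Return (search words kept, related words) with at most t words in total."""
    if len(search_words) > t:
        # More search words than T: keep T at random, take no related words.
        return rng.sample(search_words, t), []
    wanted = t - len(search_words)  # T - S related words
    related = []
    for word, _count in noun_counts.most_common():  # descending frequency
        if len(related) == wanted:
            break
        # Excluding existing search words is an assumption of this sketch.
        if word not in stop_words and word not in search_words:
            related.append(word)
    return list(search_words), related


counts = Counter({"温泉": 5, "旅行": 4, "日": 3, "宿": 2})
print(select_keywords(["温泉", "箱根"], counts, t=4, stop_words={"日"}))
```

With S = 2 and T = 4, two related words are taken: "旅行" and "宿", since "温泉" is already a search word and "日" is an unnecessary word.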

Furthermore, although the number of web pages obtained by the related word acquisition processing unit 13 differs depending on the search words that are input, in the present invention it is desirable to obtain as many web pages as possible. A threshold P_T is therefore set on the number of web pages obtained by the web search. If the number of web pages obtained does not exceed the threshold P_T, the search is judged to be insufficient, extraction of related words is abandoned, and no words are stored in the related word storage unit 40. For example, suppose that a web search using the search word 38 for sentence 28 returns four hits. With the threshold set to P_T = 5, no related words are extracted from the web pages acquired by the related word acquisition processing unit 13 for the search word 38, and no words are stored in the related word storage unit 40, as with the related word 48. FIG. 7 also shows an example of the related words 41 to 49 corresponding to sentences 21 to 29 when P_T = 5.

Finally, when the search word extraction processing unit 12 and the related word acquisition processing unit 13 have finished processing all the sentences stored in the decomposed sentence storage unit 20, the control unit 15 runs the text division processing unit 14. The text division processing unit 14 first retrieves, in order, the search words and related words stored in the search word storage unit 30 and the related word storage unit 40 via the control unit 15 and creates a keyword set with them as its elements, repeating this for each sentence.

For example, for sentence 21, the keyword set 71 of FIG. 8 is created from the search word 31 and the related word 41. When a sentence has no search words, as with sentence 22 (32 in FIG. 6), it has no corresponding related words either, so it has no keyword set. On the other hand, when a sentence has search words but no related words, as with sentence 28, the keyword set is created from the search word 38 alone.
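The three cases just described (both present, no search words, search words only) can be sketched in a few lines. Representing a missing keyword set as an empty Python `set` is an assumption of this sketch, and the sample words are invented.

```python
def build_keyword_sets(search_words, related_words):
    """Build one keyword set per sentence from parallel lists of word lists.

    A sentence without search words gets an empty set, which stands for
    "no keyword set" (an assumed representation).
    """
    return [
        (set(s) | set(r)) if s else set()
        for s, r in zip(search_words, related_words)
    ]


search = [["温泉", "旅行"], [], ["箱根", "宿"]]
related = [["観光"], [], []]  # the third sentence has search words only
print(build_keyword_sets(search, related))
```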

When the keyword sets have been created, the text division processing unit 14 compares the generated keyword sets of the sentences two at a time, in order from the beginning, and divides the given text 6 by checking for common words. In general, text is usually written in order from the beginning. The present invention therefore examines the keyword sets of two sentences at a time in order from the beginning; if the number of common words is greater than or equal to a predetermined threshold C_T, no division is made, and if it is less than C_T, a division is made. This comparison is repeated until the last pair in the text has been compared. The blocks obtained by the text division processing unit 14, each consisting of one or more sentences, are then stored in the divided block storage unit 50 via the control unit 15.

Here, when the text division processing unit 14 compares two keyword sets, it always uses sets that contain words. The specific procedure is described using the keyword sets of FIG. 8 with C_T = 1.

First, a comparison of keyword set 71 and keyword set 72 in FIG. 8 is attempted. However, keyword set 72 contains no words, so the comparison is not performed and the next keyword set that contains words is sought. As a result, keyword set 71 and keyword set 74, both of which contain words, become the first pair to be compared. Examining the words common to keyword set 71 and keyword set 74 yields one common word, "travel". Since this count is equal to or greater than the threshold C_T = 1, sentences 21 through 24 are treated as one block.

Next, keyword set 74 and keyword set 75 are compared. This yields two common words, "hot spring" and "travel", so sentences 21 through 25 are treated as one block. Keyword set 75 and keyword set 76 are then compared; because they have no common word, the first block is determined to consist of sentences 21 through 25, and the numbers of the sentences belonging to the first block are stored in the divided block storage unit 50 via the control unit 15, as shown by sentence numbers 51 in FIG. 9.

When the same processing is repeated for keyword set 76 and the keyword sets that follow it, the second block is found to consist of sentences 26 through 29, and the numbers of the sentences belonging to the second block are stored in the divided block storage unit 50, as shown by sentence numbers 52 in FIG. 9. The result in FIG. 9 shows that the given text 6 is divided into two blocks: the first to fifth sentences belong to the first block, and the sixth to ninth sentences belong to the second block.

On the other hand, when the i-th and j-th keyword sets (where i < j) share no common word and the (i+1)-th through (j-1)-th keyword sets contain no words, the (i+1)-th through (j-1)-th sentences are not assigned to any block. In this case, a division is made after the i-th sentence, each of the (i+1)-th through (j-1)-th sentences is further split off as a block of its own (called an empty block in the present invention), and the result is stored in the divided block storage unit 50.
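
The whole division procedure, including the skipping of keyword sets that contain no words and the creation of empty blocks, can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the embodiment: the function name `segment` is hypothetical, the contents of `example_sets` are invented (the actual contents of FIG. 8 are known here only through the common words mentioned in the text), and the handling of wordless sentences before the first or after the last keyword set that contains words is an assumption, since the description does not cover that case.

```python
def segment(keyword_sets, c_t=1):
    """Divide sentences into blocks of 1-based sentence numbers.

    keyword_sets[k] is the keyword set of sentence k+1; an empty set
    means the sentence has no keyword set.
    """
    blocks = []
    current = []   # block under construction
    pending = []   # wordless sentences seen since the last compared sentence
    prev = None    # keyword set of the last sentence that had words
    for number, keywords in enumerate(keyword_sets, start=1):
        if not keywords:
            pending.append(number)  # skip: only sets with words are compared
            continue
        if prev is None:
            # Assumption: leading wordless sentences become empty blocks.
            blocks.extend([n] for n in pending)
            current = [number]
        elif len(prev & keywords) >= c_t:
            # Enough common words: skipped sentences join the same block.
            current.extend(pending)
            current.append(number)
        else:
            # Divide after sentence i; skipped sentences become empty blocks.
            blocks.append(current)
            blocks.extend([n] for n in pending)
            current = [number]
        pending = []
        prev = keywords
    if current:
        blocks.append(current)
    # Assumption: trailing wordless sentences become empty blocks.
    blocks.extend([n] for n in pending)
    return blocks


# Hypothetical keyword sets for sentences 21 to 29 (not the data of FIG. 8).
example_sets = [
    {"travel", "hotel"}, set(), set(),
    {"travel", "hot spring"}, {"travel", "hot spring", "inn"},
    {"stock", "market"}, {"market", "price"}, {"price"}, {"price", "yen"},
]
```

With `c_t=1`, `segment(example_sets)` groups sentences 1 to 5 into one block and sentences 6 to 9 into another, matching the division described for FIG. 9. For the input `[{"a"}, set(), {"b"}]`, the middle sentence becomes an empty block of its own.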

Finally, the sentence numbers of each block stored in the divided block storage unit 50 are output to the output unit 17 via the control unit 15. One possible method, as in FIG. 9, is to output the number of each block paired with the numbers of the sentences belonging to it.

In the above embodiment, the computation time of the text segmentation and the fineness of the division can be adjusted freely by providing a means for specifying the parameters S_T, P_T, C_T, T, and P from outside. As for the method of comparing keyword sets, alternatives are conceivable, such as extracting common words after assigning weights to words that are frequently used in each field. In addition, the output unit 17 can refer to the search word storage unit 30 and the related word storage unit 40 and, using the information that a keyword set contains no words as in FIG. 8, produce output that excludes the sentences assigned to empty blocks.
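
The two variations mentioned above can be sketched as follows. The specification does not define how the weights are chosen or applied, so `weighted_overlap`, its `weights` dictionary, and the default weight of 1.0 are purely illustrative assumptions; `drop_empty_blocks` is likewise a hypothetical helper that treats a block as empty when none of its sentences has a keyword set containing words.

```python
def weighted_overlap(keywords_a, keywords_b, weights, default=1.0):
    """Sum the weights of the words shared by two keyword sets.

    weights maps a word to its weight; words not listed get `default`.
    The score can be compared with a threshold in place of a plain count.
    """
    return sum(weights.get(word, default) for word in keywords_a & keywords_b)


def drop_empty_blocks(blocks, keyword_sets):
    """Keep only blocks in which at least one sentence has keywords.

    blocks holds lists of 1-based sentence numbers, and
    keyword_sets[k] is the keyword set of sentence k+1.
    """
    return [block for block in blocks
            if any(keyword_sets[n - 1] for n in block)]
```

A sentence without keywords that lies inside a larger block is kept, because only blocks made up entirely of such sentences count as empty blocks.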

The text segmentation processing described above can be implemented by a computer and a software program, and the program can be provided either recorded on a computer-readable recording medium or over a network.

FIG. 1 is a diagram showing an outline of the processing procedure of the present invention.
FIG. 2 is a block diagram of the text segmentation processing apparatus according to the embodiment of the present invention.
FIG. 3 is a diagram showing an example of a text in the embodiment of the present invention.
FIG. 4 is a diagram showing an example of sentences stored in the decomposed sentence storage unit in the embodiment of the present invention.
FIG. 5 is a diagram showing an example of the unnecessary word list in the embodiment of the present invention.
FIG. 6 is a diagram showing an example of search words stored in the search word storage unit in the embodiment of the present invention.
FIG. 7 is a diagram showing an example of related words stored in the related word storage unit in the embodiment of the present invention.
FIG. 8 is a diagram showing an example of keyword sets created by the text division processing unit in the embodiment of the present invention.
FIG. 9 is a diagram showing an example of sentence numbers stored in the divided block storage unit in the embodiment of the present invention.

Explanation of symbols

S1: Process of inputting text
S2: Process of dividing the text into sentences
S3: Process of extracting words to serve as search words for each sentence
S4: Process of acquiring related words using the search words
S5: Process of dividing the text using keyword sets (combinations of search words and related words)
S6: Process of outputting the division result
1: Computer
2: Display unit
3: Network
4: Web
5: Plurality of texts described in a structured language
6: Input text
11: Text decomposition processing unit
12: Search word extraction processing unit
13: Related word acquisition processing unit
14: Text division processing unit
15: Control unit
16: Input unit
17: Output unit
20: Decomposed sentence storage unit
21 to 29: First to ninth sentences stored in the decomposed sentence storage unit
30: Search word storage unit
31 to 39: Words stored in the search word storage unit corresponding to the first to ninth sentences registered in the decomposed sentence storage unit
40: Related word storage unit
41 to 49: Words stored in the related word storage unit corresponding to the first to ninth words registered in the search word storage unit
50: Divided block storage unit
51, 52: Sentence numbers belonging to the first and second blocks stored in the divided block storage unit
60: Unnecessary word list
71 to 79: Keyword sets created by combining the first to ninth words registered in the search word storage unit with the first to ninth words stored in the related word storage unit

Claims

A text segmentation method in which a computer divides text, which is a set of digitized sentences, into blocks each consisting of one or more sentences,
wherein the computer executes:
a process of inputting the text to be divided;
a process of dividing the input text into sentences;
a process of performing morphological analysis on each divided sentence, extracting search words for each sentence, and storing them in search word storage means;
a process of performing a web search using the search words extracted for each sentence, acquiring related words from the obtained search results, and storing them in related word storage means;
a process of creating, by referring to the search word storage means and the related word storage means, a keyword set that combines the search words and the related words of each sentence, comparing the keyword sets of adjacent sentences, and generating blocks by deciding, according to the number of common keywords, whether to combine the sentences of the compared keyword sets into one block; and
a process of outputting the generated blocks as the result of text division.

The text segmentation method according to claim 1,
wherein, in acquiring related words from the search results obtained by the web search, nouns are extracted from a plurality of texts described in a structured language that constitute the search results, the appearance frequency of each extracted word is calculated, and a predetermined number of words are selected as related words in descending order of appearance frequency.

The text segmentation method according to claim 1 or 2,
wherein, in extracting search words from each sentence or in acquiring related words from the search results obtained by the web search, the nouns contained in each sentence or in the search results, excluding words registered in advance in an unnecessary word list, are used as the search words or the related words.

A text segmentation processing apparatus that divides text, which is a set of digitized sentences, into blocks each consisting of one or more sentences, the apparatus comprising:
input means for inputting the text to be divided;
text decomposition processing means for dividing the input text into sentences;
search word extraction processing means for performing morphological analysis on each divided sentence and extracting search words for each sentence;
search word storage means for storing the extracted search words;
related word acquisition processing means for performing a web search using the search words extracted for each sentence and acquiring related words from the obtained search results;
related word storage means for storing the acquired related words;
text division processing means for creating, by referring to the search word storage means and the related word storage means, a keyword set that combines the search words and the related words of each sentence, comparing the keyword sets of adjacent sentences, and generating blocks by deciding, according to the number of common keywords, whether to combine the sentences of the compared keyword sets into one block; and
output means for outputting the generated blocks as the result of text division.

A text segmentation processing program for causing a computer to execute the text segmentation method according to claim 1, claim 2 or claim 3.

A computer-readable recording medium on which a text segmentation processing program for causing a computer to execute the text segmentation method according to claim 1, 2 or 3 is recorded.