CN104346411A

CN104346411A - Method and equipment for clustering multiple manuscripts

Info

Publication number: CN104346411A
Application number: CN201310346857.4A
Authority: CN
Inventors: 王露
Original assignee: Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd
Current assignee: New Founder Holdings Development Co ltd; Beijing Founder Electronics Co Ltd
Priority date: 2013-08-09
Filing date: 2013-08-09
Publication date: 2015-02-11
Anticipated expiration: 2033-08-09
Also published as: CN104346411B

Abstract

The invention relates to a method and equipment for clustering multiple manuscripts. The method comprises the steps of establishing a manuscript classification space according to the categories of a news classification method; extracting keywords in each manuscript; establishing manuscript coordinates according to the frequency of the extracted keywords so as to map each manuscript into a point in the manuscript classification space; calculating the distance between the manuscripts, and determining the manuscripts between which the distance is smaller than a distance threshold value as a class. According to the method, a great number of news manuscripts can be automatically clustered, and therefore manpower is saved.

Description

Method and apparatus for clustering multiple manuscripts

Technical field

The present application relates to a method and apparatus for clustering multiple manuscripts.

Background technology

In modern society, the amount of information grows geometrically. Large numbers of records, research documents, and manuscripts are produced every day in fields such as news, history, and science and technology, and these manuscripts sometimes need to be classified.

For example, newspaper offices and news websites receive a large number of news manuscripts every day and may need to classify them in order to report more accurately. Because news manuscripts are highly time-sensitive, classifying them as quickly as possible is very important. If all manuscripts are classified manually, the workload is heavy and the timeliness of the news is difficult to guarantee. If a large number of news manuscripts are first divided into several classes by automatic clustering and then adjusted manually, a great deal of labor can be saved.

Therefore, there is a need for a method and apparatus for automatically clustering multiple manuscripts.

Summary of the invention

To solve the above problem, the present application provides a method and apparatus for clustering multiple manuscripts, so that a large number of news manuscripts can be clustered automatically, saving manpower.

According to a first aspect of the application, a method for clustering multiple manuscripts is provided, comprising:

establishing a manuscript classification space according to the categories of a news classification method;

extracting the keywords in each manuscript;

establishing manuscript coordinates according to the frequencies of the extracted keywords, thereby mapping each manuscript to a point in the manuscript classification space;

calculating the distances in the manuscript classification space between the point mapped from a first manuscript among the multiple manuscripts and the points mapped from the other manuscripts, and determining whether the calculated distances include a value smaller than a predetermined first distance threshold; and

if so, determining that the manuscript corresponding to the value smaller than the first distance threshold belongs to the same first class as the first manuscript.

According to a second aspect of the application, an apparatus for clustering multiple manuscripts is provided, comprising:

a building module, configured to establish a manuscript classification space according to the categories of a news classification method;

an extraction module, configured to extract the keywords in each manuscript;

a mapping module, configured to establish manuscript coordinates according to the frequencies of the extracted keywords, thereby mapping each manuscript to a point in the manuscript classification space;

a computing module, configured to calculate the distances in the manuscript classification space between the point mapped from a first manuscript among the multiple manuscripts and the points mapped from the other manuscripts, and to determine whether the calculated distances include a value smaller than a predetermined first distance threshold; and

a clustering module, configured to determine that the manuscript corresponding to the value smaller than the first distance threshold belongs to the same first class as the first manuscript.

Brief Description Of Drawings

Fig. 1 is a flowchart of a method for clustering multiple manuscripts according to an embodiment of the application; and

Fig. 2 is a schematic diagram of an apparatus for clustering multiple manuscripts according to an embodiment of the application.

Embodiment

The method and apparatus for clustering multiple manuscripts according to embodiments of the application are described in detail below with reference to the embodiments and the accompanying drawings.

In this application, "news classification method" refers to a method of classifying manuscripts according to the kind of news. For example, manuscripts can be classified into finance, sports, science and technology, politics, entertainment, and so on, and the sports category can be further divided into football, basketball, tennis, gymnastics, and so on.

In this application, "manuscript classification space" refers to the space established by using the categories of the news classification method as dimensions.

In this application, an entry in the "irrelevant-word dictionary" refers to a term that appears frequently in manuscripts but is irrelevant to classification under the news classification method, for example "we" or "but".

In this application, "clustering" refers to grouping manuscripts that have a certain correlation into the same class, for example dividing multiple manuscripts into a politics class, an entertainment class, and so on.

First, the method for clustering multiple manuscripts according to an embodiment of the application is described with reference to Fig. 1.

In step 101, a manuscript classification space is established according to the categories of the news classification method.

In an exemplary embodiment, the news classification method may comprise multiple categories such as finance, sports, science and technology, politics, and entertainment. In some embodiments, each of these categories can be used as a dimension of the manuscript classification space. For example, if the news classification method comprises N categories, the manuscript classification space can be an N-dimensional space, and the coordinates of a point in the manuscript classification space can be expressed as (T1, W1, T2, W2, ..., Tn, Wn), where Ti is the i-th dimension of the manuscript classification space, corresponding to the i-th category of the news classification method, and Wi is the weight of Ti.
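The coordinate form described above can be illustrated with a minimal Python sketch. The category names come from the examples in the text; the helper names `empty_point` and `as_pairs` are hypothetical and are not part of the application.

```python
# Hypothetical category list; the text names these only as examples.
CATEGORIES = ["finance", "sports", "technology", "politics", "entertainment"]


def empty_point(categories):
    """Return the origin of the space: one weight W_i (initially 0) per dimension T_i."""
    return {category: 0 for category in categories}


def as_pairs(point, categories):
    """Flatten a point into the (T1, W1, ..., Tn, Wn) form used in the text."""
    flat = []
    for category in categories:
        flat.extend([category, point[category]])
    return tuple(flat)


point = empty_point(CATEGORIES)
point["sports"] = 14   # weight of the sports dimension
point["finance"] = 6   # weight of the finance dimension
print(as_pairs(point, CATEGORIES))
```

Storing the weights in a mapping keyed by category keeps each Wi attached to its Ti; a plain list of weights in a fixed category order is an equally valid representation.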

In step 102, the keywords in each manuscript are extracted.

In some embodiments, step 102 may comprise: calculating the frequency with which each term appears in the manuscript, and taking the terms whose frequency is higher than a preset frequency threshold (for example, five times) as candidate keywords; and removing irrelevant candidate keywords (for example, common words such as "we" and "but"), thereby obtaining the keywords.

In an exemplary embodiment, an irrelevant-word dictionary can be preset, recording words that appear frequently in manuscripts but cannot be used for classification under the news classification method, for example "we" and "but". For example, after the candidate keywords "we" and "football" are obtained, the term "we" is found in the preset irrelevant-word dictionary and is therefore removed from the candidate keywords, whereas the term "football" is not found in the preset irrelevant-word dictionary and is therefore determined to be a keyword.
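The two-stage extraction in step 102 can be sketched in Python as follows. This is a minimal sketch that assumes the manuscript has already been split into terms (word segmentation is not described in the text); the function name `extract_keywords` and the contents of `IRRELEVANT_WORDS` are illustrative assumptions.

```python
from collections import Counter

IRRELEVANT_WORDS = {"we", "but"}  # hypothetical preset irrelevant-word dictionary
FREQUENCY_THRESHOLD = 5           # example value given in the text


def extract_keywords(terms, threshold=FREQUENCY_THRESHOLD, irrelevant=IRRELEVANT_WORDS):
    """Return {keyword: frequency} for one manuscript given as a list of terms."""
    counts = Counter(terms)
    # Stage 1: terms appearing more often than the threshold become candidates.
    candidates = {term: n for term, n in counts.items() if n > threshold}
    # Stage 2: candidates found in the irrelevant-word dictionary are removed.
    return {term: n for term, n in candidates.items() if term not in irrelevant}


terms = ["we"] * 9 + ["football"] * 8 + ["goal"] * 2
keywords = extract_keywords(terms)
print(keywords)
```

Here "we" passes the frequency test but is dropped by the dictionary lookup, "goal" never becomes a candidate, and only "football" survives as a keyword.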

In step 103, manuscript coordinates are established according to the frequencies of the extracted keywords, thereby mapping each manuscript to a point in the manuscript classification space.

In an exemplary embodiment, establishing manuscript coordinates according to the frequencies of the extracted keywords may comprise: attributing each keyword extracted from the manuscript to a category of the news classification method, that is, to the corresponding dimension Ti of the manuscript classification space; and calculating the sum of the frequencies of all keywords in the manuscript that belong to that category and using this sum as the value Wi of the manuscript in that dimension, thereby mapping each manuscript to a point in the manuscript classification space.

For example, suppose "football" appears 8 times, "tennis" 6 times, and "dollar" 6 times in a manuscript. The news classification method attributes "football" and "tennis" to the sports category and "dollar" to the economy category, so in the manuscript classification space the manuscript's value in the sports dimension is 8 + 6 = 14 and its value in the economy dimension is 6. In this way, each manuscript can be mapped to a point in the manuscript classification space.
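The worked example above can be reproduced with a short Python sketch. The keyword-to-category table `KEYWORD_CATEGORY` is a hypothetical stand-in for whatever lookup the news classification method provides; the text does not say how that attribution is implemented.

```python
# Hypothetical dimensions and keyword attribution table.
CATEGORIES = ["economy", "sports", "technology", "politics", "entertainment"]
KEYWORD_CATEGORY = {"football": "sports", "tennis": "sports", "dollar": "economy"}


def map_to_point(keyword_counts, keyword_category=KEYWORD_CATEGORY, categories=CATEGORIES):
    """Sum keyword frequencies per category and return the weights in category order."""
    weights = dict.fromkeys(categories, 0)
    for keyword, count in keyword_counts.items():
        category = keyword_category.get(keyword)
        if category in weights:  # keywords with no known category are ignored
            weights[category] += count
    return [weights[category] for category in categories]


point = map_to_point({"football": 8, "tennis": 6, "dollar": 6})
print(point)  # economy = 6, sports = 8 + 6 = 14, all other dimensions 0
```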

In an exemplary embodiment, when the frequency values Wi of a manuscript in every dimension are all less than the frequency threshold, the manuscript is placed in "unclassified".
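The "unclassified" rule can be sketched as a single check over a manuscript's weights. This is a minimal sketch; the threshold value of 5 is an assumption borrowed from the keyword example, since the text does not state which threshold value applies here.

```python
def is_unclassified(point, frequency_threshold=5):
    """True when the weight in every dimension is below the threshold."""
    return all(weight < frequency_threshold for weight in point)


print(is_unclassified([2, 4, 0]))   # every weight is below 5
print(is_unclassified([6, 14, 0]))  # at least one weight reaches the threshold
```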

In step 104, the distances in the manuscript classification space between the point mapped from a first manuscript among the multiple manuscripts and the points mapped from the other manuscripts are calculated, and it is determined whether the calculated distances include a value smaller than a predetermined first distance threshold.

In an exemplary embodiment, the first manuscript can be determined according to the time at which each manuscript was created; for example, the earliest manuscript can be used as the first manuscript.

In another exemplary embodiment, a manuscript is selected at random from the multiple manuscripts as the first manuscript.

In yet another exemplary embodiment, a staff member arbitrarily selects a manuscript from the multiple manuscripts as the first manuscript.
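Two of the selection strategies above (earliest creation time and random choice) can be sketched in Python. The record layout with `id` and `created` fields is a hypothetical assumption; selection by a staff member is a manual step and is not modeled.

```python
import random
from datetime import datetime


def pick_earliest(manuscripts):
    """Select the manuscript with the earliest creation time."""
    return min(manuscripts, key=lambda m: m["created"])


def pick_random(manuscripts, rng=random):
    """Select a manuscript at random; pass a seeded random.Random for repeatability."""
    return rng.choice(manuscripts)


manuscripts = [
    {"id": "a", "created": datetime(2013, 8, 9, 10, 30)},
    {"id": "b", "created": datetime(2013, 8, 9, 8, 15)},
    {"id": "c", "created": datetime(2013, 8, 9, 9, 0)},
]
first = pick_earliest(manuscripts)
print(first["id"])
```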

In step 105, if it is determined that the calculated distances include a value smaller than the predetermined first distance threshold, the manuscript corresponding to that value is determined to belong to the same first class as the first manuscript.
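Steps 104 and 105 can be sketched together in Python. The text does not specify a distance measure, so Euclidean distance (`math.dist`, available from Python 3.8) is used here purely as an assumption; the function name `first_class` and the sample points are illustrative.

```python
import math


def first_class(points, first_index, threshold):
    """Return the indices of the first manuscript and of every manuscript
    whose point lies closer to it than the distance threshold."""
    first = points[first_index]
    members = [first_index]
    for index, point in enumerate(points):
        if index != first_index and math.dist(first, point) < threshold:
            members.append(index)
    return members


# Three manuscripts in a hypothetical three-dimensional classification space.
points = [[6, 14, 0], [5, 12, 0], [0, 1, 20]]
members = first_class(points, first_index=0, threshold=5.0)
print(members)
```

The second manuscript lies about 2.24 from the first and joins its class, while the third lies far beyond the threshold and is left for a later round.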

In an exemplary embodiment, the size of the first distance threshold can be set by an operator according to actual needs. For example, if the first distance threshold is set relatively small, the manuscripts within a class are relatively strongly correlated; conversely, if the first distance threshold is set relatively large, the manuscripts within a class are relatively weakly correlated.

In some embodiments, the method for clustering multiple manuscripts may further comprise the following steps:

determining whether the multiple manuscripts include manuscripts that have not been determined to belong to the first class and, if so, selecting a second manuscript from them and calculating the distances in the manuscript classification space between the point mapped from the second manuscript and the points mapped from the other manuscripts that have not been determined to belong to the first class;

determining whether the calculated distances include a value smaller than a predetermined second distance threshold; and

if so, determining that the manuscript corresponding to the value smaller than the second distance threshold belongs to the same second class as the second manuscript.

In some embodiments, all manuscripts can be classified by repeating similar steps, that is: determining whether the multiple manuscripts include manuscripts that have not been determined to belong to the first class or the second class and, if so, selecting a third manuscript from them and calculating the distances in the manuscript classification space between the point mapped from the third manuscript and the points mapped from the other manuscripts that have not been determined to belong to the first class or the second class; determining whether the calculated distances include a value smaller than a predetermined third distance threshold; and, if so, determining that the manuscript corresponding to the value smaller than the third distance threshold belongs to the same third class as the third manuscript; and so on until all manuscripts are classified.
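The repeated rounds described above can be sketched as a loop in Python. Several points are assumptions rather than statements of the application: Euclidean distance is used because no measure is specified, one threshold is reused for every round although the text allows a separate first, second, and third threshold, the next reference manuscript is simply the first remaining one, and a manuscript left with no neighbors forms a class of its own.

```python
import math


def cluster(points, threshold):
    """Group point indices round by round until every manuscript is assigned."""
    unassigned = list(range(len(points)))  # e.g. ordered by creation time
    classes = []
    while unassigned:
        seed = unassigned.pop(0)           # reference manuscript for this round
        members, remaining = [seed], []
        for index in unassigned:
            if math.dist(points[seed], points[index]) < threshold:
                members.append(index)      # close enough: same class as the seed
            else:
                remaining.append(index)    # left for a later round
        classes.append(members)
        unassigned = remaining
    return classes


points = [[6, 14, 0], [5, 12, 0], [0, 1, 20], [1, 0, 18], [30, 0, 0]]
classes = cluster(points, threshold=5.0)
print(classes)
```

Because each round compares only against the round's reference manuscript, the result depends on the order in which reference manuscripts are chosen, which is consistent with the text leaving that choice to time order, random selection, or a staff member.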

With the above method for clustering multiple manuscripts according to embodiments of the application, a large number of manuscripts can be clustered automatically, thereby saving manpower.

Next, the apparatus for clustering multiple manuscripts according to an embodiment of the application is described with reference to Fig. 2.

As shown in the figure, the apparatus may comprise the following components.

A building module 201, which can establish a manuscript classification space according to the categories of the news classification method. In some embodiments, the building module 201 can establish the manuscript classification space according to the categories into which terms have been classified historically.

An extraction module 202, which can extract the keywords in each manuscript. In some embodiments, the extraction module 202 may comprise a statistics component and a deletion component. The statistics component can count the frequency with which each term appears in the manuscript and take the terms whose frequency is higher than a frequency threshold as candidate keywords. The deletion component can delete irrelevant terms from the candidate keywords, thereby obtaining the keywords. In an exemplary embodiment, the deletion component can judge whether a candidate keyword is present in the preset irrelevant-word dictionary and, if so, delete that candidate keyword; the remaining candidate keywords are used as the keywords.

A mapping module 203, which can establish manuscript coordinates according to the frequencies of the extracted keywords, thereby mapping each manuscript to a point in the manuscript classification space. In some embodiments, the mapping module 203 may comprise an attribution component and a summation component. The attribution component can attribute each extracted keyword to a category of the news classification method, that is, to a dimension of the manuscript classification space. The summation component can calculate the sum of the frequencies with which all keywords of a category (that is, of a given dimension) appear in the manuscript, and use this sum as the value of the manuscript in the dimension of the manuscript classification space corresponding to that category.

A computing module 204, configured to calculate the distances in the manuscript classification space between the point mapped from a first manuscript among the multiple manuscripts and the points mapped from the other manuscripts, and to determine whether the calculated distances include a value smaller than a predetermined first distance threshold.

A clustering module 205, configured to determine that the manuscript corresponding to the value smaller than the first distance threshold belongs to the same first class as the first manuscript. In some embodiments, the clustering module 205 can also determine whether the multiple manuscripts include manuscripts that have not been determined to belong to the first class and, if so, select a second manuscript from them and calculate the distances in the manuscript classification space between the point mapped from the second manuscript and the points mapped from the other manuscripts that have not been determined to belong to the first class; determine whether the calculated distances include a value smaller than a predetermined second distance threshold; and, if so, determine that the manuscript corresponding to the value smaller than the second distance threshold belongs to the same second class as the second manuscript; and so on until all manuscripts are classified.
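One way to picture how modules 201 to 205 fit together is a single Python class with one method per module. This is a sketch under stated assumptions, not the apparatus itself: the class name `ClusteringDevice`, the keyword table, Euclidean distance, and the single shared distance threshold are all illustrative choices that the text does not prescribe.

```python
import math
from collections import Counter


class ClusteringDevice:
    def __init__(self, categories, keyword_category, irrelevant,
                 frequency_threshold, distance_threshold):
        self.categories = list(categories)  # building module 201: one dimension per category
        self.keyword_category = keyword_category
        self.irrelevant = irrelevant
        self.frequency_threshold = frequency_threshold
        self.distance_threshold = distance_threshold

    def extract(self, terms):
        """Extraction module 202: frequency filter, then irrelevant-word removal."""
        counts = Counter(terms)
        return {t: n for t, n in counts.items()
                if n > self.frequency_threshold and t not in self.irrelevant}

    def map(self, keywords):
        """Mapping module 203: attribute keywords to categories and sum frequencies."""
        weights = dict.fromkeys(self.categories, 0)
        for keyword, count in keywords.items():
            category = self.keyword_category.get(keyword)
            if category in weights:
                weights[category] += count
        return [weights[c] for c in self.categories]

    def distances(self, points, seed, others):
        """Computing module 204: distance from the seed point to each other point."""
        return {i: math.dist(points[seed], points[i]) for i in others}

    def cluster(self, documents):
        """Clustering module 205: repeat threshold grouping until all are assigned."""
        points = [self.map(self.extract(terms)) for terms in documents]
        unassigned, classes = list(range(len(points))), []
        while unassigned:
            seed, rest = unassigned[0], unassigned[1:]
            found = self.distances(points, seed, rest)
            members = [seed] + [i for i in rest if found[i] < self.distance_threshold]
            classes.append(members)
            unassigned = [i for i in rest if i not in members]
        return classes


device = ClusteringDevice(
    categories=["economy", "sports"],
    keyword_category={"football": "sports", "tennis": "sports",
                      "dollar": "economy", "stock": "economy"},
    irrelevant={"we", "but"},
    frequency_threshold=5,
    distance_threshold=10.0,
)
documents = [
    ["football"] * 8 + ["tennis"] * 6 + ["dollar"] * 6,
    ["football"] * 7 + ["we"] * 9,
    ["dollar"] * 10 + ["stock"] * 7,
]
result = device.cluster(documents)
print(result)
```

The first two documents map to (6, 14) and (0, 7), about 9.2 apart, so they share a class under a threshold of 10; the third maps to (17, 0) and forms its own class.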

It should be understood that the above embodiments are merely exemplary and are not intended to limit the scope of the application. A person skilled in the art can make various modifications and improvements without departing from the spirit and essence of the application, and such modifications and improvements should also be regarded as falling within the scope of protection of the application.

Claims

1. A method of clustering a plurality of manuscripts, comprising:

establishing a manuscript classification space according to the categories of a news classification method;

extracting keywords in each manuscript;

establishing manuscript coordinates according to the frequencies of the extracted keywords, thereby mapping each manuscript to a point in the manuscript classification space;

calculating the distances in the manuscript classification space between the point mapped from a first manuscript among the plurality of manuscripts and the points mapped from the other manuscripts, and determining whether the calculated distances include a value smaller than a predetermined first distance threshold; and

if so, determining that the manuscript corresponding to the value smaller than the first distance threshold belongs to the same first class as the first manuscript.

2. The method according to claim 1, wherein the step of extracting keywords in each manuscript comprises:

calculating the frequency with which each term appears in the manuscript, and using the terms whose frequency is higher than a frequency threshold as candidate keywords; and

removing irrelevant terms from the candidate keywords to obtain the keywords.

3. The method according to claim 2, wherein the step of removing irrelevant terms from the candidate keywords to obtain the keywords comprises:

judging whether a candidate keyword is present in a preset irrelevant-word dictionary; and

if so, removing the candidate keyword; if not, using the candidate keyword as a keyword.

4. The method according to claim 1, wherein the step of establishing the manuscript classification space according to the categories of the news classification method comprises:

using each category of the news classification method as a dimension of the manuscript classification space.

5. The method according to claim 1, wherein the step of establishing manuscript coordinates according to the frequencies of the extracted keywords comprises:

attributing each extracted keyword to a category of the news classification method; and

calculating the sum of the frequencies with which all keywords of the category appear in the manuscript, and using the sum as the value of the manuscript in the dimension of the manuscript classification space corresponding to the category.

6. The method according to claim 1, wherein the first manuscript is determined according to the time when the manuscript was formed, or selected randomly, or selected by a staff member.

7. The method of claim 6, further comprising:

determining whether the plurality of manuscripts include manuscripts that have not been determined to belong to the first class and, if so, selecting a second manuscript from them and calculating the distances in the manuscript classification space between the point mapped from the second manuscript and the points mapped from the other manuscripts that have not been determined to belong to the first class;

determining whether the calculated distances include a value smaller than a predetermined second distance threshold; and

if so, determining that the manuscript corresponding to the value smaller than the second distance threshold belongs to the same second class as the second manuscript.

8. An apparatus for clustering a plurality of manuscripts, comprising:

a building module configured to establish a manuscript classification space according to the categories of a news classification method;

an extraction module configured to extract keywords in each manuscript;

a mapping module configured to establish manuscript coordinates according to the frequencies of the extracted keywords, thereby mapping each manuscript to a point in the manuscript classification space;

a computing module configured to calculate the distances in the manuscript classification space between the point mapped from a first manuscript among the plurality of manuscripts and the points mapped from the other manuscripts, and to determine whether the calculated distances include a value smaller than a predetermined first distance threshold; and

a clustering module configured to determine that the manuscript corresponding to the value smaller than the first distance threshold belongs to the same first class as the first manuscript.

9. The apparatus of claim 8, wherein the extraction module comprises:

a statistics component configured to count the frequency with which each term appears in the manuscript, and to use the terms whose frequency is higher than a frequency threshold as candidate keywords; and

a deletion component configured to delete irrelevant terms from the candidate keywords, thereby obtaining the keywords.

10. The apparatus of claim 8, wherein the mapping module comprises:

an attribution component configured to attribute each extracted keyword to a category of the news classification method; and

a summation component configured to calculate the sum of the frequencies with which all keywords of the category appear in the manuscript, and to use the sum as the value of the manuscript in the dimension of the manuscript classification space corresponding to the category.