Summary of the invention
To solve this problem, this application provides a method and an apparatus for clustering multiple contributions, so that a large number of news contributions can be clustered automatically, saving manpower.
According to a first aspect of the application, a method for clustering multiple contributions is provided, comprising:
Establishing a contribution classifying space according to the classes of a news category method;
Extracting the keywords in each contribution;
Establishing contribution coordinates according to the frequencies of the extracted keywords, thereby mapping each contribution to a point in the contribution classifying space;
Calculating the distances, in the contribution classifying space, between the point to which a first contribution among the multiple contributions is mapped and each of the points to which the other contributions are mapped, and determining whether the calculated distances include a value less than a predetermined first distance threshold; and
If so, determining that each contribution corresponding to a value less than the first distance threshold belongs to the same first class as the first contribution.
According to a second aspect of the application, an apparatus for clustering multiple contributions is provided, comprising:
An establishing module, configured to establish a contribution classifying space according to the classes of a news category method;
An extraction module, configured to extract the keywords in each contribution;
A mapping module, configured to establish contribution coordinates according to the frequencies of the extracted keywords, thereby mapping each contribution to a point in the contribution classifying space;
A computing module, configured to calculate the distances, in the contribution classifying space, between the point to which a first contribution among the multiple contributions is mapped and each of the points to which the other contributions are mapped, and to determine whether the calculated distances include a value less than a predetermined first distance threshold; and
A cluster module, configured to determine that each contribution corresponding to a value less than the first distance threshold belongs to the same first class as the first contribution.
Embodiment
The method and apparatus for clustering multiple contributions according to embodiments of the application are described in detail below with reference to the embodiments and the accompanying drawings.
In this application, "news category method" refers to a method for classifying contributions according to the kind of news. For example, contributions can be classified as finance and economics, sports, science and technology, politics, entertainment, and so on; the sports class can be further divided into football, basketball, tennis, gymnastics, and so on.
In this application, "contribution classifying space" refers to a space whose dimensions are the classes defined by the news category method.
In this application, an entry in the "uncorrelated dictionary" is an entry that commonly appears in contributions but is irrelevant to the classification performed according to the news category method, for example "we", "but", and so on.
In this application, "cluster" refers to grouping contributions that have a certain correlation into the same class, for example dividing multiple contributions into a politics class, an entertainment class, and so on.
First, the method for clustering multiple contributions according to an embodiment of the application is described with reference to Fig. 1.
In step 101, a contribution classifying space is established according to the classes of the news category method.
In an exemplary embodiment, the news category method may comprise multiple classes such as finance and economics, sports, science and technology, politics, and entertainment. In some embodiments, each of these classes may serve as one dimension of the contribution classifying space. For example, if the news category method comprises N classes, the contribution classifying space may be an N-dimensional space, and the coordinates of a point in the contribution classifying space may be expressed as (T1, W1, T2, W2, ..., TN, WN), where Ti is the i-th dimension of the contribution classifying space, corresponding to the i-th class of the news category method, and Wi is the weight of Ti.
In step 102, the keywords in each contribution are extracted.
In some embodiments, step 102 may comprise: calculating the frequency with which each entry occurs in the contribution, and taking the entries whose frequency is higher than a preset frequency threshold (for example, five times) as candidate keywords; and removing irrelevant candidate keywords (for example, common words such as "we" and "but"), thereby obtaining the keywords.
In an exemplary embodiment, an uncorrelated dictionary may be preset, which records words that occur frequently in contributions but cannot be classified according to the news category method, for example "we", "but", and so on. For example, after the candidate keywords "we" and "football" are obtained, the entry "we" is found to be present in the preset uncorrelated dictionary and is therefore removed from the candidate keywords, whereas the entry "football" is found not to be present in the preset uncorrelated dictionary and is therefore determined to be a keyword.
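The two sub-steps of step 102 can be sketched as follows. This is only a minimal illustration: the function name `extract_keywords`, the contents of `UNCORRELATED_DICTIONARY`, and the assumption that a contribution has already been split into entries are all hypothetical and are not specified by the application.

```python
from collections import Counter

# Hypothetical stand-in for the preset uncorrelated dictionary.
UNCORRELATED_DICTIONARY = {"we", "but"}


def extract_keywords(entries, frequency_threshold=5,
                     uncorrelated=UNCORRELATED_DICTIONARY):
    """Return {keyword: frequency} for one contribution.

    `entries` is the contribution already split into entries; how that
    splitting is done is outside the scope of this sketch.
    """
    counts = Counter(entries)
    # Candidate keywords: entries occurring more often than the threshold.
    candidates = {e: n for e, n in counts.items() if n > frequency_threshold}
    # Remove candidates that are present in the uncorrelated dictionary.
    return {e: n for e, n in candidates.items() if e not in uncorrelated}


# "we" is frequent but uncorrelated; "goal" is below the threshold.
entries = ["we"] * 7 + ["football"] * 8 + ["goal"] * 2
keywords = extract_keywords(entries)
```

Here only "football" survives both the frequency filter and the dictionary filter.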
In step 103, contribution coordinates are established according to the frequencies of the extracted keywords, thereby mapping each contribution to a point in the contribution classifying space.
In an exemplary embodiment, establishing the contribution coordinates according to the frequencies of the extracted keywords may comprise: assigning each keyword extracted from the contribution to the class into which the news category method places that entry, that is, to the corresponding dimension Ti of the contribution classifying space; and calculating the sum of the frequencies of all keywords in the contribution that belong to this class, and using this frequency sum as the value Wi of the contribution in this dimension, thereby mapping each contribution to a point in the contribution classifying space.
For example, suppose that "football" occurs 8 times, "tennis" 6 times, and "dollar" 6 times in a contribution. The news category method assigns "football" and "tennis" to the sports class and "dollar" to the economics class, so in the contribution classifying space the value of the contribution in the sports dimension is 8+6=14 and its value in the economics dimension is 6. In this way, each contribution can be mapped to a point in the contribution classifying space.
In an exemplary embodiment, when the frequency value Wi of the contribution in every dimension is less than a frequency threshold, the contribution is placed in "unclassified".
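The mapping of step 103 and the "unclassified" check can be sketched as follows. The class list, the `ENTRY_TO_CLASS` lookup, the function names, and the threshold value of 5 are hypothetical choices made for illustration; the application does not prescribe how entries are looked up or what the threshold is.

```python
# Hypothetical classes of a news category method; each class is one dimension.
CLASSES = ["economics", "sports", "technology", "politics", "entertainment"]

# Hypothetical lookup that assigns an entry to a class.
ENTRY_TO_CLASS = {"football": "sports", "tennis": "sports", "dollar": "economics"}


def map_contribution(keyword_frequencies, classes=CLASSES,
                     entry_to_class=ENTRY_TO_CLASS):
    """Map {keyword: frequency} to a point (W1, ..., WN), one value per class."""
    totals = {c: 0 for c in classes}
    for keyword, frequency in keyword_frequencies.items():
        cls = entry_to_class.get(keyword)
        if cls is not None:
            # Sum the frequencies of all keywords belonging to the same class.
            totals[cls] += frequency
    return tuple(totals[c] for c in classes)


def is_unclassified(point, frequency_threshold=5):
    """True when the value in every dimension is below the threshold."""
    return all(w < frequency_threshold for w in point)


# The worked example from the text: football 8, tennis 6, dollar 6.
point = map_contribution({"football": 8, "tennis": 6, "dollar": 6})
```

With the dimension order above, the example contribution maps to economics 6 and sports 14, with 0 in the remaining dimensions.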
In step 104, the distances, in the contribution classifying space, between the point to which a first contribution among the multiple contributions is mapped and each of the points to which the other contributions are mapped are calculated, and it is determined whether the calculated distances include a value less than a predetermined first distance threshold.
In an exemplary embodiment, the first contribution may be determined according to the time at which each contribution was created; for example, the earliest contribution may be taken as the first contribution.
In another exemplary embodiment, a contribution is randomly selected from the multiple contributions as the first contribution.
In another exemplary embodiment, a staff member arbitrarily selects a contribution from the multiple contributions as the first contribution.
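The first two ways of choosing the first contribution can be sketched as follows. The record layout (an `id` and a `created` timestamp) and the function names are assumptions made for this illustration only.

```python
import random
from datetime import datetime

# Hypothetical contribution records; only the creation time matters here.
contributions = [
    {"id": "a", "created": datetime(2024, 1, 3, 9, 0)},
    {"id": "b", "created": datetime(2024, 1, 1, 8, 30)},
    {"id": "c", "created": datetime(2024, 1, 2, 17, 45)},
]


def pick_earliest(items):
    """Take the contribution with the earliest creation time."""
    return min(items, key=lambda c: c["created"])


def pick_random(items, rng=random):
    """Randomly select one contribution; pass a seeded rng to reproduce."""
    return rng.choice(items)


first_by_time = pick_earliest(contributions)
first_by_chance = pick_random(contributions, random.Random(0))
```

Selection by a staff member would simply replace these functions with a manual choice.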
In step 105, if it is determined that the calculated distances include a value less than the predetermined first distance threshold, each contribution corresponding to a value less than the first distance threshold is determined to belong to the same first class as the first contribution.
In an exemplary embodiment, the size of the first distance threshold may be set by an operator according to actual needs. For example, if the first distance threshold is set relatively small, the contributions in the same class are relatively strongly correlated; conversely, if the first distance threshold is set relatively large, the contributions in the same class are relatively weakly correlated.
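Steps 104 and 105, and the effect of the threshold size, can be sketched as follows. The application does not state which distance measure is used; Euclidean distance is assumed here, and the function name `first_class` and the sample points are hypothetical.

```python
import math


def first_class(points, first_index, distance_threshold):
    """Return indices of the first contribution and of every contribution
    whose point lies closer to it than the threshold.

    Euclidean distance is an assumption of this sketch.
    """
    first = points[first_index]
    members = [first_index]
    for i, p in enumerate(points):
        if i != first_index and math.dist(first, p) < distance_threshold:
            members.append(i)
    return members


# Hypothetical two-dimensional points (for example economics, sports).
points = [(6, 14), (5, 12), (20, 1), (7, 15)]

tight = first_class(points, 0, 2.0)   # small threshold: only close neighbours
loose = first_class(points, 0, 20.0)  # large threshold: weakly related ones too
```

With the small threshold only the nearest contribution joins the first class, while the large threshold pulls in every contribution in this sample.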
In some embodiments, the method for clustering multiple contributions may further comprise the following steps:
Determining whether the multiple contributions include multiple contributions that have not been determined to belong to the first class; if so, selecting a second contribution from them, and calculating the distances, in the contribution classifying space, between the point to which the second contribution is mapped and the points to which the other contributions not determined to belong to the first class are mapped;
Determining whether the calculated distances include a value less than a predetermined second distance threshold; and
If so, determining that each contribution corresponding to a value less than the second distance threshold belongs to the same second class as the second contribution.
In some embodiments, all contributions can be classified by repeating similar steps, that is: determining whether the multiple contributions include multiple contributions that have not been determined to belong to the first class or the second class; if so, selecting a third contribution from them, and calculating the distances, in the contribution classifying space, between the point to which the third contribution is mapped and the points to which the other contributions not determined to belong to the first class or the second class are mapped; determining whether the calculated distances include a value less than a predetermined third distance threshold; and if so, determining that each contribution corresponding to a value less than the third distance threshold belongs to the same third class as the third contribution; and so on, until all contributions are classified.
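The repeated procedure can be sketched as a loop over the contributions that have not yet been assigned. Several simplifications are assumed here and are not taken from the application: Euclidean distance is used, a single threshold stands in for the first, second, and third distance thresholds, the next reference contribution is simply the first unassigned one, and a contribution with no neighbour within the threshold forms a class by itself.

```python
import math


def cluster(points, distance_threshold):
    """Group point indices into classes by repeated reference-and-threshold."""
    unassigned = list(range(len(points)))
    classes = []
    while unassigned:
        # Select the next reference contribution (first, second, third, ...).
        reference = unassigned.pop(0)
        members = [reference]
        remaining = []
        for i in unassigned:
            if math.dist(points[reference], points[i]) < distance_threshold:
                members.append(i)    # same class as the reference
            else:
                remaining.append(i)  # left for a later round
        classes.append(members)
        unassigned = remaining
    return classes


# Hypothetical two-dimensional points (for example economics, sports).
points = [(6, 14), (20, 1), (7, 15), (19, 2), (0, 0)]
classes = cluster(points, 3.0)
```

Each round removes at least the reference contribution from the unassigned list, so the loop always terminates with every contribution in exactly one class.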
With the above method for clustering multiple contributions according to an embodiment of the application, a large number of contributions can be clustered automatically, thereby saving manpower.
Next, the apparatus for clustering multiple contributions according to an embodiment of the application is described with reference to Fig. 2.
As shown in the figure, the apparatus may comprise the following components.
An establishing module 201, which can establish a contribution classifying space according to the classes of the news category method. In some embodiments, the establishing module 201 can establish the contribution classifying space according to the classes to which entries were assigned in historical classification.
An extraction module 202, which can extract the keywords in each contribution. In some embodiments, the extraction module 202 may comprise a statistics component and a deletion component. The statistics component can count the frequency with which each entry occurs in the contribution and take the entries whose frequency is higher than a frequency threshold as candidate keywords. The deletion component can delete irrelevant entries from the candidate keywords, thereby obtaining the keywords. In an exemplary embodiment, the deletion component can judge whether a candidate keyword is present in the preset uncorrelated dictionary; if so, the candidate keyword is deleted, and the remaining candidate keywords are taken as the keywords.
A mapping module 203, which can establish contribution coordinates according to the frequencies of the extracted keywords, thereby mapping each contribution to a point in the contribution classifying space. In some embodiments, the mapping module 203 may comprise an assignment component and a summation component. The assignment component can assign each extracted keyword to a class of the news category method, that is, to a dimension of the contribution classifying space. The summation component can calculate the sum of the frequencies with which all keywords in this class (that is, in a given dimension) occur in the contribution, and use this frequency sum as the value of the contribution in the dimension of the contribution classifying space corresponding to this class.
A computing module 204, configured to calculate the distances, in the contribution classifying space, between the point to which a first contribution among the multiple contributions is mapped and each of the points to which the other contributions are mapped, and to determine whether the calculated distances include a value less than a predetermined first distance threshold.
A cluster module 205, configured to determine that each contribution corresponding to a value less than the first distance threshold belongs to the same first class as the first contribution. In some embodiments, the cluster module 205 can also determine whether the multiple contributions include multiple contributions that have not been determined to belong to the first class; if so, select a second contribution from them, and calculate the distances, in the contribution classifying space, between the point to which the second contribution is mapped and the points to which the other contributions not determined to belong to the first class are mapped; determine whether the calculated distances include a value less than a predetermined second distance threshold; and if so, determine that each contribution corresponding to a value less than the second distance threshold belongs to the same second class as the second contribution; and so on, until all contributions are classified.
It should be understood that the above embodiments are merely exemplary and are not intended to limit the scope of the application. A person skilled in the art may make various modifications and improvements without departing from the spirit and essence of the application, and such modifications and improvements shall also be regarded as falling within the protection scope of the application.