CN104346411A - Method and equipment for clustering multiple manuscripts - Google Patents

Info

Publication number
CN104346411A
Authority
CN
China
Prior art keywords
contribution
keyword
classification
frequency
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310346857.4A
Other languages
Chinese (zh)
Other versions
CN104346411B (en)
Inventor
王露
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Founder Holdings Development Co ltd
Beijing Founder Electronics Co Ltd
Original Assignee
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Founder Group Co Ltd and Beijing Founder Electronics Co Ltd
Priority to CN201310346857.4A
Publication of CN104346411A
Application granted
Publication of CN104346411B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method and equipment for clustering multiple manuscripts. The method comprises the steps of establishing a manuscript classification space according to the classifications of a news classification method; extracting the keywords in each manuscript; establishing manuscript coordinates according to the frequency of the extracted keywords so as to map each manuscript to a point in the manuscript classification space; calculating the distances between the manuscripts; and determining that manuscripts whose distance is smaller than a distance threshold belong to one class. With this method, a large number of news manuscripts can be clustered automatically, which saves manpower.

Description

Method and equipment for clustering multiple manuscripts
Technical field
The present application relates to a method and equipment for clustering multiple manuscripts.
Background
In today's society, the volume of information grows geometrically. A large number of records, research documents, and manuscripts are produced every day in fields such as news, history, and science and technology, and these manuscripts sometimes need to be classified.
For example, a newspaper office or a news website receives a large number of news manuscripts every day and may need to classify them in order to report more accurately. Because news manuscripts are highly time-sensitive, classifying them as quickly as possible is very important. If all manuscripts are classified manually, the workload is heavy and the timeliness of the news is difficult to guarantee. If the news manuscripts are first divided into several classes by automatic clustering and then adjusted manually, a great deal of manual work can be saved.
Therefore, there is a need for a method and equipment that automatically cluster multiple manuscripts.
Summary of the invention
To solve the above problem, the present application provides a method and equipment for clustering multiple manuscripts, so that a large number of news manuscripts can be clustered automatically and manpower is saved.
According to a first aspect of the application, a method for clustering multiple manuscripts is provided, comprising:
Establishing a manuscript classification space according to the classifications of a news classification method;
Extracting the keywords in each manuscript;
Establishing manuscript coordinates according to the frequency of the extracted keywords, thereby mapping each manuscript to a point in the manuscript classification space;
Calculating the distances in the manuscript classification space between the point to which a first manuscript of the multiple manuscripts is mapped and the points to which the other manuscripts are mapped, and determining whether any of the calculated distances is smaller than a predetermined first distance threshold; and
If so, determining that the manuscripts corresponding to the distances smaller than the first distance threshold belong to the same first class as the first manuscript.
According to a second aspect of the application, equipment for clustering multiple manuscripts is provided, comprising:
An establishing module, configured to establish a manuscript classification space according to the classifications of a news classification method;
An extraction module, configured to extract the keywords in each manuscript;
A mapping module, configured to establish manuscript coordinates according to the frequency of the extracted keywords, thereby mapping each manuscript to a point in the manuscript classification space;
A computing module, configured to calculate the distances in the manuscript classification space between the point to which a first manuscript of the multiple manuscripts is mapped and the points to which the other manuscripts are mapped, and to determine whether any of the calculated distances is smaller than a predetermined first distance threshold; and
A clustering module, configured to determine that the manuscripts corresponding to the distances smaller than the first distance threshold belong to the same first class as the first manuscript.
Brief description of the drawings
Fig. 1 is a flowchart of a method for clustering multiple manuscripts according to an embodiment of the application; and
Fig. 2 is a schematic diagram of equipment for clustering multiple manuscripts according to an embodiment of the application.
Detailed description
The method and equipment for clustering multiple manuscripts according to embodiments of the application are described in detail below with reference to the embodiments and the drawings.
In this application, " news category method " refers to the method for classifying to contribution according to the kind of news, such as, contribution can be categorized as finance and economics, physical culture, science and technology, politics, amusement class etc., can also be football, basketball, tennis, gymnastics etc. by classification sports.
In this application, " contribution classifying space " refers to the space that the classification divided using news category method is set up as dimension.
In this application, the entry in " uncorrelated dictionary " refers to and usually to occur in contribution in contribution but the entry irrelevant with the classification carried out according to news category method, such as, and " we ", " but " etc.
In this application, " cluster " refers to and the contribution with certain correlativity is divided into same class, such as, multiple contribution is divided into politics, amusement class etc.
First with reference to Fig. 1, the method for multiple contribution being carried out to cluster according to the application's embodiment will be described.
In a step 101, contribution classifying space is set up according to the classification of news category method.
In the exemplary embodiment, news category method can comprise multiple classifications such as such as finance and economics, physical culture, science and technology, politics, amusement.In some embodiments, can using the dimension of each classifications such as finance and economics, physical culture, science and technology, politics, amusement as contribution classifying space.Such as, news category method comprises N number of classification, then contribution classifying space can be N dimension space, and the coordinate of the point in contribution classifying space can be expressed as (T1, W1, T2, W2 ..., Tn, Wn), wherein, Ti is i-th dimension of contribution classifying space corresponding to the i-th classification of news category method, and Wi is the weight of Ti.
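The construction of the classification space in step 101 can be sketched as follows. This is only an illustration: the category list and the function names `build_classification_space` and `origin` are assumptions made for the example and do not come from the application.

```python
# Hypothetical category list; the application names these only as examples.
CATEGORIES = ["finance", "sports", "technology", "politics", "entertainment"]


def build_classification_space(categories):
    """Map each classification T_i to the index of its dimension."""
    return {name: i for i, name in enumerate(categories)}


def origin(space):
    """Return the point whose weight W_i is 0 in every dimension."""
    return [0] * len(space)


space = build_classification_space(CATEGORIES)
print(space["sports"])  # 1
print(origin(space))    # [0, 0, 0, 0, 0]
```

Storing only the weights in a fixed category order is a design choice of this sketch; the (T1, W1, ..., TN, WN) notation of the text carries the same information.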
In step 102, the keywords in each manuscript are extracted.
In some embodiments, step 102 may comprise: calculating the frequency with which each term occurs in the manuscript, and taking the terms whose frequency is higher than a preset frequency threshold (for example, five occurrences) as candidate keywords; and removing irrelevant candidate keywords (for example, common words such as "we" and "but"), thereby obtaining the keywords.
In an exemplary embodiment, an irrelevant-word dictionary can be preset. It records words that occur frequently in manuscripts but cannot be used for classification under the news classification method, such as "we" and "but". For example, after the candidate keywords "we" and "football" are obtained, the term "we" is found in the preset irrelevant-word dictionary and is therefore removed from the candidate keywords, whereas the term "football" is not found in the dictionary and is therefore determined to be a keyword.
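A minimal sketch of the two-stage extraction in step 102 is given below. It assumes the manuscript has already been split into terms (word segmentation is not described in the application), and the names `IRRELEVANT_WORDS` and `extract_keywords` are hypothetical.

```python
from collections import Counter

# Hypothetical irrelevant-word dictionary, using the two examples from the text.
IRRELEVANT_WORDS = {"we", "but"}


def extract_keywords(terms, frequency_threshold=5, irrelevant=IRRELEVANT_WORDS):
    """Return {keyword: frequency} for one manuscript given as a list of terms."""
    counts = Counter(terms)
    # Stage 1: terms occurring more often than the threshold become candidates.
    candidates = {t: c for t, c in counts.items() if c > frequency_threshold}
    # Stage 2: candidates found in the irrelevant-word dictionary are removed.
    return {t: c for t, c in candidates.items() if t not in irrelevant}


terms = ["we"] * 7 + ["football"] * 6 + ["goal"] * 2
print(extract_keywords(terms))  # {'football': 6}
```

Here "we" passes the frequency test but is dropped by the dictionary, while "goal" never becomes a candidate because it occurs only twice.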
In step 103, manuscript coordinates are established according to the frequency of the extracted keywords, so that each manuscript is mapped to a point in the manuscript classification space.
In an exemplary embodiment, establishing the manuscript coordinates according to the frequency of the extracted keywords may comprise: assigning each keyword extracted from the manuscript to the classification of the news classification method to which it belongs, that is, to a dimension Ti of the manuscript classification space; and calculating the sum of the frequencies of all keywords in the manuscript that belong to this classification, and using this sum as the value Wi of the manuscript in that dimension, so that each manuscript is mapped to a point in the manuscript classification space.
For example, suppose "football" occurs 8 times, "tennis" 6 times, and "dollar" 6 times in a manuscript. The news classification method assigns "football" and "tennis" to the sports class and "dollar" to the economics class. In the manuscript classification space, the value of the manuscript in the sports dimension is therefore 8 + 6 = 14, and its value in the economics dimension is 6. In this way, each manuscript can be mapped to a point in the manuscript classification space.
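The worked example above can be reproduced with a short sketch. The keyword-to-classification table `KEYWORD_TO_CATEGORY` and the function name `manuscript_coordinates` are assumptions for illustration; the application does not say how keywords are assigned to classifications.

```python
CATEGORIES = ["sports", "economics"]

# Hypothetical lookup table assigning each keyword to one classification.
KEYWORD_TO_CATEGORY = {
    "football": "sports",
    "tennis": "sports",
    "dollar": "economics",
}


def manuscript_coordinates(keyword_counts, keyword_to_category, categories):
    """Sum keyword frequencies per classification to obtain the weights W_i."""
    index = {name: i for i, name in enumerate(categories)}
    point = [0] * len(categories)
    for keyword, count in keyword_counts.items():
        category = keyword_to_category.get(keyword)
        if category is not None:  # keywords with no known classification are skipped
            point[index[category]] += count
    return point


counts = {"football": 8, "tennis": 6, "dollar": 6}
print(manuscript_coordinates(counts, KEYWORD_TO_CATEGORY, CATEGORIES))  # [14, 6]
```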
In an exemplary embodiment, when the values Wi of a manuscript in every dimension are all smaller than the frequency threshold, the manuscript is placed in the "unclassified" class.
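The "unclassified" rule amounts to a single check over the weights of a point. In this sketch the function name `is_unclassified` and the default threshold of 5 are assumptions.

```python
def is_unclassified(point, frequency_threshold=5):
    """True if every weight W_i of the point is below the frequency threshold."""
    return all(weight < frequency_threshold for weight in point)


print(is_unclassified([3, 0, 4]))  # True
print(is_unclassified([14, 6]))    # False
```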
In step 104, the distances in the manuscript classification space between the point to which the first manuscript of the multiple manuscripts is mapped and the points to which the other manuscripts are mapped are calculated, and it is determined whether any of the calculated distances is smaller than a predetermined first distance threshold.
In an exemplary embodiment, the first manuscript can be determined according to the time at which the manuscripts were created; for example, the earliest manuscript can be used as the first manuscript.
In another exemplary embodiment, a manuscript is selected at random from the multiple manuscripts as the first manuscript.
In another exemplary embodiment, a staff member selects any one of the multiple manuscripts as the first manuscript.
In step 105, if it is determined that any of the calculated distances is smaller than the predetermined first distance threshold, the manuscripts corresponding to the distances smaller than the first distance threshold are determined to belong to the same first class as the first manuscript.
In an exemplary embodiment, the size of the first distance threshold can be set by an operator according to actual needs. For example, if the first distance threshold is set relatively small, the manuscripts within one class are relatively strongly correlated; conversely, if it is set relatively large, the manuscripts within one class are relatively weakly correlated.
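Steps 104 and 105 can be sketched as below. The application does not name a distance measure, so Euclidean distance is used here purely as an assumption (`math.dist` requires Python 3.8 or later); the function name `first_class` is also hypothetical.

```python
import math


def first_class(points, first_index, threshold):
    """Return indices of the first manuscript and all manuscripts near it."""
    first = points[first_index]
    members = [first_index]
    for i, point in enumerate(points):
        # Step 104: distance from the first manuscript to every other one.
        # Step 105: a distance below the threshold puts the manuscript in the class.
        if i != first_index and math.dist(first, point) < threshold:
            members.append(i)
    return members


points = [[14, 6], [12, 5], [0, 20]]
print(first_class(points, first_index=0, threshold=5.0))  # [0, 1]
```

With a smaller threshold such as 2.0, the second manuscript (about 2.24 away) would no longer join the class, which matches the remark that a smaller threshold yields more strongly correlated classes.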
In some embodiments, the method for clustering multiple manuscripts may further comprise the following steps:
Determining whether the multiple manuscripts include manuscripts that have not been determined to belong to the first class; if so, selecting a second manuscript from them, and calculating the distances in the manuscript classification space between the point to which the second manuscript is mapped and the points to which the other manuscripts not determined to belong to the first class are mapped;
Determining whether any of the calculated distances is smaller than a predetermined second distance threshold; and
If so, determining that the manuscripts corresponding to the distances smaller than the second distance threshold belong to the same second class as the second manuscript.
In some embodiments, all manuscripts can be classified by repeating similar steps, that is: determining whether the multiple manuscripts include manuscripts that have not been determined to belong to the first class or the second class; if so, selecting a third manuscript from them, and calculating the distances in the manuscript classification space between the point to which the third manuscript is mapped and the points to which the other manuscripts not determined to belong to the first class or the second class are mapped; determining whether any of the calculated distances is smaller than a predetermined third distance threshold; and if so, determining that the manuscripts corresponding to the distances smaller than the third distance threshold belong to the same third class as the third manuscript; and so on until all manuscripts are classified.
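The repeated procedure can be sketched as a loop over the manuscripts that are still unassigned. Several simplifications here are assumptions rather than part of the application: a single threshold is reused for every class (the text allows separate first, second, and third thresholds), Euclidean distance is used, the next seed is simply the earliest remaining manuscript, and a seed with no neighbours forms a class of its own.

```python
import math


def cluster_manuscripts(points, threshold):
    """Group point indices into classes; points are assumed ordered by creation time."""
    unassigned = list(range(len(points)))
    classes = []
    while unassigned:
        seed, rest = unassigned[0], unassigned[1:]
        # Compare the seed only with manuscripts not yet placed in a class.
        members = [seed] + [
            i for i in rest if math.dist(points[seed], points[i]) < threshold
        ]
        classes.append(members)
        unassigned = [i for i in rest if i not in members]
    return classes


points = [[14, 6], [12, 5], [0, 20], [1, 19], [50, 50]]
print(cluster_manuscripts(points, threshold=5.0))  # [[0, 1], [2, 3], [4]]
```

Because each class is built around one seed, the result depends on the order in which seeds are chosen; a different choice of first manuscript can produce different classes.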
With the above method for clustering multiple manuscripts according to embodiments of the application, a large number of manuscripts can be clustered automatically, which saves manpower.
Equipment for clustering multiple manuscripts according to an embodiment of the application is described below with reference to Fig. 2.
As shown in the figure, the equipment may comprise the following components.
An establishing module 201, which can establish a manuscript classification space according to the classifications of the news classification method. In some embodiments, the establishing module 201 can establish the manuscript classification space according to the classifications to which terms have been assigned in historical classification.
An extraction module 202, which can extract the keywords in each manuscript. In some embodiments, the extraction module 202 may comprise a statistics component and a deletion component. The statistics component can count the frequency with which each term occurs in the manuscript and take the terms whose frequency is higher than the frequency threshold as candidate keywords. The deletion component can delete irrelevant terms from the candidate keywords, thereby obtaining the keywords. In an exemplary embodiment, the deletion component can judge whether a candidate keyword is present in the preset irrelevant-word dictionary; if so, the candidate keyword is deleted, and the remaining candidate keywords are used as the keywords.
A mapping module 203, which can establish manuscript coordinates according to the frequency of the extracted keywords, thereby mapping each manuscript to a point in the manuscript classification space. In some embodiments, the mapping module 203 may comprise an attribution component and a summation component. The attribution component can assign each extracted keyword to a classification of the news classification method, that is, to a dimension of the manuscript classification space. The summation component can calculate the sum of the frequencies with which all keywords in that classification (that is, in that dimension) occur in the manuscript, and use this sum as the value of the manuscript in the dimension of the manuscript classification space corresponding to that classification.
A computing module 204, configured to calculate the distances in the manuscript classification space between the point to which the first manuscript of the multiple manuscripts is mapped and the points to which the other manuscripts are mapped, and to determine whether any of the calculated distances is smaller than a predetermined first distance threshold.
A clustering module 205, configured to determine that the manuscripts corresponding to the distances smaller than the first distance threshold belong to the same first class as the first manuscript. In some embodiments, the clustering module 205 can also determine whether the multiple manuscripts include manuscripts that have not been determined to belong to the first class; if so, select a second manuscript from them, and calculate the distances in the manuscript classification space between the point to which the second manuscript is mapped and the points to which the other manuscripts not determined to belong to the first class are mapped; determine whether any of the calculated distances is smaller than a predetermined second distance threshold; and if so, determine that the manuscripts corresponding to the distances smaller than the second distance threshold belong to the same second class as the second manuscript; and so on until all manuscripts are classified.
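One way the five modules might be wired together in software is sketched below. The class name `ClusteringEquipment`, its method names, the default thresholds, the use of Euclidean distance, and the single shared distance threshold are all assumptions made for illustration; the application describes the modules only functionally.

```python
import math
from collections import Counter


class ClusteringEquipment:
    """Hypothetical software counterpart of modules 201 to 205."""

    def __init__(self, categories, keyword_to_category, irrelevant_words,
                 frequency_threshold=5, distance_threshold=8.0):
        # Establishing module 201: one dimension per classification.
        self.dimension = {name: i for i, name in enumerate(categories)}
        self.keyword_to_category = keyword_to_category
        self.irrelevant_words = irrelevant_words
        self.frequency_threshold = frequency_threshold
        self.distance_threshold = distance_threshold

    def extract(self, terms):
        # Extraction module 202: statistics component, then deletion component.
        counts = Counter(terms)
        return {t: c for t, c in counts.items()
                if c > self.frequency_threshold and t not in self.irrelevant_words}

    def map_to_point(self, terms):
        # Mapping module 203: attribution component, then summation component.
        point = [0] * len(self.dimension)
        for keyword, count in self.extract(terms).items():
            category = self.keyword_to_category.get(keyword)
            if category in self.dimension:
                point[self.dimension[category]] += count
        return point

    def cluster(self, manuscripts):
        # Computing module 204 (distances) and clustering module 205 (classes).
        points = [self.map_to_point(terms) for terms in manuscripts]
        unassigned = list(range(len(points)))
        classes = []
        while unassigned:
            seed, rest = unassigned[0], unassigned[1:]
            members = [seed] + [
                i for i in rest
                if math.dist(points[seed], points[i]) < self.distance_threshold
            ]
            classes.append(members)
            unassigned = [i for i in rest if i not in members]
        return classes


equipment = ClusteringEquipment(
    categories=["sports", "economics"],
    keyword_to_category={"football": "sports", "tennis": "sports", "dollar": "economics"},
    irrelevant_words={"we", "but"},
)
manuscripts = [
    ["football"] * 8 + ["tennis"] * 6 + ["dollar"] * 6 + ["we"] * 9,
    ["football"] * 7 + ["dollar"] * 6,
    ["dollar"] * 20,
]
print(equipment.cluster(manuscripts))  # [[0, 1], [2]]
```

The three manuscripts map to the points [14, 6], [7, 6], and [0, 20]; the first two are 7 apart and share a class under the assumed threshold of 8, while the third forms its own class.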
It should be understood that the above embodiments are merely exemplary and do not limit the scope of the application. Those skilled in the art can make various modifications and improvements without departing from the spirit and essence of the application, and such modifications and improvements shall also be regarded as falling within the scope of protection of the application.

Claims (10)

1. A method for clustering multiple manuscripts, comprising:
Establishing a manuscript classification space according to the classifications of a news classification method;
Extracting the keywords in each manuscript;
Establishing manuscript coordinates according to the frequency of the extracted keywords, thereby mapping each manuscript to a point in the manuscript classification space;
Calculating the distances in the manuscript classification space between the point to which a first manuscript of the multiple manuscripts is mapped and the points to which the other manuscripts are mapped, and determining whether any of the calculated distances is smaller than a predetermined first distance threshold; and
If so, determining that the manuscripts corresponding to the distances smaller than the first distance threshold belong to the same first class as the first manuscript.
2. The method of claim 1, wherein the step of extracting the keywords in each manuscript comprises:
Calculating the frequency with which each term occurs in the manuscript, and taking the terms whose frequency is higher than a frequency threshold as candidate keywords; and
Removing irrelevant terms from the candidate keywords, thereby obtaining the keywords.
3. The method of claim 2, wherein the step of removing irrelevant terms from the candidate keywords, thereby obtaining the keywords, comprises:
Judging whether a candidate keyword is present in a preset irrelevant-word dictionary; and
If so, removing the candidate keyword; if not, using the candidate keyword as a keyword.
4. The method of claim 1, wherein the step of establishing a manuscript classification space according to the classifications of a news classification method comprises:
Using each classification of the news classification method as a dimension of the manuscript classification space.
5. The method of claim 1, wherein the step of establishing manuscript coordinates according to the frequency of the extracted keywords comprises:
Assigning each extracted keyword to a classification of the news classification method;
Calculating the sum of the frequencies with which all keywords in that classification occur in the manuscript, and using this sum as the value of the manuscript in the dimension of the manuscript classification space corresponding to that classification.
6. The method of claim 1, wherein the first manuscript is determined according to the time at which the manuscripts were created, or is selected at random, or is selected by a staff member.
7. The method of claim 6, further comprising:
Determining whether the multiple manuscripts include manuscripts that have not been determined to belong to the first class; if so, selecting a second manuscript from them, and calculating the distances in the manuscript classification space between the point to which the second manuscript is mapped and the points to which the other manuscripts not determined to belong to the first class are mapped;
Determining whether any of the calculated distances is smaller than a predetermined second distance threshold; and
If so, determining that the manuscripts corresponding to the distances smaller than the second distance threshold belong to the same second class as the second manuscript.
8. Equipment for clustering multiple manuscripts, comprising:
An establishing module, configured to establish a manuscript classification space according to the classifications of a news classification method;
An extraction module, configured to extract the keywords in each manuscript;
A mapping module, configured to establish manuscript coordinates according to the frequency of the extracted keywords, thereby mapping each manuscript to a point in the manuscript classification space;
A computing module, configured to calculate the distances in the manuscript classification space between the point to which a first manuscript of the multiple manuscripts is mapped and the points to which the other manuscripts are mapped, and to determine whether any of the calculated distances is smaller than a predetermined first distance threshold; and
A clustering module, configured to determine that the manuscripts corresponding to the distances smaller than the first distance threshold belong to the same first class as the first manuscript.
9. The equipment of claim 8, wherein the extraction module comprises:
A statistics component, configured to count the frequency with which each term occurs in the manuscript and to take the terms whose frequency is higher than a frequency threshold as candidate keywords; and
A deletion component, configured to delete irrelevant terms from the candidate keywords, thereby obtaining the keywords.
10. The equipment of claim 8, wherein the mapping module comprises:
An attribution component, configured to assign each extracted keyword to a classification of the news classification method; and
A summation component, configured to calculate the sum of the frequencies with which all keywords in that classification occur in the manuscript, and to use this sum as the value of the manuscript in the dimension of the manuscript classification space corresponding to that classification.
CN201310346857.4A 2013-08-09 2013-08-09 Method and equipment for clustering multiple manuscripts Expired - Fee Related CN104346411B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310346857.4A CN104346411B (en) 2013-08-09 2013-08-09 Method and equipment for clustering multiple manuscripts

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310346857.4A CN104346411B (en) 2013-08-09 2013-08-09 Method and equipment for clustering multiple manuscripts

Publications (2)

Publication Number Publication Date
CN104346411A (en) 2015-02-11
CN104346411B (en) 2018-11-06

Family

ID=52502023

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310346857.4A Expired - Fee Related CN104346411B (en) 2013-08-09 2013-08-09 Method and equipment for clustering multiple manuscripts

Country Status (1)

Country Link
CN (1) CN104346411B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050071365A1 (en) * 2003-09-26 2005-03-31 Jiang-Liang Hou Method for keyword correlation analysis
CN101079026A (en) * 2007-07-02 2007-11-28 北京百问百答网络技术有限公司 Text similarity, acceptation similarity calculating method and system and application system
CN101174273A (en) * 2007-12-04 2008-05-07 清华大学 News event detecting method based on metadata analysis
CN101694670A (en) * 2009-10-20 2010-04-14 北京航空航天大学 Chinese Web document online clustering method based on common substrings
CN102346753A (en) * 2010-08-01 2012-02-08 青岛理工大学 Semi-supervised text clustering method and device fusing pairwise constraints and keywords
CN102855312A (en) * 2012-08-24 2013-01-02 武汉大学 Domain-and-theme-oriented Web service clustering method

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105243118A (en) * 2015-09-29 2016-01-13 武汉传神信息技术有限公司 Manuscript data classification method
CN105760526A (en) * 2016-03-01 2016-07-13 网易(杭州)网络有限公司 News classification method and device
CN105760526B (en) * 2016-03-01 2019-05-07 网易(杭州)网络有限公司 A kind of method and apparatus of news category
CN108536695A (en) * 2017-03-02 2018-09-14 北京嘀嘀无限科技发展有限公司 A kind of polymerization and device of geographical location information point
CN108536695B (en) * 2017-03-02 2021-06-04 北京嘀嘀无限科技发展有限公司 Aggregation method and device of geographic position information points
CN109063983A (en) * 2018-07-18 2018-12-21 北京航空航天大学 A kind of natural calamity loss real time evaluating method based on social media data
CN109063983B (en) * 2018-07-18 2022-06-21 北京航空航天大学 Natural disaster damage real-time evaluation method based on social media data
CN111209390A (en) * 2020-01-06 2020-05-29 北大方正集团有限公司 News display method and system, and computer readable storage medium
CN111209390B (en) * 2020-01-06 2023-09-05 新方正控股发展有限责任公司 News display method and system and computer readable storage medium

Also Published As

Publication number Publication date
CN104346411B (en) 2018-11-06

Similar Documents

Publication Publication Date Title
KR102080362B1 (en) Query expansion
CN103744928B (en) A kind of network video classification method based on history access record
WO2017092622A1 (en) Legal provision search method and device
CN106156372B (en) A kind of classification method and device of internet site
JP6428795B2 (en) Model generation method, word weighting method, model generation device, word weighting device, device, computer program, and computer storage medium
CN104346411A (en) Method and equipment for clustering multiple manuscripts
CN103902597A (en) Method and device for determining search relevant categories corresponding to target keywords
CN103176962A (en) Statistical method and statistical system of text similarity
WO2017075912A1 (en) News events extracting method and system
CN103838754A (en) Information searching device and method
CN106294815B (en) A kind of clustering method and device of URL
CN103268330A (en) User interest extraction method based on image content
CN111368867B (en) File classifying method and system and computer readable storage medium
CN104239321B (en) A kind of data processing method and device of Search Engine-Oriented
CN102855245A (en) Image similarity determining method and image similarity determining equipment
KR101082589B1 (en) System for providing Aspect Level News Browsing Service that reduce Media-Bias Effect and Method therefor
CN103336771A (en) Data similarity detection method based on sliding window
CN107085568A (en) A kind of text similarity method of discrimination and device
CN102156746A (en) Method for evaluating performance of search engine
CN108875050B (en) Text-oriented digital evidence-obtaining analysis method and device and computer readable medium
CN104462347A (en) Keyword classifying method and device
CN101615255A (en) The method that a kind of video text multiframe merges
CN104778202B (en) The analysis method and system of event evolutionary process based on keyword
CN105512270A (en) Method and device for determining related objects
CN113011503B (en) Data evidence obtaining method of electronic equipment, storage medium and terminal

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220617

Address after: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031

Patentee after: New founder holdings development Co.,Ltd.

Patentee after: BEIJING FOUNDER ELECTRONICS Co.,Ltd.

Address before: 100871, Beijing, Haidian District, Cheng Fu Road, No. 298, Zhongguancun Fangzheng building, 5 floor

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: BEIJING FOUNDER ELECTRONICS Co.,Ltd.
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20181106