
CN102236654A - Web useless link filtering method based on content relevancy - Google Patents

Web useless link filtering method based on content relevancy


Publication number
CN102236654A
Authority
CN
China
Prior art keywords
link
text
webpage
content
web
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2010101559607A
Other languages
Chinese (zh)
Inventor
汪敏
刘轩山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GUANGDONG UCAP INTERNET INFORMATION TECHNOLOGY CO LTD
Original Assignee
GUANGDONG UCAP INTERNET INFORMATION TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GUANGDONG UCAP INTERNET INFORMATION TECHNOLOGY CO LTD
Priority to CN2010101559607A
Publication of CN102236654A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method for filtering useless Web links based on content relevancy. The method comprises the following steps: removing irrelevant advertisement links and navigation links from a page by a statistical method that uses text position information in the page; and analyzing the relevancy between the content of the page and the content of the linked pages, and removing useless links whose content is irrelevant. The method removes useless links more effectively. Computing page rank on the purified link structure graph improves the page rank result, raises the quality of highly ranked pages, and brings more high-value websites into the top results.

Description

Method for filtering useless Web links based on content relevance
Technical field
The present invention relates to a method for filtering useless links in Web pages, and in particular to a method for filtering useless links in Web pages based on content relevance analysis. It belongs to the field of Internet search technology.
Background art
With the rapid development of the Internet, search engines used to query Internet information play an increasingly important role. The main task of a search engine is to find relevant web pages and return them to the user ranked by page importance. As the number of Web pages grows, page content becomes richer, and page links become more varied, search engines increasingly fall short of what is asked of them. There are many reasons for this; an important one is the growing proliferation of useless links in Web pages.
By analysis, the links in a Web page can be divided into the following four classes:
Manually created links: most links of this class are created by a person who compares the content of two web pages and judges them to be related. The page author classifies them as "related links", so most such links carry a strong recommendation meaning. In some cases, however, the target page is not related to the topic of the source page and is only loosely associated with it at some point.
Navigation links: links of this class are generated by the page author through a template module and are essentially identical across pages of the same website. Their main purpose is to let users move between the different sections of the site. They help users navigate but have nothing to do with recommending related pages.
Commercial links: links of this class are generated by dynamic functions in the page, generally to increase the commercial revenue of the website. They make up a large proportion of all links; in pages of .com sites in particular, they account for more than half. They contribute essentially nothing to recommending content-related pages.
Local links: links of this class point from a main page to sub-pages of the same website. The site owner adds them to recommend new or popular pages within the site and to increase their click-through rate.
Fig. 1 shows a news page captured from Sina. The links in region 5 are manually created links (the first class above). The links in region 1 serve navigation within the site (the second class above); they point to the home pages of Sina's other channels. The links in regions 2 and 6 are local links (the fourth class above), generated to recommend the latest news of the day; it is apparent that the pages they point to are unrelated in content to this page. The links in regions 4 and 7 are advertisement links (the third class above), added by the website for economic gain.
Based on this analysis, the inventors regard the link types that carry no recommendation meaning, namely the links in the first class that lack topical relevance and all links of the second, third, and fourth classes, as "useless links". Links in the first class that do have topical relevance are called "effective links".
Many search engines use link analysis and, in combination with factors such as anchor text and term-frequency statistics, have achieved great success. The success of link analysis depends to a large extent on the validity of Web page links, or in other words on the soundness of the following assumption: when page A contains a link to page B, the author of page A considers the content of page B important, and in most cases the contents of pages A and B share a related topic. This content relevance assumption is the foundation on which link analysis rests.
In the early days of the Internet, links in web pages largely satisfied the content relevance assumption, and the propagation of relevance between pages was meaningful. However, with the continued development of Web technology and the growth in the number of pages, more and more pages are generated automatically by page generation tools. Many links have therefore lost their relevance meaning, and the proportion of useless links keeps rising. At the same time, as search engines have come into wide use, many site administrators have introduced large numbers of useless links to raise their search ranking, and many spam websites have appeared. In addition, most commercial websites now treat commercial revenue as their ultimate goal, which has led to large numbers of advertisement links. For these reasons, the content relevance and recommendation meaning of Web links are under serious threat. If nothing is done, the constructed link structure graph cannot correctly reflect the relationships between pages, and rankings computed from such a graph will no longer be reliable.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for filtering useless Web links based on content relevance. The method constructs a more reasonable link structure graph before link analysis is performed, thereby improving the effect of useless link filtering.
To achieve the above object, the present invention adopts the following technical solution:
A method for filtering useless Web links based on content relevance, characterized by comprising the following steps:
(1) using text position information in the web page, removing unrelated commercial links and navigation links in the page by a statistical method;
(2) performing relevance analysis between the content of the web page and the content of the pages its links point to, and removing useless links whose content is unrelated.
In step (1), the HTML document is first converted into a DOM tree; the smallest subtree of the DOM tree that contains the body content and the topic-related links is then located to obtain the required link information.
For the DOM tree, block nodes are first used to divide the tree into subtrees, and the link ratio of each subtree is calculated and compared with a preset threshold. If the link ratio is below the threshold, the block is marked as the main body block. The nearest parent block node containing this block is then found by tracing back up the tree and taken as the target node, and the links within this parent block node are output as the basis for subsequent analysis.
In step (2), before the content relevance of web pages is analyzed, the text of each page is preprocessed to extract content representative of each text for comparison.
Text preprocessing comprises the following steps:
First, word segmentation is performed on the text; the word frequencies in the text are then counted and the TF-IDF vector is calculated, forming a vector space model for the text collection. The feature vectors of the texts are used to calculate the content similarity between texts, and the content similarity is used to remove content-unrelated links from the page.
The content similarity is determined by the degree of overlap of the terms contained in the feature vectors of the texts.
Alternatively, the content similarity is determined by the cosine of the angle between the feature vectors of the texts.
In step (2), the content relevance analysis comprises three layers: the first layer analyzes content relevance using the anchor text; the second layer uses the title of the web page; the third layer uses the body text of the web page. If all three layers conclude that the page topics are unrelated, the link is deleted from the link list of the parent page.
The useless link filtering method provided by the present invention is based on content relevance analysis. It allows the links remaining after filtering to reflect the relationships between web pages more faithfully and makes the link relevance assumption more reasonable, thereby greatly improving the validity of link analysis results.
Description of drawings
The present invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is a schematic diagram of a news page captured from Sina;
Fig. 2 shows an example of a DOM tree converted from an HTML document;
Fig. 3 shows the data after the second filtering step;
Fig. 4 shows the websites hosting the top 100 pages under three rankings.
Embodiment
The Web useless link filtering method proposed by the present invention can be roughly divided into two parts. The first part uses text position information in the web page and a statistical method to remove unrelated links such as advertisements and navigation. The second part, building on the first, performs relevance analysis between the content of the page and the content of the pages its links point to, and removes links whose content is unrelated. Each part is described in detail below.
I. Filtering based on text position
At present, most web pages are built from a unified template, and page authors generally place topic-related links below the body text of the page; the filtering in this part is built on this assumption. The filtering first converts the HTML document into a DOM tree, then locates in the DOM tree the smallest subtree that contains the body content and the topic-related links, and obtains the required link information in preparation for the subsequent content analysis and removal of topic-unrelated links.
DOM (Document Object Model) is a standard interface specification defined by the W3C, an application programming interface (API) for HTML and XML documents. When an HTML document is parsed, it is converted into a DOM tree; each node of the tree is an object, and the content of the HTML document is fully contained in the nodes. Fig. 2 shows an example of a DOM tree converted from an HTML document. In one specific embodiment of the present invention, the CyberNeko HTML Parser is used to parse HTML documents and generate DOM trees.
As shown in Fig. 1, an HTML page can be divided into different regions. The main part of a page (region 3 in Fig. 1) consists mostly of text, while the other parts contain many links. Based on this observation, the present invention defines the link ratio:
$$\mathrm{LinkRatio}(b_i) = \frac{\mathrm{LinkCount}(b_i)}{\mathrm{ContentLength}(b_i)} \tag{1}$$
where LinkCount(b_i) is the number of links in block i and ContentLength(b_i) is the length of the non-link content in block i. A threshold th is set for the link ratio; when the link ratio of a block is below this threshold, the block is considered to be the main part of the page.
Working on the DOM tree, the present invention uses block nodes to divide the tree into subtrees, calculates the link ratio of each subtree, and compares it with the threshold. If the link ratio is below the threshold, the block is marked as the main body block. The nearest parent block node containing this block is then found by tracing back up the tree and taken as the target node, and the links within this node are output as the basis for subsequent analysis.
Because the choice of block nodes determines the granularity at which the page is partitioned, the present invention, based on experiments, preferentially selects table (div) and tr nodes as block nodes.
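The position-based step can be sketched as follows. This is a minimal illustration in Python rather than the CyberNeko-based embodiment: it uses the standard library `html.parser` to build a tree of block nodes only (`table`, `div`, `tr`), computes the link ratio of formula (1) for every block, and returns the links of the parent of the main body block. The class and function names, the sample page, the threshold of 0.05, and the rule of picking the block with the lowest link ratio when several fall below the threshold are all assumptions made for this sketch; the text does not specify them.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"table", "div", "tr"}  # block nodes named in the text


class Block:
    """One block node; text and links of non-block descendants are folded in."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.links = []      # href values of <a> elements directly in this block
        self.text_len = 0    # length of non-link text directly in this block


class BlockTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Block("root")
        self.current = self.root
        self.link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            block = Block(tag, self.current)
            self.current.children.append(block)
            self.current = block
        elif tag == "a":
            self.link_depth += 1
            self.current.links.append(dict(attrs).get("href", ""))

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1
        elif tag == self.current.tag and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        if self.link_depth == 0:          # text inside <a> is link content
            self.current.text_len += len(data.strip())


def totals(block):
    """Return (link count, non-link text length) for the whole subtree."""
    links, text = len(block.links), block.text_len
    for child in block.children:
        child_links, child_text = totals(child)
        links += child_links
        text += child_text
    return links, text


def link_ratio(block):
    links, text = totals(block)
    return links / text if text else float("inf")


def descendants(block):
    for child in block.children:
        yield child
        yield from descendants(child)


def subtree_links(block):
    found = list(block.links)
    for child in block.children:
        found.extend(subtree_links(child))
    return found


def main_block_links(html, threshold):
    builder = BlockTreeBuilder()
    builder.feed(html)
    builder.close()
    candidates = [b for b in descendants(builder.root) if link_ratio(b) < threshold]
    if not candidates:
        return []
    main_block = min(candidates, key=link_ratio)  # assumption: lowest ratio wins
    return subtree_links(main_block.parent)       # links of the nearest parent block


SAMPLE_HTML = """
<div id="page">
  <div id="nav"><a href="/news">News</a> <a href="/sports">Sports</a></div>
  <div id="main">
    <div id="body">The city council approved a new flood defence plan on Monday
      after months of debate about cost and the route of the river wall.</div>
    <div id="related"><a href="/story2">Earlier report on the flood plan</a></div>
  </div>
  <div id="ads"><a href="http://ads.example/1">Buy now</a></div>
</div>
"""

print(main_block_links(SAMPLE_HTML, threshold=0.05))  # ['/story2']
```

In the sample page the navigation and advertisement blocks contain no non-link text, so their link ratio is unbounded and they are never selected; only the link that sits next to the body text survives. If the main body block is a top-level block, this sketch returns every link on the page, which a fuller implementation would need to handle.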
II. Removing topic-unrelated links based on text content
Besides the link structure, the text associated with a link also provides a great deal of information for analyzing the page: the anchor text of the link, and the title and main content of the page the link points to. The content similarity between this information and the text of the source page can be used to judge whether the link carries recommendation meaning. It is therefore necessary to use this information to filter the links again and remove topic-unrelated links.
Compared with other documents, web page text has limited structure; where some structure exists, it concerns format rather than textual content, and the structure of different types of content is inconsistent. In addition, the text is in natural language, and apart from full-text matching it is difficult for a computer to judge the content similarity of two texts. Therefore, before analyzing the content similarity of two (or more) web pages, the text of the pages is preprocessed to extract the main content representative of the texts, which is then compared.
To represent text content, a well-defined text model is needed to produce a representation that a computer can process. Text models generally fall into three classes: the Boolean model, the probabilistic model, and the vector space model. The vector space model is widely adopted; it is more accurate than the Boolean model and does not require the learning process of the probabilistic model. The present invention therefore uses the vector space model for text similarity analysis.
Preprocessing the body content of all web pages comprises the following steps:
1. Word segmentation
In text, the word is the smallest meaningful language unit that can be used independently. English words are naturally delimited by spaces, whereas Chinese uses the character as its basic written unit, with no obvious separator between words. Chinese word segmentation is therefore the foundation of, and the key to, Chinese information processing.
The present invention processes each text with a word segmentation technique in order to separate the individual words in the content, so that the content can be represented as a bag of words, forming the feature vectors of the text collection for subsequent processing.
2. Counting word frequencies
Subsequent calculations require statistics over a large vocabulary, so the frequency of occurrence of each segmented word in each text is counted and stored.
3. Calculating the TF-IDF (term frequency-inverse document frequency) vector
The vector space model uses a TF-IDF vector representation. The TF-IDF vector reflects the word space of the text collection; each of its components corresponds to one word. TF-IDF is defined as:
$$d(i) = \mathrm{TF\text{-}IDF}(i) = \mathrm{TF}(W_i, Doc) \cdot \mathrm{IDF}(W_i) = \mathrm{TF}(W_i, Doc) \cdot \log\frac{D}{\mathrm{DF}(W_i)} \tag{2}$$
where TF(W_i, Doc) is the frequency of word W_i in text Doc, D is the total number of texts, and DF(W_i) is the number of texts in the collection in which W_i occurs at least once.
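As an illustration of formula (2), the following Python sketch computes TF-IDF weights for a small collection of already segmented texts. Two choices are assumptions of this sketch, not statements from the source: TF is taken as the raw occurrence count, and the logarithm is natural. The function name and the sample tokens are likewise hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per text."""
    total = len(docs)                      # D in formula (2)
    df = Counter()                         # DF(W_i): texts containing the term
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)               # TF(W_i, Doc) as a raw count
        vectors.append({t: tf[t] * math.log(total / df[t]) for t in tf})
    return vectors


docs = [
    ["link", "filter", "web", "link"],
    ["web", "page", "rank"],
    ["link", "page"],
]
vecs = tf_idf_vectors(docs)
print(vecs[0])
```

A term that occurs in every text receives weight 0, since log(D/DF) is then log 1; this is what lets TF-IDF discount words that do not help distinguish topics.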
4. Forming the vector space
Once the feature terms of the texts have been extracted, the vector space model for the text collection can be built. In the vector space model, the text space is regarded as a vector space spanned by a set of orthogonal term vectors.
Each text is represented as a feature vector in this space:
$$V(d) = (t_1, w_1(d);\ \ldots;\ t_i, w_i(d);\ \ldots;\ t_n, w_n(d)) \tag{3}$$
where t_i is the i-th term in text d and w_i(d) is the weight of t_i in text d.
w_i(d) is generally defined as a function of the frequency tf_i(d) with which t_i occurs in text d. The present invention uses TF-IDF values as the weights; that is, the calculated TF-IDF value of each word is used as w_i(d) in the vector.
Once the feature vectors of the texts are obtained, they can be used to calculate the content similarity between two (or more) texts, and the content similarity can be used to remove content-unrelated links from the page.
There are many ways to calculate content similarity from feature vectors; the present invention mainly uses two. The first considers the degree of overlap of the terms contained in two feature vectors. The text similarity is defined as:
$$\mathrm{sim}(d_i, d_j) = \frac{n_{\cap}(d_i, d_j)}{n_{\cup}(d_i, d_j)} \tag{4}$$
where sim(d_i, d_j) is the text similarity between texts d_i and d_j, n_∩(d_i, d_j) is the number of terms shared by the feature vectors V(d_i) and V(d_j), and n_∪(d_i, d_j) is the total number of distinct terms in V(d_i) and V(d_j).
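The overlap measure of formula (4) only needs the sets of terms, not their weights. A minimal Python sketch, with a hypothetical function name and the assumption that two empty term sets have similarity 0:

```python
def overlap_similarity(terms_i, terms_j):
    """Shared terms divided by all distinct terms, as in formula (4)."""
    a, b = set(terms_i), set(terms_j)
    union = a | b
    if not union:            # assumption: two empty texts count as unrelated
        return 0.0
    return len(a & b) / len(union)


print(overlap_similarity(["flood", "river", "plan"], ["river", "plan", "vote"]))  # 0.5
```

Because the measure ignores vector dimension, it can compare a short text such as a link label with a much longer one.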
The second considers the cosine of the angle between two feature vectors. The text similarity is defined as:
$$\mathrm{sim}(d_i, d_j) = \frac{V(d_i) \cdot V(d_j)}{|V(d_i)| \, |V(d_j)|} = \frac{\sum_{m=1}^{n} w_{im} w_{jm}}{\sqrt{\sum_{m=1}^{n} w_{im}^2} \, \sqrt{\sum_{m=1}^{n} w_{jm}^2}} \tag{5}$$
where sim(d_i, d_j) is the text similarity between texts d_i and d_j, V(d_i) is the feature vector of text d_i, and w_{im} is the TF-IDF value of word t_m in text d_i.
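Formula (5) can be sketched in Python with sparse vectors stored as dictionaries from term to weight. Terms missing from one vector are treated as weight 0, which gives both vectors the same dimension implicitly. The function name, the sample weights, and the convention of returning 0 when either vector has zero length are assumptions of this sketch.

```python
import math


def cosine_similarity(vec_i, vec_j):
    """Cosine of the angle between two {term: weight} vectors, formula (5)."""
    dot = sum(w * vec_j.get(term, 0.0) for term, w in vec_i.items())
    norm_i = math.sqrt(sum(w * w for w in vec_i.values()))
    norm_j = math.sqrt(sum(w * w for w in vec_j.values()))
    if norm_i == 0.0 or norm_j == 0.0:   # assumption: empty vector is unrelated
        return 0.0
    return dot / (norm_i * norm_j)


parent = {"flood": 3.0, "river": 2.0, "rescue": 1.0}
target = {"flood": 1.0, "river": 1.0}
print(round(cosine_similarity(parent, target), 3))
```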
For removing useless links, content similarity analysis of web documents differs from that of ordinary documents, because web documents have link information available. In particular, the anchor text of a link and the title of the target page can be used for content similarity analysis. These texts are much shorter than the page body and are, to a large extent, brief summaries of it.
Based on this idea, the present invention judges progressively, performing a three-layer analysis. If the texts at the current layer are found to be topically related, the contents are considered related and no further calculation is performed; if they are found unrelated, analysis continues at the next layer. Only when all three layers return an unrelated result are the contents considered unrelated and the link removed from the list.
Layer 1: content relevance analysis based on anchor text
For a link in a web page, the anchor text is the easiest information to obtain. Anchor text is a summary of the target page written by the author of another page, so using it in place of the page's original content is worthwhile. Because anchor text is generally short, the computational cost of the text similarity is also small.
However, anchor text contains few words, so the vectors built from it and from the parent page differ in dimension, and the cosine of the angle between the two vectors cannot be calculated; only the first method can be used. In one example calculation, the inventors take only the ten words with the highest TF-IDF values in the parent page as the feature vector of the parent page and use the first method to obtain the text similarity between the linked page and the parent page. The similarity is then compared with a threshold: if it is above the threshold, the pages are considered content-related; if it is below the threshold, the calculation proceeds to the next layer.
Layer 2: content relevance analysis based on the page title
Page authors generally give their pages a brief title, marked in the HTML document with the title tag. The title usually describes the page and often contains important keywords from the body text, so it can also stand in for the body text in content similarity analysis. However, the title is added by the page author and reflects the author's subjectivity; some authors add many unrelated words to the title to raise their search engine ranking and increase clicks. For this reason, title analysis is placed after anchor text analysis.
The relevance calculation for the title is the same as for the anchor text.
Layer 3: content relevance analysis based on the page body text
Because the body text of a page is large, this layer is analyzed only when the two preceding layers have both returned a negative result. In this layer the feature vectors of the texts can be given the same dimension (a missing word can be added with its TF-IDF value set to 0), so the second calculation method can be used to obtain the content similarity between the two pages.
If all three layers find the topics of the two (or more) pages unrelated, the link is deleted from the link list of the parent page.
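The three-layer decision can be sketched as a short cascade in Python. The first two layers compare the ten highest-weighted terms of the parent page with the link text and with the target title using the overlap measure; the third compares the full body vectors using the cosine. The function names, the sample data, and both threshold values are hypothetical: the source states that thresholds are used but does not give their values.

```python
import math


def overlap(terms_a, terms_b):
    a, b = set(terms_a), set(terms_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def top_terms(vec, k=10):
    """The k terms with the highest TF-IDF weight (ten in the example above)."""
    ranked = sorted(vec.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:k]]


def keep_link(parent_vec, link_terms, title_terms, target_vec,
              overlap_th=0.1, cosine_th=0.2):   # hypothetical thresholds
    parent_top = top_terms(parent_vec)
    if overlap(parent_top, link_terms) > overlap_th:    # layer 1: link text
        return True
    if overlap(parent_top, title_terms) > overlap_th:   # layer 2: target title
        return True
    return cosine(parent_vec, target_vec) > cosine_th   # layer 3: body text


parent = {"flood": 3.0, "river": 2.0, "rescue": 1.0}
links = [
    (["flood", "update"], ["news"], {"flood": 1.0}),
    (["cheap", "loans"], ["bank", "loans"], {"loans": 2.0, "bank": 1.0}),
    (["click", "here"], ["home"], {"flood": 1.0, "river": 1.0}),
]
kept = [link for link in links if keep_link(parent, *link)]
print(len(kept))  # 2
```

The cheap comparisons run first, so the body vectors are only compared for links that the short texts could not settle, which matches the cost argument given for the layer ordering.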
The following experimental data illustrate the effect of the content-relevance-based useless link filtering method proposed by the present invention, and a comparison of page PageRank values before and after link removal illustrates the improvement it brings to the link analysis algorithm.
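For readers who want to reproduce such a before-and-after comparison, PageRank can be computed on a link graph with a simple power iteration. The sketch below is a generic textbook formulation, not the implementation used in the experiments: the damping factor of 0.85, the fixed iteration count, the uniform redistribution of rank from pages without out-links, and the tiny sample graph are all assumptions.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph: {page: [linked pages]}. Returns {page: rank}, ranks sum to 1."""
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread over all pages.
        dangling = sum(rank[u] for u in nodes if not graph.get(u))
        new = {u: (1 - damping) / n + damping * dangling / n for u in nodes}
        for u, targets in graph.items():
            for t in targets:
                new[t] += damping * rank[u] / len(targets)
        rank = new
    return rank


graph = {"a": ["b"], "c": ["b"], "b": ["a"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # b
```

Running the same function on the graph before and after link filtering gives the two sets of PageRank values to compare.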
The experimental data set used in the present invention was drawn at random from CWT200g. For CWT200g, 627,036 hosts providing Web services within China were found through web crawling in November 2005. After duplicate sites and spam sites were removed, 88,303 websites remained. Pages were collected from these sites to a crawl depth of 3 per site, with no limit on the amount of data per site, yielding the initial data set; page deduplication was then performed to obtain a set of non-duplicate pages. Sampling according to the site sizes reflected in this page set finally produced the CWT200g test set, with a size of 197 GB.
The present invention randomly selected 1,524,077 pages from 1,421 websites in CWT200g. After removing incompletely saved pages and keeping only pages of type html, htm, xml, jsp, and asp, 1,427,001 pages remained and were used as the data set for the subsequent experiments.
Table 1 gives some statistics for the experimental data set. Page out-degree is the number of links in a page; internal links are links that point to pages within the experimental data set; and the number of pages with adjusted out-degree 0 is the number of pages whose out-degree becomes 0 after links pointing outside the data set are removed.
Statistic                            Count
Number of websites                   1421
Number of web pages                  1427001
Average pages per website            1004.2
Number of links                      80457759
Average links per page               56.38
Pages with out-degree 0              65256
Internal links                       27312578
Internal link ratio                  33.95%
Pages with adjusted out-degree 0     220122
Table 1
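The link statistics in Table 1 can be derived from a page-to-links mapping. The Python sketch below shows one way to compute them on a toy graph; the function name, the dictionary keys, and the sample URLs are hypothetical, and the definitions follow the explanation of Table 1 (an internal link points to a page inside the data set, and the adjusted out-degree counts internal links only).

```python
def link_statistics(graph):
    """graph: {page: [link targets]} for every page in the data set."""
    pages = set(graph)
    total_links = sum(len(targets) for targets in graph.values())
    internal = sum(1 for targets in graph.values() for t in targets if t in pages)
    return {
        "pages": len(pages),
        "links": total_links,
        "avg_links_per_page": total_links / len(pages) if pages else 0.0,
        "out_degree_zero": sum(1 for targets in graph.values() if not targets),
        "internal_links": internal,
        "internal_ratio": internal / total_links if total_links else 0.0,
        # Pages left with no links once external targets are dropped.
        "adjusted_out_degree_zero": sum(
            1 for targets in graph.values()
            if not any(t in pages for t in targets)
        ),
    }


toy = {
    "http://a.example/1": ["http://a.example/2", "http://x.example/"],
    "http://a.example/2": ["http://y.example/"],
    "http://a.example/3": [],
}
stats = link_statistics(toy)
print(stats["internal_links"], stats["adjusted_out_degree_zero"])  # 1 2
```

The adjusted count is always at least the plain out-degree-zero count, as in Table 1, because removing external links can only empty more pages.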
The present invention uses the two steps described above to remove useless links. Because the two steps handle different types of useless links, the proportions of links they remove also differ considerably.
After both filtering steps were performed, the proportion of links removed at each step was computed for different types of web pages. Fig. 3 shows the data after the second filtering step. As Fig. 3 shows, the first step removes a high proportion of links, consistent with the earlier analysis; pages of .com sites contain large numbers of advertisement links, so the proportion removed from them in this step is the highest. Education and government pages would not normally be expected to contain many advertisement links, yet Fig. 3 shows that the removal ratio for these two types is also high. Inspection of some pages in the data set shows that in these two types most links are concentrated on the two sides of the page and are mainly local links such as "popular" or "latest". Judged by topical relevance, these are also useless links, so from the standpoint of link analysis they too should be filtered out.
Fig. 3 also shows that after both filtering steps the proportion of remaining effective links is broadly consistent with the proportion of effective links obtained by manual evaluation. The filtering conditions in this embodiment are set fairly loosely, so the proportion of effective links obtained is slightly higher than that from manual evaluation.
A search engine is in essence a recommendation system and should recommend as many high-quality websites to the user as possible. A ranking algorithm should therefore avoid having many of the top-ranked pages come from the same website, which would seriously harm the diversity of the recommendations. The improvement that the useless link filtering method brings to PageRank is analyzed further from this angle.
Fig. 4 compares the websites hosting the top 100 pages under three rankings. Here "baseline" is the ranking without any processing, "first" is the ranking after only the first filtering step, and "second" is the ranking after both filtering steps. As Fig. 4 shows, the top 100 pages by PageRank come from 46 websites after the first step and from 69 websites after the second step. The number of websites represented in the top 100 increases markedly after the second step, showing that this step surfaces more valuable websites and offers the user more valuable choices.
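The diversity measure used here, the number of distinct websites among the top-ranked pages, is straightforward to compute from a score table. In the Python sketch below a website is identified by the host part of the URL, which is an assumption of the sketch; the function name and the sample URLs and scores are hypothetical.

```python
from urllib.parse import urlparse


def distinct_sites_in_top(scores, k=100):
    """scores: {url: rank score}. Counts distinct hosts among the k best pages."""
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return len({urlparse(url).netloc for url, _ in top})


scores = {
    "http://a.example/1": 0.5,
    "http://a.example/2": 0.4,
    "http://b.example/": 0.3,
    "http://c.example/": 0.1,
}
print(distinct_sites_in_top(scores, k=3))  # 2
```

Applying the function to the baseline ranking and to the rankings after each filtering step yields the kind of comparison summarized for Fig. 4.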
The Web useless link filtering method based on content relevance provided by the present invention has been described in detail above. For a person of ordinary skill in the art, any obvious modification made without departing from the essence of the present invention will constitute an infringement of the patent right of the present invention and will incur the corresponding legal liability.

Claims (10)

1. the invalid link filter method of the Web of a content-based correlativity is characterized in that comprising following step:
(1) utilizes text position information in the webpage, remove by statistical method that incoherent commercial paper link and navigation type link in the webpage;
(2) content of web page contents and link webpage pointed is carried out correlation analysis, remove the incoherent invalid link of content.
2. the invalid link filter method of Web as claimed in claim 1 is characterized in that:
Among the described step (1), at first html document is converted into the dom tree structure, searching comprises body matter and the minimum subtree that link relevant with theme in the dom tree structure then, obtains needed link information.
3. the invalid link filter method of Web as claimed in claim 2 is characterized in that:
Utilize CyberNeko HTML Parser resolver that html document is converted into the dom tree structure.
4. the invalid link filter method of Web as claimed in claim 2 is characterized in that:
For described dom tree structure, at first utilize blocking node that dom tree is divided into each subtree, the calculating linking ratio compares with predetermined threshold value in each subtree; If less than threshold value, then this piece is set to main body block, and retrospective search comprises nearest father's blocking node of this piece then, as destination node, exports the link in this father's blocking node, as the basis of subsequent analysis with this father's blocking node.
5. The Web useless link filtering method according to claim 4, characterized in that:
When block nodes are selected, table (div) and tr nodes are preferentially chosen as block nodes.
6. The Web useless link filtering method according to claim 1, characterized in that:
In step (2), before the content relevance analysis of the web pages is carried out, the text of each web page is preprocessed to extract content representative of that text for comparison.
7. The Web useless link filtering method according to claim 6, characterized in that:
The text preprocessing comprises the following steps:
First, the text is segmented into words; the word frequencies in the text are then counted and the TF-IDF vectors are calculated, forming a vector space model corresponding to the text collection; the feature vectors of the texts are used to calculate the content similarity between texts, and the content similarity is used to remove links in the web page whose content is irrelevant.
8. The Web useless link filtering method according to claim 7, characterized in that:
The content similarity is determined by the degree of overlap of the terms contained in the feature vectors of the texts.
9. The Web useless link filtering method according to claim 7, characterized in that:
The content similarity is determined by the cosine of the angle between the feature vectors of the texts.
10. The Web useless link filtering method according to claim 1, characterized in that:
In step (2), the content relevance analysis comprises three layers of operations: the first layer performs the content relevance analysis based on the entry text of the link; the second layer performs it based on the title of the linked web page; the third layer performs it based on the body text of the linked web page; if all three layers conclude that the topics of the web pages are irrelevant, the link is deleted from the link list of the parent web page.
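The block selection described in claims 2 to 5 can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the patented implementation: the claims name the CyberNeko HTML Parser, whereas this sketch parses well-formed markup with Python's `xml.etree.ElementTree`; the link ratio is assumed to be the share of a block's text that sits inside `a` elements; the threshold value 0.5, the function names, and the sample page are all hypothetical.

```python
import xml.etree.ElementTree as ET

BLOCK_TAGS = {"table", "div", "tr"}  # block nodes preferred in claim 5
THRESHOLD = 0.5                      # assumed value; the claims give no number


def text_len(node):
    """Number of non-whitespace-trimmed characters of text inside a node."""
    return sum(len(t.strip()) for t in node.itertext())


def link_ratio(node):
    """Assumed definition: anchor-text length divided by total text length."""
    total = text_len(node)
    linked = sum(text_len(a) for a in node.iter("a"))
    return linked / total if total else 1.0


def main_content_links(root):
    """Return the hrefs found in the parent blocks of low-link-ratio blocks."""
    parent = {child: p for p in root.iter() for child in p}
    blocks = [n for n in root.iter() if n.tag in BLOCK_TAGS]
    # Innermost blocks: block nodes that contain no other block node.
    leaves = [b for b in blocks
              if not any(d.tag in BLOCK_TAGS for d in b.iter() if d is not b)]
    targets = []
    for block in leaves:
        if link_ratio(block) < THRESHOLD:      # treated as a main body block
            node = parent.get(block)
            while node is not None and node.tag not in BLOCK_TAGS:
                node = parent.get(node)        # walk up to nearest parent block
            target = node if node is not None else block
            if target not in targets:
                targets.append(target)
    links = []
    for target in targets:
        for a in target.iter("a"):
            href = a.get("href")
            if href and href not in links:
                links.append(href)
    return links


# Hypothetical page: a navigation block and a content block with one link.
PAGE = """
<html><body>
  <div id="nav">
    <a href="/home">Home</a> <a href="/ads">Buy now</a>
  </div>
  <div id="main">
    <div id="article">
      A long paragraph of body text about link analysis and page ranking.
    </div>
    <div id="related">
      <a href="/pagerank">PageRank notes</a>
    </div>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE.strip())
print(main_content_links(root))  # ['/pagerank']
```

In this toy page the navigation block consists entirely of link text, so it is never selected, while the article block has a link ratio of zero and its parent block supplies the only link that is kept.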
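The relevance test of claims 7, 9 and 10 can likewise be sketched in Python. The sketch makes several assumptions that the claims do not specify: word segmentation is approximated by whitespace splitting, the IDF weights use a smoothed formula computed over only the four texts being compared rather than over a full text collection, and the similarity threshold 0.1 is arbitrary. The function names and sample texts are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Stand-in for word segmentation: lower-case and split on whitespace."""
    return text.lower().split()


def tfidf_vectors(texts):
    """Build one sparse TF-IDF vector (a dict) per text."""
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)  # document frequency
    vectors = []
    for doc in docs:
        total = sum(doc.values()) or 1
        vectors.append({
            # Smoothed IDF (an assumption) keeps common terms above zero.
            term: (count / total) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in doc.items()
        })
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def is_useless_link(parent_text, entry_text, target_title, target_body,
                    threshold=0.1):
    """True only if all three layers fall below the similarity threshold."""
    vecs = tfidf_vectors([parent_text, entry_text, target_title, target_body])
    parent = vecs[0]
    return all(cosine(parent, layer) < threshold for layer in vecs[1:])


parent_page = "pagerank link analysis for web search ranking"

print(is_useless_link(parent_page,
                      "link analysis tutorial",
                      "introduction to pagerank",
                      "pagerank ranks web pages using link analysis"))  # False

print(is_useless_link(parent_page,
                      "cheap flights",
                      "discount airline tickets",
                      "book cheap flights and hotels today"))  # True
```

A link is dropped only when the entry text, the title, and the body text all look unrelated to the parent page, which matches the conservative three-layer rule of claim 10. The term-overlap measure of claim 8 could be substituted for `cosine` without changing the rest of the sketch.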
CN2010101559607A 2010-04-26 2010-04-26 Web useless link filtering method based on content relevancy Pending CN102236654A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010101559607A CN102236654A (en) 2010-04-26 2010-04-26 Web useless link filtering method based on content relevancy

Publications (1)

Publication Number Publication Date
CN102236654A true CN102236654A (en) 2011-11-09

Family

ID=44887312

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010101559607A Pending CN102236654A (en) 2010-04-26 2010-04-26 Web useless link filtering method based on content relevancy

Country Status (1)

Country Link
CN (1) CN102236654A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1536483A (en) * 2003-04-04 2004-10-13 陈文中 Method for extracting and processing network information and its system
CN1971555A (en) * 2005-11-24 2007-05-30 王凤仙 Method for testing and filtering links pointed to malicious website from return results of web searching
CN101035128A (en) * 2007-04-18 2007-09-12 大连理工大学 Three-folded webpage text content recognition and filtering method based on the Chinese punctuation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王攀: "主题搜索引擎的设计与实现", 《CNKI中国知网》 *
荆涛等: "基于可视布局信息的网页噪音去除算法", 《华南理工大学学报(自然科学版)》 *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103198062A (en) * 2012-01-04 2013-07-10 百度在线网络技术(北京)有限公司 Method for monitoring page dead link and JS error
CN102663062B (en) * 2012-03-30 2015-01-14 北京奇虎科技有限公司 Method and device for processing invalid links in search result
CN102663062A (en) * 2012-03-30 2012-09-12 奇智软件(北京)有限公司 Method and device for processing invalid links in search result
CN105100904A (en) * 2014-05-09 2015-11-25 深圳市快播科技有限公司 Video advertisement blocking method, device and browser
CN105279204A (en) * 2014-07-25 2016-01-27 阿里巴巴集团控股有限公司 Information push method and apparatus
CN105279204B (en) * 2014-07-25 2019-04-09 阿里巴巴集团控股有限公司 Information-pushing method and device
US10042824B2 (en) 2014-12-04 2018-08-07 International Business Machines Corporation Detection and elimination for inapplicable hyperlinks
US10042825B2 (en) 2014-12-04 2018-08-07 International Business Machines Corporation Detection and elimination for inapplicable hyperlinks
CN104598532A (en) * 2014-12-29 2015-05-06 中国联合网络通信有限公司广东省分公司 Information processing method and device
CN105183784A (en) * 2015-08-14 2015-12-23 天津大学 Content based junk webpage detecting method and detecting apparatus thereof
CN105183784B (en) * 2015-08-14 2020-04-28 天津大学 Content-based spam webpage detection method and detection device thereof
CN106874313A (en) * 2015-12-14 2017-06-20 北京国双科技有限公司 The monitoring method and device of website name of tv column
CN106874310A (en) * 2015-12-14 2017-06-20 北京国双科技有限公司 The monitoring method and device of website name of tv column
CN106874313B (en) * 2015-12-14 2020-05-22 北京国双科技有限公司 Website column name monitoring method and device
CN106874310B (en) * 2015-12-14 2020-08-11 北京国双科技有限公司 Website column name monitoring method and device
CN105930468A (en) * 2016-04-22 2016-09-07 江苏金鸽网络科技有限公司 Rule-based information relativity judgment method
CN105930468B (en) * 2016-04-22 2019-05-17 江苏金鸽网络科技有限公司 A kind of rule-based information correlativity determination method
CN107370718A (en) * 2016-05-12 2017-11-21 深圳市深信服电子科技有限公司 The detection method and device of black chain in webpage
CN107370718B (en) * 2016-05-12 2020-12-18 深信服科技股份有限公司 Method and device for detecting black chain in webpage
CN110020264A (en) * 2018-12-29 2019-07-16 阿里巴巴集团控股有限公司 A kind of determination method and device of broken hyperlink

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20111109