CN102236654A - Web useless link filtering method based on content relevancy

Info

Publication number: CN102236654A
Application number: CN2010101559607A
Authority: CN
Inventors: 汪敏; 刘轩山
Original assignee: GUANGDONG UCAP INTERNET INFORMATION TECHNOLOGY CO LTD
Current assignee: GUANGDONG UCAP INTERNET INFORMATION TECHNOLOGY CO LTD
Priority date: 2010-04-26
Filing date: 2010-04-26
Publication date: 2011-11-09

Abstract

The invention discloses a Web useless link filtering method based on content relevancy. The method comprises the following steps: using the position information of text in the page, irrelevant advertisement links and navigation links are removed by a statistical method; relevance analysis is then carried out between the content of the page and the content of the linked pages, and useless links whose content is irrelevant are removed. The method removes useless links more effectively. Computing PageRank on the purified link graph improves the ranking result: the quality of highly ranked pages rises and more high-value websites appear among them.

Description

Web useless link filtering method based on content relevancy

Technical field

The present invention relates to a method for filtering useless links (also called invalid links) in Web pages, and in particular to a Web page useless link filtering method based on content relevance analysis. It belongs to the field of Internet search technology.

Background technology

With the rapid development of the Internet, search engines play an increasingly important role in finding information online. A search engine's main task is to find relevant web pages and return them to the user sorted by page importance. As the number of Web pages grows, page content becomes richer, and page links become more varied, search engines increasingly fall short of what is asked of them. There are many reasons for this; an important one is the growing proliferation of useless links in Web pages.

By analysis, the links in a Web page can be divided into the following four classes:

Manually created links: most links in this class are created by a person who compares the content of two pages and links them because they are related. The page creator classifies them as "peer links", so most of them carry a strong recommendation meaning. For some of them, however, the target page is not related to the topic of the linking page and is associated with it only at some minor point.

Navigation links: links in this class are generated by the page creator with a template module and are essentially identical across the pages of one website. They mainly let the user move between the different sections of the site. They help the user navigate but have nothing to do with recommending related pages.

Commercial links: links in this class are generated by dynamic functions in the page, generally to increase the site's commercial revenue. They make up a large share of all links; in .com pages in particular they account for more than half. They contribute essentially nothing to recommending content-related pages.

Local links: this class mainly covers links from a main page to sub-pages of the same website. The site creator adds them to recommend pages on the site that are new or attract more attention, in order to raise their click-through rate.

Fig. 1 is a news page captured from Sina. The links in region 5 are manually created links (the first class above). The links in region 1 are navigation links within the site (the second class above); they point to the home pages of Sina's other channels. The links in regions 2 and 6 are local links (the third class above), generated to recommend the day's latest news; the pages they point to are visibly unrelated in content to this page. The links in regions 4 and 7 are advertisement links (the fourth class above), added for the site's economic benefit.

By analysis, the inventors group together the link types that carry no recommendation meaning, namely first-class links without topical relevance and all links of the second, third, and fourth classes, and call them "useless links" (invalid links). First-class links with topical relevance are called "effective links".

Many search engines use link analysis and, combined with factors such as anchor text and term-frequency statistics, have achieved great success. The success of link analysis depends to a large extent on the validity of Web page links, that is, on the soundness of the following assumption: when page A contains a link to page B, the author of page A considers the content of page B important, and in most cases the content of A and B shares a related topic. This content relevance assumption is the foundation on which link analysis rests.

In the early days of the Internet, the links in web pages largely satisfied the content relevance assumption, and the relevance passed between pages was meaningful. As Web technology developed and the number of pages kept growing, more and more pages came to be generated automatically by page generation tools, so many links lost their relevance meaning and the proportion of useless links rose steadily. At the same time, with the spread of search engines, many site administrators introduced large numbers of useless links to raise their search ranking, and many spam sites appeared. In addition, most commercial websites treat commercial interest as their final goal, which has brought in a mass of advertisement links. For these reasons, the content relevance and recommendation meaning of links on the Web are under serious threat. If nothing is done, the link graph that is constructed cannot correctly reflect the relationships between pages, and ranking results obtained from such a graph are no longer valid.

Summary of the invention

The technical problem to be solved by the invention is to provide a Web useless link filtering method based on content relevancy. The method constructs a more reasonable link graph before link analysis is carried out, which improves the effect of useless link filtering.

To achieve the above object, the invention adopts the following technical solution:

A Web useless link filtering method based on content relevancy, characterized by comprising the following steps:

(1) using the position information of text in the web page, removing irrelevant commercial links and navigation links from the page by a statistical method;

(2) performing relevance analysis between the content of the web page and the content of the pages its links point to, and removing useless links whose content is irrelevant.

In step (1), the HTML document is first converted into a DOM tree; the minimal subtree that contains the body content and the topic-related links is then located in the DOM tree to obtain the required link information.

For the DOM tree, blocking nodes are first used to divide the tree into subtrees, and the link ratio of each subtree is computed and compared with a preset threshold. If the ratio is below the threshold, the block is marked as a main-body block; the search then backtracks to the nearest parent blocking node that contains this block, takes that parent as the target node, and outputs the links inside it as the basis for subsequent analysis.

In step (2), before the content relevance of the pages is analyzed, the page text is preprocessed so that content representing each text is extracted for comparison.

The text preprocessing comprises the following steps:

The text is first segmented into words; word frequencies in the text are then counted and TF-IDF vectors are computed, forming the vector space model for the text collection. The feature vectors of the texts are used to compute the content similarity between texts, and the content similarity is used to remove content-irrelevant links from the page.

The content similarity is determined by the degree of overlap between the terms contained in the feature vectors of the texts.

Alternatively, the content similarity is determined by the cosine of the angle between the feature vectors of the texts.

In step (2), the content relevance analysis comprises three layers: the first layer analyzes content relevance using the anchor text of the link; the second uses the page title; the third uses the page body text. If all three layers conclude that the page topics are unrelated, the link is deleted from the parent page's link list.

The useless link filtering method provided by the invention is based on content relevance analysis. After filtering, the remaining links reflect the relationships between pages more faithfully, which makes the link relevance assumption more reasonable and greatly improves the validity of link analysis results.

Description of drawings

The invention is described in further detail below with reference to the drawings and specific embodiments.

Fig. 1 is a schematic diagram of a news page captured from Sina;

Fig. 2 shows an example of a DOM tree converted from an HTML document;

Fig. 3 shows the data after the second filtering step;

Fig. 4 shows the websites hosting the top 100 pages after three rankings.

Embodiment

The Web useless link filtering method proposed by the invention consists broadly of two parts. The first uses the position information of text in the page and a statistical method to remove irrelevant links such as advertisements and navigation. The second, building on the first, analyzes the relevance between the page's content and the content of the pages its links point to, and removes the links whose content is irrelevant. Each part is described in detail below.

I. Filtering based on text position

At present, most web pages are built from a unified template, and page creators generally place the topic-related links below the body text of the page; the filtering in this part rests on that assumption. The filtering first converts the HTML document into a DOM tree and then searches the tree for the minimal subtree that contains the body content and the topic-related links. This yields the required link information and prepares for the subsequent content analysis and the removal of topic-irrelevant links.

DOM (Document Object Model) is a standard interface specification defined by the W3C, an application programming interface (API) for HTML and XML documents. After parsing, an HTML document is converted into a DOM tree; every node of the tree is an object, and the content of the HTML document is fully contained in the nodes. Fig. 2 is an example of a DOM tree converted from an HTML document. In one specific embodiment of the invention, the CyberNeko HTML Parser is used to parse HTML documents and generate the DOM tree.

As shown in Fig. 1, an HTML page can be divided into different regions. In a web page, the main part (region 3 in Fig. 1) consists mostly of text, while the other parts contain many links. Based on this observation, the invention defines the link ratio:

LinkRatio(b_{i}) = \frac{LinkCount(b_{i})}{ContentLength(b_{i})} \qquad (1)

Here LinkCount(b_i) is the number of links in block i and ContentLength(b_i) is the length of the non-link content in block i. A threshold th is set for the link ratio; when the link ratio of a block is below this threshold, the block is taken to be the main part of the page.

Working on the DOM tree, the invention uses blocking nodes to divide the tree into subtrees, computes the link ratio of each subtree, and compares it with the threshold. If the ratio is below the threshold, the block is marked as a main-body block. The search then backtracks to the nearest parent blocking node that contains this block, takes that node as the target node, and outputs the links inside it as the basis for subsequent analysis.

Because the choice of blocking nodes determines the granularity at which the page is partitioned, the invention, based on experiments, preferentially uses table (div) and tr nodes as blocking nodes.
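The block partitioning and link-ratio test described above can be sketched as follows. This is a minimal illustration rather than the embodiment itself: it uses Python's standard html.parser instead of CyberNeko HTML Parser, the names Block, BlockParser and main_block_links are hypothetical, the threshold value 0.05 is an assumed placeholder, and the fallback for a main-body block that has no enclosing blocking node is an assumption.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"table", "div", "tr"}  # blocking nodes named in the text


class Block:
    """One blocking node with the links and non-link text found directly in it."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.links = []      # href values of <a> elements directly in this block
        self.text_len = 0    # characters of non-link text directly in this block

    def link_ratio(self):
        # Formula (1): LinkCount / ContentLength. A block with no text is
        # treated as link-only (assumption), so it can never be a main block.
        return len(self.links) / self.text_len if self.text_len else float("inf")

    def all_links(self):
        out = list(self.links)
        for child in self.children:
            out.extend(child.all_links())
        return out


class BlockParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Block("root")
        self.current = self.root
        self.blocks = []
        self.in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            block = Block(tag, self.current)
            self.current.children.append(block)
            self.blocks.append(block)
            self.current = block
        elif tag == "a":
            self.in_link += 1
            href = dict(attrs).get("href")
            if href:
                self.current.links.append(href)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.current.tag == tag:
            self.current = self.current.parent
        elif tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        if not self.in_link:  # only non-link text counts toward ContentLength
            self.current.text_len += len(data.strip())


def main_block_links(html, threshold=0.05):
    """Return links of the parent blocking node of every main-body block."""
    parser = BlockParser()
    parser.feed(html)
    parser.close()
    found = []
    for block in parser.blocks:
        if block.link_ratio() < threshold:
            # Backtrack to the nearest enclosing blocking node; a top-level
            # block is used as its own target (assumption).
            target = block.parent if block.parent is not parser.root else block
            for href in target.all_links():
                if href not in found:
                    found.append(href)
    return found
```

Called on a page whose navigation bar sits in one div and whose body text and related links share another div, the function returns only the related links, since the navigation block has a high link ratio and is not the parent of any main-body block.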

II. Removing topic-irrelevant links based on text content

Besides link structure, the text associated with a link also provides a great deal of information for page analysis: the link's anchor text and the title and main content of the page it points to. By measuring the content similarity between this information and the text of the source page, one can judge whether the link carries a recommendation meaning. It is therefore necessary to use this information to filter the links again and remove topic-irrelevant ones.

Compared with other documents, web page text has limited structure; what structure it has concerns layout rather than text content, and different kinds of content are structured inconsistently. Moreover, the text is natural language, and apart from full-text matching a computer can hardly judge the content similarity between two texts. Therefore, before the content similarity of two (or more) pages is analyzed, the page text is preprocessed to extract the main content that represents each text, and the extracted content is then compared.

To represent text content, a well-defined text model is needed that yields a representation a computer can process. Text models generally fall into three classes: the Boolean model, the probabilistic model, and the vector space model (Vector Space Model). The vector space model is widely adopted; it is more precise than the Boolean model and does not need the learning process of the probabilistic model. The invention therefore uses the vector space model for text similarity analysis.

Text preprocessing of the body content of all pages comprises the following steps:

1. Word segmentation

In text, the word is the smallest meaningful language unit that can be used independently. English words are naturally delimited by spaces, whereas Chinese is written in characters with no explicit separator between words. Chinese word segmentation is therefore the foundation of, and the key to, Chinese information processing.

The invention applies word segmentation to each piece of content in order to separate its words, so that the text can be represented as a bag of words (bag-of-word) that reflects its content and forms the feature vectors of the text collection for subsequent processing.

2. Counting word frequencies

Subsequent computation requires statistics over a large vocabulary, so the occurrence frequency of every word in the segmented vocabulary of each text is counted and stored.

3. Computing the TF-IDF (term frequency-inverse document frequency) vector

The vector space model uses the TF-IDF vector representation. The TF-IDF vector reflects the word space of the text collection, and each of its components corresponds to one word. TF-IDF is defined as:

d(i) = \text{TF-IDF}(i) = TF(W_{i}, Doc) * IDF(W_{i}) = TF(W_{i}, Doc) * \log \frac{D}{DF(W_{i})} \qquad (2)

Here TF(W_i, Doc) is the occurrence frequency of word W_i in text Doc, D is the total number of texts, and DF(W_i) is the number of texts in the collection in which W_i occurs at least once.
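Formula (2) can be illustrated with a short sketch. It assumes the texts are already segmented into token lists, takes TF as the raw occurrence count, and uses the natural logarithm; the source does not fix the logarithm base or a TF normalization, so both are assumptions, and the function name tf_idf_vectors is hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """docs: list of token lists. Returns one {term: TF-IDF weight} dict per text."""
    total = len(docs)                 # D in formula (2)
    df = Counter()                    # DF(W_i): number of texts containing the term
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)          # TF(W_i, Doc) as a raw count (assumption)
        vectors.append({t: n * math.log(total / df[t]) for t, n in tf.items()})
    return vectors
```

A term that occurs in every text gets weight 0 under this definition, because log(D/DF) is then log 1.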

4. Forming the vector space

Once the feature subset of each text has been extracted, the vector space model for the text collection can be built. In the vector space model, the text space is regarded as a vector space spanned by a set of orthogonal term vectors.

Each text is represented as a feature vector in this space:

V(d) = (t_{1}, w_{1}(d); \ldots; t_{i}, w_{i}(d); \ldots; t_{n}, w_{n}(d)) \qquad (3)

Here t_i is the i-th term of text d and w_i(d) is the weight of t_i in text d.

w_i(d) is generally defined as a function of the frequency tf_i(d) with which t_i occurs in text d. The invention uses TF-IDF values as the feature vector, that is, the computed TF-IDF value of each word is used as w_i(d) in the vector.

Once the feature vectors of the texts are available, they can be used to compute the content similarity between two (or more) texts, and the content similarity can be used to remove content-irrelevant links from the page.

There are many ways to compute content similarity from feature vectors. The invention mainly uses two. The first considers the degree of overlap between the terms contained in the two feature vectors. The text similarity is defined as:

sim(d_{i}, d_{j}) = \frac{n_{\cap}(d_{i}, d_{j})}{n_{\cup}(d_{i}, d_{j})} \qquad (4)

Here sim(d_i, d_j) is the text similarity between texts d_i and d_j, n_∩(d_i, d_j) is the number of terms shared by their feature vectors V(d_i) and V(d_j), and n_∪(d_i, d_j) is the number of all terms contained in V(d_i) and V(d_j).
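Formula (4) is the ratio of shared terms to all terms, which can be sketched with set operations. The function name overlap_similarity is hypothetical, and returning 0 when both term sets are empty is an assumption the source does not address.

```python
def overlap_similarity(terms_a, terms_b):
    """Formula (4): shared terms divided by all terms of the two feature vectors."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    # Two empty term sets are treated as unrelated (assumption).
    return len(a & b) / len(union) if union else 0.0
```

For example, the term sets {a, b, c} and {b, c, d} share two of four distinct terms, giving a similarity of 0.5.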

The second method considers the cosine of the angle between the two feature vectors. The text similarity is defined as:

sim(d_{i}, d_{j}) = \frac{V(d_{i}) \cdot V(d_{j})}{|V(d_{i})| \, |V(d_{j})|} = \frac{\sum_{m=1}^{n} w_{im} w_{jm}}{\sqrt{\left(\sum_{m=1}^{n} w_{im}^{2}\right) \left(\sum_{m=1}^{n} w_{jm}^{2}\right)}} \qquad (5)

Here sim(d_i, d_j) is the text similarity between texts d_i and d_j, V(d_i) is the feature vector of text d_i, and w_{im} is the TF-IDF value of term t_m in text d_i.
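A minimal sketch of formula (5) follows, assuming each feature vector is stored as a sparse {term: weight} dictionary so that a term missing from one vector counts as weight 0. The function name cosine_similarity and the choice to return 0 for an all-zero vector are assumptions.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Formula (5) on sparse {term: TF-IDF weight} vectors."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as unrelated (assumption)
    return dot / (norm_a * norm_b)
```

Because absent terms contribute nothing to the dot product, the two dictionaries do not need the same keys; this is equivalent to padding both vectors to a common dimension with zeros.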

When removing useless links, content similarity analysis of web documents differs from that of ordinary documents because the former has link information to draw on. In practice, a link's anchor text and the title of the target page can be used for content similarity analysis. These texts are much shorter than the page body and, to a large extent, are brief summaries of it.

Based on this idea, the invention judges progressively, in three layers of analysis. If the texts compared at the current layer are found to be topically related, the pages are considered content-related and no further computation is done; if they are found unrelated, the analysis proceeds to the next layer. Only when all three layers return an unrelated result are the pages considered content-irrelevant, and the link is removed from the list.

Layer 1: content relevance analysis using the anchor text

For a link in a page, the easiest thing to obtain is its anchor text. Anchor text is a summary of the target page written by the creator of another page, so using it in place of the target page's original content is worthwhile. Because anchor text is usually short, the computational cost of the text similarity is also low.

However, anchor text contains few words, so the vectors built from it have differing dimensions and the cosine between two vectors cannot be computed; only the first method can be used. In one sample computation, the inventors took only the ten words with the highest TF-IDF values in the parent page as the parent page's feature vector and used the first method to obtain the text similarity between the linked page and the parent page. The similarity is then compared with a threshold: if it exceeds the threshold, the pages are considered content-related; if it is below the threshold, the computation proceeds to the next layer.
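Selecting the ten highest-weighted words of the parent page, as in the sample computation above, can be sketched as a sort over the TF-IDF vector. The function name top_terms is hypothetical, and ties are broken by the dictionary's insertion order, a detail the source does not specify.

```python
def top_terms(tfidf_vector, k=10):
    """Return the k terms with the highest TF-IDF weight, highest first."""
    return sorted(tfidf_vector, key=tfidf_vector.get, reverse=True)[:k]
```

The resulting term list for the parent page can then be compared with the words of a link's anchor text using the term-overlap similarity of formula (4).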

Layer 2: content relevance analysis using the page title

Page creators generally give their page a brief title, marked with the title tag in the page's HTML document. A title usually describes the page and often contains important keywords from the body text, so it too can stand in for the body text in content similarity analysis. But the title is added by the page's own creator and is therefore subjective: some creators stuff the title with many irrelevant words to raise their search engine ranking and increase clicks. For this reason the title analysis is placed after the anchor text analysis.

The relevance computation for the title is the same as for the anchor text.

Layer 3: content relevance analysis using the page body text

Because page body text is large, this layer is analyzed only when the two previous layers have both returned a negative result. At this layer the feature vectors of the texts can be given the same dimension (a missing word can be added with its TF-IDF value set to 0), so the second method can be used to obtain the content similarity between the two pages.

If all three layers find the topics of the two (or more) pages unrelated, the link is deleted from the parent page's link list.
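The three-layer decision can be sketched as a short cascade that stops at the first layer reporting relevance. This is an illustration under stated assumptions: the function names are hypothetical, both threshold values are placeholders (the source gives no numbers), and the body-text cosine is passed in as a zero-argument callable so that the expensive third layer runs only when the first two fail.

```python
def term_overlap(terms_a, terms_b):
    """Formula (4): shared terms divided by all terms."""
    a, b = set(terms_a), set(terms_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def keep_link(parent_top_terms, anchor_terms, title_terms, body_cosine,
              overlap_threshold=0.1, cosine_threshold=0.1):
    """Return True if the link is judged relevant, False if it should be deleted.

    body_cosine is a zero-argument callable returning the formula (5)
    similarity of the two page bodies; it is evaluated only at layer 3.
    """
    # Layer 1: anchor text against the parent page's top TF-IDF terms.
    if term_overlap(parent_top_terms, anchor_terms) > overlap_threshold:
        return True
    # Layer 2: title of the target page, same computation as layer 1.
    if term_overlap(parent_top_terms, title_terms) > overlap_threshold:
        return True
    # Layer 3: body text, cosine similarity of full TF-IDF vectors.
    return body_cosine() > cosine_threshold
```

A link is deleted from the parent page's link list only when keep_link returns False, that is, when all three layers report the pages as unrelated.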

The effect of the content-relevance-based useless link filtering method is illustrated below with a series of experimental data, and the PageRank values of pages before and after removal are compared to show the improvement for link analysis algorithms.

The experimental data set used in the invention is drawn at random from CWT200g. In November 2005, a crawl found 627,036 hosts providing Web services within China. After duplicate sites and spam sites were eliminated, 88,303 sites remained. Pages were collected from these sites to a depth of 3 per site, with no limit on the data volume per site, giving the initial data set; the pages were then de-duplicated to obtain a set of unique pages. This set was sampled according to the site sizes it reflected, yielding the CWT200g test collection with a capacity of 197 GB.

From CWT200g, 1,524,077 pages in 1,421 websites were drawn at random. Incompletely saved pages were removed and only pages of type html, htm, xml, jsp, and asp were kept, leaving 1,427,001 pages as the data set for the subsequent experiments.

Table 1 gives some statistics of the experimental data set. Page out-degree is the number of links in a page; internal links are links that point to pages within the experimental data set; the number of pages with adjusted out-degree 0 is the number of pages whose out-degree is 0 after links pointing outside the data set are removed.

Statistic	Count
Number of websites	1421
Number of web pages	1427001
Average web pages per website	1004.2
Number of links	80457759
Average links per web page	56.38
Web pages with out-degree 0	65256
Internal links	27312578
Internal link ratio	33.95%
Web pages with adjusted out-degree 0	220122

Table 1
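The derived rows of Table 1 follow from the raw counts by simple division. The short check below recomputes them; the function name table1_derived is hypothetical and the inputs are the counts stated in the table.

```python
def table1_derived(sites=1421, pages=1427001, links=80457759, internal=27312578):
    """Recompute the averages and the internal link ratio reported in Table 1."""
    pages_per_site = round(pages / sites, 1)           # average web pages per website
    links_per_page = round(links / pages, 2)           # average links per web page
    internal_pct = round(100 * internal / links, 2)    # internal link ratio, percent
    return pages_per_site, links_per_page, internal_pct
```

With the table's counts this gives 1004.2 pages per website, 56.38 links per page, and an internal link ratio of 33.95%, matching the reported values.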

The invention removes useless links with the two steps described above. Because the two steps handle different types of useless links, the proportions of links they remove also differ considerably.

After the two filtering steps, the proportion of links removed at each step was measured for different types of pages. Fig. 3 shows the data after the second filtering step. As Fig. 3 shows, the first step removes a very high proportion of links, which agrees with the earlier analysis; .com pages contain many advertisement links, so this step removes the highest proportion of links from them. Education and government pages by convention do not carry many advertisement links, yet Fig. 3 shows that their removal ratio is also high. Inspection of some pages in the data set shows that in these two classes most links are concentrated on both sides of the page and are mainly local links such as "popular" or "latest". Judged by topical relevance, these links are also useless links, so from the standpoint of link analysis they too should be filtered out.

As Fig. 3 also shows, after both filtering steps the proportion of remaining effective links is essentially consistent with the proportion of effective links obtained by manual evaluation. The filter conditions set in this embodiment are fairly loose, so the proportion of effective links obtained is slightly higher than in the manual evaluation.

A search engine is in essence a recommender system and should recommend as many high-quality websites to the user as possible. A ranking algorithm should therefore avoid having many top-ranked pages come from the same website, which would seriously reduce the diversity of the recommendations. From this angle, the improvement that the useless link filtering method brings to PageRank is analyzed further.

Fig. 4 compares the websites hosting the top 100 pages after three rankings. Baseline is the ranking without any processing, First is the result after only the first filtering step, and Second is the result after both filtering steps. As Fig. 4 shows, after the first step the top 100 pages by PageRank cover 46 websites, and after the second step they cover 69. The number of websites in the top 100 rises markedly after the second step, which shows that this step surfaces more valuable websites and gives the user more valuable choices.
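The diversity measure used in Fig. 4, the number of distinct websites among the top-ranked pages, can be sketched as below. Treating a URL's host name as its website is an assumption (the source does not say how pages were mapped to sites), and the function name distinct_sites is hypothetical.

```python
from urllib.parse import urlsplit


def distinct_sites(ranked_urls, k=100):
    """Count distinct host names among the first k URLs of a ranked list."""
    hosts = {urlsplit(url).hostname for url in ranked_urls[:k]}
    hosts.discard(None)  # URLs without a host are ignored
    return len(hosts)
```

Applied to the PageRank ordering before and after each filtering step, a larger count means the top results are spread over more websites.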

The Web useless link filtering method based on content relevancy provided by the invention has been described in detail above. For a person of ordinary skill in the art, any obvious modification made without departing from the essence of the invention constitutes an infringement of the patent right of the invention and carries the corresponding legal liability.

Claims

1. A Web useless link filtering method based on content relevancy, characterized by comprising the following steps:

2. The Web useless link filtering method according to claim 1, characterized in that:

In step (1), the HTML document is first converted into a DOM tree; the minimal subtree that contains the body content and the topic-related links is then located in the DOM tree to obtain the required link information.

3. The Web useless link filtering method according to claim 2, characterized in that:

The CyberNeko HTML Parser is used to convert the HTML document into the DOM tree.

4. The Web useless link filtering method according to claim 2, characterized in that:

5. The Web useless link filtering method according to claim 4, characterized in that:

When blocking nodes are selected, table (div) and tr nodes are preferentially used as blocking nodes.

6. The Web useless link filtering method according to claim 1, characterized in that:

7. The Web useless link filtering method according to claim 6, characterized in that:

The text preprocessing comprises the following steps:

8. The Web useless link filtering method according to claim 7, characterized in that:

9. The Web useless link filtering method according to claim 7, characterized in that:

The content similarity is determined by the cosine of the angle between the feature vectors of the texts.

10. The Web useless link filtering method according to claim 1, characterized in that: