CN107103012A - Recognize method, device and the server of violated webpage - Google Patents

Info

Publication number
CN107103012A
Authority
CN
China
Prior art keywords
dimensional array
webpage
group
violated
matched
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610819394.2A
Other languages
Chinese (zh)
Inventor
阙育飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Publication of CN107103012A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2458: Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462: Approximate or statistical queries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Fuzzy Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The application provides a method, device and server for recognizing violated webpages. The method includes: determining a first two-dimensional array corresponding to the webpage text of a webpage to be matched, where the first two-dimensional array includes all words obtained by word segmentation of the webpage text and the number of times each word occurs in the webpage text; obtaining, from a sample library, multiple second two-dimensional arrays corresponding to multiple violated webpages, where each second two-dimensional array includes all words obtained by word segmentation of the corresponding violated webpage and the number of times each word occurs in that violated webpage; determining in turn the similarity value between the first two-dimensional array and each of the second two-dimensional arrays; and, if the maximum of these similarity values is greater than a first predetermined threshold, determining that the webpage to be matched is a violated webpage. The technical scheme avoids the erroneous detection results that prior-art keyword detection produces when keywords are deformed, and improves the accuracy of monitoring webpages to be matched.

Description

Method, device and server for recognizing violated webpages
Technical field
The application relates to the field of network technology, and in particular to a method, device and server for recognizing violated webpages.
Background
At present, a large number of enterprise-level users build websites on cloud servers provided by service providers. To ensure that the content of these websites complies with national policies and regulations, the service provider needs to inspect the content of the webpages to ensure that no prohibited content is present. In the prior art, keyword detection is used to recognize whether a webpage contains prohibited content. Because keywords have many deformed variants, keyword detection is easily bypassed by illegitimate users, so the accuracy of recognizing violated webpages is low.
Summary of the invention
In view of this, the application provides a new technical scheme that improves the accuracy of recognizing violated webpages.
To achieve the above object, the application provides the following technical scheme:
According to a first aspect of the application, a method for recognizing a violated webpage is provided, including:
determining a first two-dimensional array corresponding to the webpage text of a webpage to be matched, where the first two-dimensional array includes all words obtained by word segmentation of the webpage text and the number of times each word occurs in the webpage text;
obtaining, from a sample library, multiple second two-dimensional arrays corresponding to multiple violated webpages, where each of the second two-dimensional arrays includes all words obtained by word segmentation of the corresponding violated webpage and the number of times each word occurs in that violated webpage;
determining in turn the similarity value between the first two-dimensional array and each of the second two-dimensional arrays, to obtain multiple similarity values, one for each second two-dimensional array;
if the maximum of the multiple similarity values is greater than a first predetermined threshold, determining that the webpage to be matched is a violated webpage.
According to a second aspect of the application, a device for recognizing a violated webpage is provided, including:
a first determining module, configured to determine a first two-dimensional array corresponding to the webpage text of a webpage to be matched, where the first two-dimensional array includes all words obtained by word segmentation of the webpage text and the number of times each word occurs in the webpage text;
an acquisition module, configured to obtain, from a sample library, multiple second two-dimensional arrays corresponding to multiple violated webpages, where each of the second two-dimensional arrays includes all words obtained by word segmentation of the corresponding violated webpage and the number of times each word occurs in that violated webpage;
a second determining module, configured to determine in turn the similarity value between the first two-dimensional array obtained by the first determining module and each of the second two-dimensional arrays obtained by the acquisition module, to obtain multiple similarity values, one for each second two-dimensional array;
a third determining module, configured to determine that the webpage to be matched is a violated webpage if the second determining module determines that the maximum of the multiple similarity values is greater than a first predetermined threshold.
According to a third aspect of the application, a server is provided. The server includes:
a processor; and a memory for storing instructions executable by the processor;
where the processor is configured to determine a first two-dimensional array corresponding to the webpage text of a webpage to be matched, the first two-dimensional array including all words obtained by word segmentation of the webpage text and the number of times each word occurs in the webpage text;
obtain, from a sample library, multiple second two-dimensional arrays corresponding to multiple violated webpages, where each of the second two-dimensional arrays includes all words obtained by word segmentation of the corresponding violated webpage and the number of times each word occurs in that violated webpage;
determine in turn the similarity value between the first two-dimensional array and each of the second two-dimensional arrays, to obtain multiple similarity values, one for each second two-dimensional array;
and, if the maximum of the multiple similarity values is greater than a first predetermined threshold, determine that the webpage to be matched is a violated webpage.
According to a fourth aspect of the application, a method for recognizing a violated webpage is provided. The method includes:
determining a two-dimensional array to be matched corresponding to the webpage text of a webpage to be matched, where the two-dimensional array to be matched includes the participle substrings obtained by word segmentation of the webpage text and the number of times each participle substring occurs in the webpage text;
obtaining, from a sample library, multiple sample two-dimensional arrays corresponding to multiple violated webpages, where each sample two-dimensional array includes the participle substrings obtained by word segmentation of the corresponding violated webpage text and the number of times each participle substring occurs in that violated webpage text;
determining the similarity value between the two-dimensional array to be matched and at least one sample two-dimensional array, to obtain the similarity value corresponding to the at least one sample two-dimensional array.
According to a fifth aspect of the application, a device for recognizing a violated webpage is provided. The device includes:
a first determining module, configured to determine a two-dimensional array to be matched corresponding to the webpage text of a webpage to be matched, where the two-dimensional array to be matched includes the participle substrings obtained by word segmentation of the webpage text and the number of times each participle substring occurs in the webpage text;
an acquisition module, configured to obtain, from a sample library, multiple sample two-dimensional arrays corresponding to multiple violated webpages, where each sample two-dimensional array includes the participle substrings obtained by word segmentation of the corresponding violated webpage text and the number of times each participle substring occurs in that violated webpage text;
a second determining module, configured to determine the similarity value between the two-dimensional array to be matched and at least one sample two-dimensional array, to obtain the similarity value corresponding to the at least one sample two-dimensional array;
a third determining module, configured to determine that the webpage to be matched is a violated webpage if the second determining module determines that at least one similarity value is greater than a first predetermined threshold.
As can be seen from the above technical scheme, the application obtains from a sample library multiple second two-dimensional arrays corresponding to multiple violated webpages, and determines whether the webpage to be matched is a violated webpage from the similarity values between the first two-dimensional array and the second two-dimensional arrays. This avoids the erroneous detection results that prior-art keyword detection produces when keywords are deformed, and improves the accuracy of monitoring webpages to be matched.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a method for recognizing a violated webpage according to an exemplary embodiment of the present invention;
Fig. 2 is a schematic flowchart of a method for recognizing a violated webpage according to another exemplary embodiment of the present invention;
Fig. 3A is a schematic flowchart of a method for recognizing a violated webpage according to yet another exemplary embodiment of the present invention;
Fig. 3B is an architecture diagram to which a method for recognizing a violated webpage according to a further exemplary embodiment of the present invention is applicable;
Fig. 4 is a schematic flowchart of a method for recognizing a violated webpage according to a further exemplary embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a server according to an exemplary embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a device for recognizing a violated webpage according to an exemplary embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a device for recognizing a violated webpage according to another exemplary embodiment of the present invention;
Fig. 8 is a schematic structural diagram of a device for recognizing a violated webpage according to yet another exemplary embodiment of the present invention;
Fig. 9 is a schematic structural diagram of a device for recognizing a violated webpage according to a further exemplary embodiment of the present invention.
Detailed description
Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the application. Rather, they are merely examples of devices and methods consistent with some aspects of the application, as detailed in the appended claims.
The terms used in this application are for the purpose of describing particular embodiments only and are not intended to limit the application. The singular forms "a", "said" and "the" used in this application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any or all possible combinations of one or more of the associated listed items.
Although the terms first, second, third, etc. may be used in this application to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the application, first information may also be referred to as second information, and similarly second information may also be referred to as first information. Depending on context, the word "if" as used here may be interpreted as "when", "upon" or "in response to determining".
To further describe the application, the following embodiments are provided:
Fig. 1 shows a schematic flowchart of a method for recognizing a violated webpage according to an exemplary embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step 101: determine a first two-dimensional array (also referred to as the two-dimensional array to be matched) corresponding to the webpage text of a webpage to be matched. The first two-dimensional array includes all words obtained by word segmentation of the webpage text and the number of times each word occurs in the webpage text.
Step 102: obtain, from a sample library, multiple second two-dimensional arrays (also referred to as sample two-dimensional arrays) corresponding to multiple violated webpages. Each of the second two-dimensional arrays includes all words obtained by word segmentation of the corresponding violated webpage and the number of times each word occurs in that violated webpage.
Step 103: determine in turn the similarity value between the first two-dimensional array and each of the second two-dimensional arrays, to obtain one similarity value for each second two-dimensional array.
Step 104: if the maximum of the multiple similarity values is greater than a first predetermined threshold, determine that the webpage to be matched is a violated webpage.
In step 101 above, in one embodiment, the webpage text of the webpage to be matched may be segmented into words to obtain each word in the webpage text and the number of times each word occurs in it. The first two-dimensional array corresponding to the webpage text is then determined from each word and its count. The first two-dimensional array represents the content of the webpage to be matched. For example, after word segmentation of the webpage text of the webpage to be matched, the words "source code", "need", "similarity", "test", "webpage" and "sample" are obtained, and their counts in the webpage text are: source code=1, need=1, similarity=2, test=1, webpage=1, sample=3. The first two-dimensional array is then formed from "source code=1, need=1, similarity=2, test=1, webpage=1, sample=3", and this array can represent the webpage to be matched.
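The counting in step 101 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: it assumes the webpage text has already been segmented into a token list (the segmentation algorithm is left open by the application), and the function name `build_two_dimensional_array` is hypothetical. The token list mirrors the example above.

```python
from collections import Counter


def build_two_dimensional_array(tokens):
    """Return [[word, count], ...] in order of first appearance.

    `tokens` is assumed to be the output of some word-segmentation step.
    """
    counts = Counter(tokens)  # preserves first-insertion order (Python 3.7+)
    return [[word, n] for word, n in counts.items()]


# Tokens chosen to mirror the example counts in the text.
tokens = [
    "source code", "need", "similarity", "test", "similarity",
    "webpage", "sample", "sample", "sample",
]
first_array = build_two_dimensional_array(tokens)
print(first_array)
```

Each row pairs a word with its count, which is one plausible reading of the "two-dimensional array" the application describes.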
Note that a "word" obtained by word segmentation in this application is not a word in the narrow sense. It refers to a participle substring obtained by word segmentation processing, which may take the form of a single character, a word or a phrase. The specific segmentation result depends on the segmentation algorithm, which this application does not limit.
It should also be understood that the "first two-dimensional array (two-dimensional array to be matched)" in this application is information describing the features of the webpage to be matched, and more specifically information describing the webpage text to be matched. In some cases, not every word in the webpage text effectively expresses the features of the webpage. Therefore, depending on actual requirements, some auxiliary processing may be performed while obtaining the two-dimensional array by word segmentation, for example removing stop words from the webpage text or removing the "sharing ..." text at the end of the webpage text. The application does not limit the specific manner of the auxiliary processing.
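One form of the auxiliary processing mentioned above, stop-word removal, might look like the sketch below. The stop-word list and the function name are hypothetical and only illustrate the idea; the application does not prescribe either.

```python
from collections import Counter

# Hypothetical stop-word list; a real deployment would supply its own.
STOP_WORDS = {"the", "a", "of", "is"}


def build_filtered_array(tokens, stop_words=STOP_WORDS):
    """Drop stop words, then count the remaining tokens."""
    kept = [t for t in tokens if t not in stop_words]
    return [[word, n] for word, n in Counter(kept).items()]


filtered = build_filtered_array(["the", "sample", "of", "webpage", "sample"])
print(filtered)
```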
In step 102 above, the second two-dimensional array of each violated webpage may be obtained in advance in a manner similar to that described for step 101, and the multiple second two-dimensional arrays are stored in the sample library. When a new violated webpage is found, the second two-dimensional array corresponding to the new violated webpage is added to the sample library. For example, if the sample library stores the second two-dimensional arrays of 100 violated webpages, the number of second two-dimensional arrays is 100. Those skilled in the art will understand that the words and corresponding counts recorded in each second two-dimensional array may differ from array to array. The name "second two-dimensional array" serves only to distinguish these arrays from the first two-dimensional array of the webpage to be matched, so the application does not limit the content of the second two-dimensional arrays.
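A sample library of this kind could be modelled, for illustration only, as an in-memory mapping from a page identifier to its second two-dimensional array. The class name, method names and the use of page identifiers are assumptions; the application does not specify a storage format.

```python
class SampleLibrary:
    """Illustrative in-memory store of second two-dimensional arrays."""

    def __init__(self):
        self._arrays = {}  # page_id -> [[word, count], ...]

    def add(self, page_id, array):
        """Store (or update) the array of a newly found violated webpage."""
        self._arrays[page_id] = array

    def arrays(self):
        """Return all stored second two-dimensional arrays."""
        return list(self._arrays.values())

    def __len__(self):
        return len(self._arrays)


library = SampleLibrary()
library.add("page-1", [["need", 2], ["algorithm", 1]])
library.add("page-2", [["test", 4]])
print(len(library))
```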
In step 103 above, in one embodiment, the similarity between the webpage to be matched and the violated webpages recorded in the sample library may be determined by calculating the Euclidean distance or cosine distance between the first two-dimensional array and each of the second two-dimensional arrays. For example, if the sample library contains 100 violated webpages, each corresponding to one second two-dimensional array, there are 100 second two-dimensional arrays in total, and 100 similarity values are obtained after the similarity calculation.
In step 104 above, for example, if the similarity value between the first two-dimensional array of the webpage to be matched and the second two-dimensional array of the 1st violated webpage in the sample library is the maximum of the 100 similarity values, and this similarity value is greater than the first predetermined threshold, the webpage to be matched can be determined to be a violated webpage.
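The decision in step 104 reduces to comparing the largest similarity value with the first predetermined threshold. A minimal sketch follows; the threshold value 0.9 is an arbitrary assumption, since the application does not give a concrete number.

```python
def is_violated(similarities, first_threshold):
    """Step 104: violated if the maximum similarity exceeds the threshold."""
    return bool(similarities) and max(similarities) > first_threshold


# 0.9 is an assumed threshold chosen only for illustration.
print(is_violated([0.2, 0.95, 0.4], first_threshold=0.9))
print(is_violated([0.2, 0.5], first_threshold=0.9))
```

An empty list of similarity values (an empty sample library) is treated here as "not violated", which is a design choice of this sketch rather than something the text states.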
From the description of steps 103-104 it can be understood that, if the final purpose is to "judge whether the webpage to be matched is a violated webpage", it suffices to determine that the webpage to be matched is sufficiently similar to any one violated webpage sample (i.e. the similarity of the two-dimensional arrays is greater than the first predetermined threshold) in order to determine that it is a violated webpage. There is no need to determine which violated webpage sample it is most similar to. In other words, it is not necessary to calculate the similarity between the two-dimensional array to be matched and every violated webpage sample two-dimensional array, nor to calculate the maximum similarity. In practice, the similarity between the two-dimensional array to be matched and each violated webpage sample two-dimensional array can be calculated one by one, and as soon as one calculated similarity is greater than the first predetermined threshold, the corresponding webpage to be matched can be determined to be a violated webpage without calculating the remaining similarity values.
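The early-exit variant described above can be sketched as a loop that stops at the first sample exceeding the threshold. The similarity function is passed in as a parameter; `word_overlap` below is a simple stand-in (Jaccard overlap of word sets) used only to make the example runnable, not the Euclidean or cosine measure named in the application. All names and the threshold 0.8 are assumptions.

```python
def matches_any_sample(candidate, samples, similarity, first_threshold):
    """Return True as soon as one sample exceeds the threshold."""
    for sample in samples:
        if similarity(candidate, sample) > first_threshold:
            return True  # no need to compare against remaining samples
    return False


def word_overlap(a, b):
    """Stand-in similarity (Jaccard overlap of word sets), for illustration only."""
    words_a = {word for word, _ in a}
    words_b = {word for word, _ in b}
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


candidate = [["test", 1], ["webpage", 2]]
samples = [
    [["test", 3], ["webpage", 1]],  # same word set as the candidate
    [["algorithm", 1]],             # not reached because of the early exit
]
result = matches_any_sample(candidate, samples, word_overlap, first_threshold=0.8)
print(result)
```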
As can be seen from the above description, this embodiment of the present invention obtains from the sample library multiple second two-dimensional arrays corresponding to multiple violated webpages, and determines whether the webpage to be matched is a violated webpage from the similarity values between the first two-dimensional array and the second two-dimensional arrays. This avoids the erroneous detection results that prior-art keyword detection produces when keywords are deformed, and improves the accuracy of monitoring webpages to be matched.
Fig. 2 shows a schematic flowchart of a method for recognizing a violated webpage according to another exemplary embodiment of the present invention. This embodiment illustrates how to determine the similarity value between the first two-dimensional array of the webpage to be matched and the second two-dimensional array of a violated webpage in the sample library. As shown in Fig. 2, the method includes the following steps:
Step 201: determine a first parameter value from a first group of counts, and determine a second parameter value from a second group of counts.
Here, the participle substrings obtained by word segmentation of the webpage text are defined as the first group of words, and their respective counts as the first group of counts. The participle substrings of each violated webpage text among the multiple violated webpages are defined as a second group of words, and their respective counts as a second group of counts. The multiple second two-dimensional arrays (sample two-dimensional arrays) therefore correspond to multiple second groups of counts.
Step 202: determine a third group of words that appear in both the first two-dimensional array and the second two-dimensional array, and determine the third group of counts and fourth group of counts recorded for the third group of words in the first two-dimensional array and the second two-dimensional array respectively.
Step 203: determine a third parameter value from the third group of counts and the fourth group of counts.
Step 204: from the first parameter value, the second parameter value and the third parameter value, determine the similarity value between the first two-dimensional array and the corresponding second two-dimensional array based on the Euclidean distance method or the cosine distance method.
For example, suppose the first two-dimensional array of the webpage to be matched is {source code=a1, need=b1, similarity=c1, test=d1, webpage=e1, sample=f1}, and the second two-dimensional array of the 1st violated webpage in the sample library is {need=b2, similarity=c2, test=d2, webpage=e2, algorithm=m2}. This embodiment is illustrated below using these two arrays.
In step 201 above, in one embodiment, the first group of counts and the second group of counts may each be a one-dimensional array. For example, the one-dimensional array for the first group of counts is [a1, b1, c1, d1, e1, f1], and the one-dimensional array for the second group of counts is [b2, c2, d2, e2, m2].
In one embodiment, the square of the count of each word in the first group of words may be calculated to obtain multiple first square values, and the sum of the first square values gives the first parameter value. For example, summing the squares of the counts of all words of the webpage to be matched gives $\alpha = a1^2 + b1^2 + c1^2 + d1^2 + e1^2 + f1^2$. It can be understood that, besides the sum of squares, other forms of the first parameter algorithm may be used, for example a sum of cubes or a weighted sum of squares.
In one embodiment, the square of the count of each word in the second group of words may be calculated to obtain multiple second square values, and the sum of the second square values gives the second parameter value. For example, summing the squares of the counts of all words of the violated webpage gives $\beta = b2^2 + c2^2 + d2^2 + e2^2 + m2^2$. It can be understood that, besides the sum of squares, other forms of the second parameter algorithm may be used, for example a sum of cubes or a weighted sum of squares.
In step 202 above, the third group of words appearing in both the webpage text and the 1st violated webpage in the sample library is: need, similarity, test, webpage. Here "need" occurs b1 times in the webpage to be matched and b2 times in the 1st violated webpage. Likewise, the counts of "similarity", "test" and "webpage" in the webpage to be matched and in the 1st violated webpage can be collected, giving the third group of counts and the fourth group of counts. In one embodiment, the third group of counts and the fourth group of counts may each be a one-dimensional array. For example, the one-dimensional array for the third group of counts is [b1, c1, d1, e1], and the one-dimensional array for the fourth group of counts is [b2, c2, d2, e2].
In one embodiment, for each word in the third group of words, its count in the third group of counts may be multiplied by its count in the fourth group of counts, giving as many products as there are elements in the third group of counts. The products are then added to obtain the third parameter value. For example, multiplying the counts of "need", "similarity", "test" and "webpage" in the webpage to be matched by their counts in the violated webpage gives the four products b1*b2, c1*c2, d1*d2 and e1*e2, and adding them gives $\theta = b1 \cdot b2 + c1 \cdot c2 + d1 \cdot d2 + e1 \cdot e2$. It can be understood that, besides the sum of products, other forms of the third parameter algorithm may be used, for example a weighted sum of products.
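The three parameter values can be computed directly from two word-count arrays, as in the sketch below. The function name is hypothetical, and the counts for the second array are made-up numbers chosen only to give the symbols b2, c2, d2, e2, m2 concrete values; the first array reuses the counts from the step 101 example.

```python
def similarity_parameters(first_array, second_array):
    """Return (alpha, beta, theta) for two [[word, count], ...] arrays."""
    first = dict(first_array)
    second = dict(second_array)
    alpha = sum(n * n for n in first.values())    # sum of squares, first array
    beta = sum(n * n for n in second.values())    # sum of squares, second array
    shared = first.keys() & second.keys()         # third group of words
    theta = sum(first[w] * second[w] for w in shared)  # sum of products
    return alpha, beta, theta


first_array = [["source code", 1], ["need", 1], ["similarity", 2],
               ["test", 1], ["webpage", 1], ["sample", 3]]
# Hypothetical counts for the violated-webpage sample.
second_array = [["need", 2], ["similarity", 1], ["test", 1],
                ["webpage", 3], ["algorithm", 2]]

alpha, beta, theta = similarity_parameters(first_array, second_array)
print(alpha, beta, theta)  # 17 19 8
```

Here alpha = 1 + 1 + 4 + 1 + 1 + 9 = 17, beta = 4 + 1 + 1 + 9 + 4 = 19, and theta = 1*2 + 2*1 + 1*1 + 1*3 = 8.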
In step 204 above, when the similarity value is a cosine distance value, the similarity value between the webpage to be matched and the 1st violated webpage is $\theta / (\sqrt{\alpha} \cdot \sqrt{\beta})$, where sqrt denotes the square root operation. When the similarity value is a Euclidean distance value, the third group of counts and the fourth group of counts recorded for the third group of words in the first two-dimensional array and the corresponding second two-dimensional array (obtained in step 202 above) may be used: from the third group of counts and the fourth group of counts, the similarity value between the first two-dimensional array and the corresponding second two-dimensional array is determined based on the Euclidean distance calculation method, for example, [formula missing in source].
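For the cosine case, the formula itself is not legible in this text, so the sketch below assumes the standard cosine form θ / (√α · √β), which is consistent with α and β being sums of squared counts and θ being the sum of products over shared words. The function name and the zero-vector handling are assumptions.

```python
import math


def cosine_similarity(alpha, beta, theta):
    """Assumed standard cosine form: theta / (sqrt(alpha) * sqrt(beta))."""
    if alpha == 0 or beta == 0:
        return 0.0  # an empty array has no direction; treat as dissimilar
    return theta / (math.sqrt(alpha) * math.sqrt(beta))


# Example parameter values (alpha=17, beta=19, theta=8 are illustrative).
value = cosine_similarity(17, 19, 8)
print(round(value, 3))
```

Because word counts are non-negative, the result lies between 0 and 1, with 1 meaning the two count vectors point in the same direction.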
In this embodiment, the similarity value is calculated from all words obtained by word segmentation of the webpage to be matched and all words obtained by word segmentation of the violated webpage in the sample library. The similarity value therefore fully represents the similarity between the webpage to be matched and the violated webpage in the sample library. When the similarity value reaches a certain level, the webpage to be matched and the violated webpage are very similar, so whether the webpage to be matched is a violated webpage can be judged accurately from the similarity value.
Fig. 3 A show the flow signal of the method for the violated webpage of identification according to another exemplary embodiment of the present invention Figure, Fig. 3 B show the Organization Chart that the method for the violated webpage of identification in accordance with a further exemplary embodiment of the present invention is applicable; As shown in Figure 3A, comprise the following steps:
Step 301, determine the first two-dimensional array corresponding to the webpage text of the webpage to be matched; the first two-dimensional array includes all the words obtained by segmenting the webpage text and the number of times each word occurs in the webpage text.
Step 302, obtain from the sample library multiple second two-dimensional arrays corresponding to multiple violated webpages; each of the multiple second two-dimensional arrays includes all the words obtained by segmenting the corresponding violated webpage and the number of times each word occurs in that violated webpage.
Step 303, determine in turn the similarity value between the first two-dimensional array and each of the multiple second two-dimensional arrays, obtaining multiple similarity values, one for each second two-dimensional array.
Step 304, determine whether the maximum similarity value among the multiple similarity values is greater than the first predetermined threshold; if the maximum similarity value is greater than the first predetermined threshold, perform step 305; if the maximum similarity value is less than the first predetermined threshold, perform step 306.
Step 305, if the maximum similarity value among the multiple similarity values is greater than the first predetermined threshold, determine that the webpage to be matched is a violated webpage; the flow ends.
Step 306, if the maximum similarity value among the multiple similarity values is less than the first predetermined threshold, determine whether the maximum similarity value is greater than the second predetermined threshold, where the second predetermined threshold is less than the first predetermined threshold; if the maximum similarity value is greater than the second predetermined threshold, perform step 307; if the maximum similarity value is less than the second predetermined threshold, perform step 309.
Step 307, if the maximum similarity value among the multiple similarity values is greater than the second predetermined threshold, determine that the webpage to be matched is a suspected violated webpage.
Step 308, add the webpage to be matched to the sample library; the flow ends.
Step 309, if the maximum similarity value among the multiple similarity values is less than the second predetermined threshold, determine that the webpage to be matched is a normal webpage; the flow ends.
For steps 301 to 303 and step 305, refer to the related description of the embodiment shown in Fig. 1 above; the details are not repeated here.
In step 304 above, in one embodiment, the first predetermined threshold can be set to an initial value and later updated according to the accuracy with which webpages to be matched are monitored.
In step 306 above, the second predetermined threshold can be set in the same way as the first predetermined threshold; the details are not repeated here.
As shown in Figure 3 B, for example, in 100 violated webpages in Sample Storehouse, webpage to be matched and 100 violated webpages pair There should be 100 Similarity values, if the Similarity value between webpage to be matched and the 1st violated webpage is in 100 similarities It is maximum in value, when the Similarity value between webpage to be matched and the 1st violated webpage is more than the first predetermined threshold value, you can pass through Step 305 determines that webpage to be matched is violated webpage.
If the Similarity value between the webpage to be matched and the 1st violated webpage is less than the first predetermined threshold value, in order to keep away Exempt from due to causing webpage detection mistake to be matched when the sample size of the violated webpage in Sample Storehouse is not big enough, can be pre- by second If the threshold value magnitude relationship for being judged, determining the Similarity value and the second predetermined threshold value further to the Similarity value, if The Similarity value is less than the second predetermined threshold value, then the webpage to be matched can be considered as into normal webpage, if the Similarity value is located at Between first predetermined threshold value and the second predetermined threshold value, then it is considered as the webpage to be matched for doubtful violated webpage, can be by entering one The mode of step manual examination and verification determines whether the webpage to be matched is violated webpage, if violated webpage, then by the net to be matched Web update is into Sample Storehouse, so that the type of the violated webpage in abundant Sample Storehouse.
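The two-threshold decision of steps 304 to 309 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `classify_page`, the label strings and the example threshold values are hypothetical, and treating a similarity value exactly equal to a threshold as not exceeding it is an assumption, since the text only covers the "greater than" and "less than" cases.

```python
from typing import Sequence


def classify_page(similarities: Sequence[float],
                  first_threshold: float,
                  second_threshold: float) -> str:
    """Classify a webpage to be matched from its similarity values.

    `similarities` holds one similarity value per violated webpage in the
    sample library; only the maximum value is compared with the thresholds.
    """
    if second_threshold >= first_threshold:
        raise ValueError("the second threshold must be less than the first")
    if not similarities:
        # Assumption: with an empty sample library nothing can be matched.
        return "normal"

    best = max(similarities)
    if best > first_threshold:
        return "violated"    # step 305
    if best > second_threshold:
        return "suspected"   # step 307; a candidate for the sample library
    return "normal"          # step 309


# Hypothetical thresholds chosen only for the example.
print(classify_page([0.12, 0.95, 0.40], 0.9, 0.6))
```

A caller that receives "suspected" could then route the page to manual review before adding it to the sample library, as the example above describes.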
The violated webpages in the sample library can be collected manually at first, and the webpage text of each violated webpage can be extracted in the manner of the embodiment shown in Fig. 4 of this application; afterwards, the sample library can be updated in the manner of this embodiment.
In addition to the advantageous effects of the above embodiment, in this embodiment, if the similarity value lies between the first predetermined threshold and the second predetermined threshold, the webpage to be matched is added to the sample library, so that the sample library is enriched automatically and later monitoring of webpages to be matched becomes more accurate.
Fig. 4 is a schematic flowchart of a method for identifying a violated webpage according to a further exemplary embodiment of the present invention. As shown in Fig. 4, the method comprises the following steps:
Step 401, preprocess the content of the webpage to be matched.
Step 402, determine the start line and the end line of the preprocessed webpage to be matched.
Step 403, when the distance between the start line and the end line is greater than a set threshold, determine that the content between the start line and the end line is the webpage text.
In step 401 above, characters unrelated to the webpage text, such as js/css/html and comments, can be filtered out while line breaks are retained.
In step 402 above, the preprocessed webpage content is read line by line, and a suitable threshold is set according to the webpage type. If the number of text characters in the current line exceeds this threshold and the line after the current line is still text, the current line can be set as the start line. Reading continues from the start line until a line is found whose length is 0 and after which the text length is also 0; this ensures that no other text block follows that line, and the line is marked as the end line.
In step 403 above, the distance between the start line and the end line can be compared with the set threshold; if the distance is greater than the set threshold, the content between the start line and the end line can be determined to be the webpage text.
In addition, the text identified as the webpage text can be denoised; for example, "Share ..." at the end of the webpage text can be filtered out, ensuring the accuracy of the extracted webpage text.
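Steps 401 to 403 can be sketched roughly as below. This is an illustrative reading under stated assumptions, not the applicant's code: the function name `extract_body_text`, the regular expressions and the default values of `line_threshold` and `span_threshold` are hypothetical, and "the text length after the line is also 0" is interpreted here as "the next line is also empty". The denoising step is not covered.

```python
import re


def _keep_newlines(match):
    """Replace a removed span with only the line breaks it contained."""
    return "\n" * match.group(0).count("\n")


def extract_body_text(html: str, line_threshold: int = 20,
                      span_threshold: int = 2) -> str:
    # Step 401: filter out script/style blocks, comments and tags,
    # retaining the line breaks of the page.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", _keep_newlines, html)
    text = re.sub(r"(?s)<!--.*?-->", _keep_newlines, text)
    text = re.sub(r"<[^>]*>", _keep_newlines, text)
    lines = [line.strip() for line in text.split("\n")]

    # Step 402 (start line): first line longer than the threshold whose
    # following line still contains text.
    start = None
    for i in range(len(lines) - 1):
        if len(lines[i]) > line_threshold and lines[i + 1]:
            start = i
            break
    if start is None:
        return ""

    # Step 402 (end line): first empty line that is also followed by an
    # empty line (or by the end of the page).
    end = len(lines)
    for j in range(start, len(lines)):
        next_is_empty = j + 1 >= len(lines) or not lines[j + 1]
        if not lines[j] and next_is_empty:
            end = j
            break

    # Step 403: accept the block only if it spans enough lines.
    if end - start > span_threshold:
        return "\n".join(lines[start:end]).strip()
    return ""
```

The thresholds would in practice depend on the webpage type, as step 402 notes; the defaults here exist only so that the sketch runs.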
Those skilled in the art will understand that, for the violated webpages in the sample library of this application, the method of this embodiment can likewise be used to obtain the webpage text of each violated webpage, thereby improving the accuracy of webpage identification.
In this embodiment, extracting the webpage text of the webpage to be matched makes the words in the webpage text more targeted than the keyword matching over the whole webpage used in the prior art. At the same time, because content unrelated to the webpage text has been filtered out, irrelevant content in the sidebar of the webpage to be matched does not interfere with the webpage text, so that identification of the webpage to be matched has a more accurate hit rate.
Corresponding to the method for the above-mentioned violated webpage of identification, the application also proposed shown in Fig. 5 according to the one of the present invention The schematic configuration diagram of the server of exemplary embodiment.Fig. 5 is refer to, in hardware view, the server includes processor, inside Bus, network interface, internal memory and nonvolatile memory, are also possible that the hardware required for other business certainly.Processing Device reads corresponding computer program into internal memory and then run from nonvolatile memory, and identification is formed on logic level The device of violated webpage.Certainly, in addition to software realization mode, the application is not precluded from other implementations, such as logic Mode of device or software and hardware combining etc., that is to say, that the executive agent of following handling process is not limited to each logic Unit or hardware or logical device.
Fig. 6 is a schematic structural diagram of a device for identifying a violated webpage according to an exemplary embodiment of the present invention. As shown in Fig. 6, the device for identifying a violated webpage can include a first determining module 61, an acquisition module 62, a second determining module 63 and a third determining module 64, wherein:
The first determining module 61 is configured to determine the first two-dimensional array corresponding to the webpage text of the webpage to be matched; the first two-dimensional array includes all the words obtained by segmenting the webpage text and the number of times each word occurs in the webpage text;
The acquisition module 62 is configured to obtain from the sample library multiple second two-dimensional arrays corresponding to multiple violated webpages; each of the multiple second two-dimensional arrays includes all the words obtained by segmenting the corresponding violated webpage and the number of times each word occurs in that violated webpage;
The second determining module 63 is configured to determine in turn the similarity value between the first two-dimensional array obtained by the first determining module 61 and each of the multiple second two-dimensional arrays obtained by the acquisition module 62, obtaining multiple similarity values, one for each second two-dimensional array;
The third determining module 64 is configured to determine that the webpage to be matched is a violated webpage if the second determining module 63 determines that the maximum similarity value among the multiple similarity values is greater than the first predetermined threshold.
According to another specific embodiment herein, the first determining module 61, the acquisition module 62, the second determining module 63 and the third determining module 64 can also be configured with other functions, wherein:
The first determining module 61 is configured to determine the two-dimensional array to be matched corresponding to the webpage text of the webpage to be matched; the two-dimensional array to be matched includes the segmented substrings obtained by segmenting the webpage text and the number of times each segmented substring occurs in the webpage text;
The acquisition module 62 is configured to obtain from the sample library multiple sample two-dimensional arrays corresponding to multiple violated webpages; each of the multiple sample two-dimensional arrays includes the segmented substrings obtained by segmenting the webpage text of the corresponding violated webpage and the number of times each segmented substring occurs in that webpage text;
The second determining module 63 is configured to determine the similarity value between the two-dimensional array to be matched obtained by the first determining module 61 and at least one sample two-dimensional array obtained by the acquisition module 62, obtaining the similarity value corresponding to the at least one sample two-dimensional array;
The third determining module 64 is configured to determine that the webpage to be matched is a violated webpage if at least one similarity value is greater than the first predetermined threshold.
Fig. 7 is a schematic structural diagram of a device for identifying a violated webpage according to another exemplary embodiment of the present invention. As shown in Fig. 7, on the basis of the embodiment shown in Fig. 6 above, the counts corresponding to all the words of the webpage text are defined as the first group of counts, the counts corresponding to all the words of each of the multiple violated webpages are defined as a second group of counts, and the multiple second two-dimensional arrays correspond to multiple second groups of counts. The second determining module 63 may include:
The first determining unit 631, configured to determine the first parameter value according to the first group of counts and to determine the second parameter value according to the second group of counts;
The second determining unit 632, configured to determine the third group of words that appear in both the first two-dimensional array and the second two-dimensional array, and to determine the third group of counts and the fourth group of counts that the third group of words record in the first two-dimensional array and in the second two-dimensional array, respectively;
The third determining unit 633, configured to determine the third parameter value according to the third group of counts and the fourth group of counts determined by the second determining unit 632;
The fourth determining unit 634, configured to determine, based on the cosine distance calculation method, the similarity value between the first two-dimensional array and the corresponding second two-dimensional array according to the first parameter value determined by the first determining unit 631, the second parameter value determined by the second determining unit 632 and the third parameter value determined by the third determining unit 633.
According to another specific embodiment herein, the first determining unit 631, the second determining unit 632, the third determining unit 633 and the fourth determining unit 634 can also be configured with other functions, wherein:
The first determining unit 631 is configured to determine the first parameter value from the first group of counts using a preset first parameter algorithm, and to determine the second parameter value from the second group of counts using a preset second parameter algorithm;
The second determining unit 632 is configured to determine the third group of segmented substrings that appear in both the two-dimensional array to be matched and the sample two-dimensional array, and to determine the third group of counts and the fourth group of counts that the third group of segmented substrings record in the two-dimensional array to be matched and in the sample two-dimensional array, respectively;
The third determining unit 633 is configured to determine the third parameter value, using a preset third parameter algorithm, from the third group of counts and the fourth group of counts determined by the second determining unit 632;
The fourth determining unit 634 is configured to determine, based on the cosine distance calculation method, the similarity value between the two-dimensional array to be matched and the corresponding sample two-dimensional array according to the first parameter value determined by the first determining unit 631, the second parameter value determined by the second determining unit 632 and the third parameter value determined by the third determining unit 633.
In one embodiment, the first determining unit 631 may include:
The first computation subunit 6311, configured to calculate the square of the count corresponding to each word in the first group of words, obtaining multiple first square values;
The first addition subunit 6312, configured to calculate the sum of the multiple first square values obtained by the first computation subunit 6311, obtaining the first parameter value.
In one embodiment, the second determining unit 632 may include:
The second computation subunit 6321, configured to calculate the square of the count corresponding to each word in the second group of words, obtaining multiple second square values;
The second addition subunit 6322, configured to calculate the sum of the multiple second square values obtained by the second computation subunit 6321, obtaining the second parameter value.
In one embodiment, the third determining unit 633 may include:
The third computation subunit 6331, configured to multiply the count of each word of the third group of words in the third group of counts by the count of that word in the fourth group of counts, obtaining multiple calculation results whose number corresponds to the number of counts in the third group of counts;
The third addition subunit 6332, configured to add the multiple calculation results obtained by the third computation subunit 6331, obtaining the third parameter value.
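The work of the computation and addition subunits can be sketched as a single function. This is a minimal illustration under assumptions: the name `cosine_similarity`, the use of a word-to-count mapping to stand for each two-dimensional array, and the choice to return 0.0 when either array is empty are all introduced here and are not part of the original text.

```python
import math
from typing import Mapping


def cosine_similarity(page_counts: Mapping[str, int],
                      sample_counts: Mapping[str, int]) -> float:
    """Cosine similarity between two word-count arrays."""
    # First parameter value: sum of the squared counts of the webpage to be matched.
    first_parameter = sum(n * n for n in page_counts.values())
    # Second parameter value: sum of the squared counts of the violated webpage.
    second_parameter = sum(n * n for n in sample_counts.values())
    # Third parameter value: sum of the products of the counts of the words
    # that appear in both arrays (the third group of words).
    common_words = page_counts.keys() & sample_counts.keys()
    third_parameter = sum(page_counts[w] * sample_counts[w] for w in common_words)

    if first_parameter == 0 or second_parameter == 0:
        return 0.0  # assumption: an empty array is similar to nothing
    return third_parameter / (math.sqrt(first_parameter) * math.sqrt(second_parameter))
```

Because counts are never negative, the result lies between 0 and 1, which makes it straightforward to compare against fixed thresholds.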
In one embodiment, the second determining module 63 may include:
The first determining unit 635, configured to determine the third group of words that appear in both the first two-dimensional array and the second two-dimensional array, and to determine the third group of counts and the fourth group of counts that the third group of words record in the first two-dimensional array and in the second two-dimensional array, respectively;
The second determining unit 636, configured to determine, based on the Euclidean distance calculation method, the similarity value between the first two-dimensional array and the corresponding second two-dimensional array according to the third group of counts and the fourth group of counts determined by the first determining unit 635.
In this embodiment, the second determining unit 632 and the first determining unit 635 can be merged into the same functional module.
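A Euclidean variant can be sketched in the same style. The text does not spell out the formula, so this is one possible reading: the distance is taken over the third and fourth groups of counts only, that is, over the words common to both arrays. The function name `euclidean_distance` and the mapping representation are hypothetical. Note that a distance grows as pages become less alike, so how it is compared with the thresholds is left to the caller.

```python
import math
from typing import Mapping


def euclidean_distance(page_counts: Mapping[str, int],
                       sample_counts: Mapping[str, int]) -> float:
    """Euclidean distance over the counts of the words common to both arrays."""
    # Third group of words: words that appear in both arrays.
    common_words = page_counts.keys() & sample_counts.keys()
    # Third and fourth groups of counts are read from each array per common word.
    squared_differences = (
        (page_counts[w] - sample_counts[w]) ** 2 for w in common_words
    )
    # Caveat: with no common words the sum is empty and the result is 0.0,
    # so a caller should treat that case separately.
    return math.sqrt(sum(squared_differences))
```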
Fig. 8 is a schematic structural diagram of a device for identifying a violated webpage according to another exemplary embodiment of the present invention. As shown in Fig. 8, on the basis of the embodiment shown in Fig. 6 above, in one embodiment the device may also include:
The fourth determining module 64, configured to determine, if the maximum similarity value among the multiple similarity values obtained by the second determining module 63 is less than the first predetermined threshold, whether the maximum similarity value among the multiple similarity values is greater than the second predetermined threshold, where the second predetermined threshold is less than the first predetermined threshold;
The fifth determining module 65, configured to determine that the webpage to be matched is a suspected violated webpage if the fourth determining module 64 determines that the maximum similarity value among the multiple similarity values is greater than the second predetermined threshold;
The adding module 66, configured to add the webpage to be matched to the sample library;
The sixth determining module 67, configured to determine that the webpage to be matched is a normal webpage if the fourth determining module 64 determines that the maximum similarity value among the multiple similarity values is less than the second predetermined threshold.
Fig. 9 is a schematic structural diagram of a device for identifying webpage content according to a further exemplary embodiment of the present invention. As shown in Fig. 9, on the basis of the embodiment shown in Fig. 6 above, the first determining module 61 may include:
The word segmentation unit 611, configured to segment the webpage text of the webpage to be matched, obtaining each word in the webpage text and the number of times each word occurs in the webpage text;
The fifth determining unit 612, configured to determine the first two-dimensional array corresponding to the webpage text from each word obtained by the word segmentation unit 611 and the count corresponding to each word; the first two-dimensional array is used to represent the webpage content of the webpage to be matched.
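Building the first two-dimensional array from the webpage text can be sketched as below. The function name `build_word_count_array` is hypothetical, and the regular-expression tokenizer is only a stand-in for a real word segmenter (Chinese text in particular needs a dedicated segmentation tool); lowercasing the text is likewise an assumption made for the example.

```python
import re
from collections import Counter
from typing import List


def build_word_count_array(body_text: str) -> List[list]:
    """Return [[word, count], ...] for the given webpage text."""
    # Stand-in segmentation: split on runs of word characters.
    words = re.findall(r"\w+", body_text.lower())
    # Count how many times each word occurs in the webpage text.
    counts = Counter(words)
    # Each row pairs a word with its count, in order of first appearance.
    return [[word, count] for word, count in counts.items()]


print(build_word_count_array("Buy cheap pills, buy now"))
```

The same function could be applied to the webpage text of each violated webpage to produce the second two-dimensional arrays stored in the sample library.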
In one embodiment, the device may also include:
The preprocessing module 68, configured to preprocess the content of the webpage to be matched;
The seventh determining module 69, configured to determine the start line and the end line of the webpage to be matched after preprocessing by the preprocessing module 68;
The eighth determining module 60, configured to determine that the content between the start line and the end line is the webpage text when the seventh determining module 69 determines that the distance between the start line and the end line is greater than the set threshold; the first determining module 61 determines the first two-dimensional array corresponding to the webpage text obtained by the eighth determining module 60.
As can be seen from the above embodiments, this application obtains the multiple second two-dimensional arrays corresponding to the multiple violated webpages in the sample library and uses the similarity values between the first two-dimensional array and the multiple second two-dimensional arrays to determine whether the webpage to be matched is a violated webpage. This avoids the erroneous detection results that prior-art keyword detection produces when keywords are deformed, and improves the accuracy of monitoring the webpage to be matched.
Those skilled in the art will readily conceive of other embodiments of this application after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses or adaptations of this application that follow its general principles and include common knowledge or customary technical means in the art not disclosed in this application. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of this application indicated by the following claims.
It should also be noted that the terms "comprise", "include" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, commodity or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, commodity or device that includes the element.
The foregoing is only the preferred embodiments of this application and is not intended to limit this application. Any modification, equivalent substitution or improvement made within the spirit and principles of this application shall fall within the scope of protection of this application.

Claims (20)

1. A method for identifying a violated webpage, characterized in that the method comprises:
determining a first two-dimensional array corresponding to the webpage text of a webpage to be matched, the first two-dimensional array including all the words obtained by segmenting the webpage text and the number of times each word occurs in the webpage text;
obtaining from a sample library multiple second two-dimensional arrays corresponding to multiple violated webpages, each of the multiple second two-dimensional arrays including all the words obtained by segmenting the corresponding violated webpage and the number of times each word occurs in that violated webpage;
determining in turn the similarity value between the first two-dimensional array and each of the multiple second two-dimensional arrays, obtaining multiple similarity values, one for each of the multiple second two-dimensional arrays;
if the maximum similarity value among the multiple similarity values is greater than a first predetermined threshold, determining that the webpage to be matched is a violated webpage.
2. The method according to claim 1, characterized in that the counts corresponding to all the words of the webpage text are defined as a first group of counts, the counts corresponding to all the words of each of the multiple violated webpages are defined as a second group of counts, and the multiple second two-dimensional arrays correspond to multiple second groups of counts; the determining in turn the similarity value between the first two-dimensional array and each of the multiple second two-dimensional arrays comprises:
determining a first parameter value according to the first group of counts, and determining a second parameter value according to the second group of counts;
determining a third group of words that appear in both the first two-dimensional array and the second two-dimensional array, and determining a third group of counts and a fourth group of counts that the third group of words record in the first two-dimensional array and in the second two-dimensional array, respectively;
determining a third parameter value according to the third group of counts and the fourth group of counts;
determining, based on the cosine distance calculation method, the similarity value between the first two-dimensional array and the corresponding second two-dimensional array according to the first parameter value, the second parameter value and the third parameter value.
3. The method according to claim 2, characterized in that the determining a first parameter value according to the first group of counts comprises:
calculating the square of the count corresponding to each word in the first group of words, obtaining multiple first square values;
calculating the sum of the multiple first square values, obtaining the first parameter value.
4. The method according to claim 2, characterized in that the determining a second parameter value according to the second group of counts comprises:
calculating the square of the count corresponding to each word in the second group of words, obtaining multiple second square values;
calculating the sum of the multiple second square values, obtaining the second parameter value.
5. The method according to claim 2, characterized in that the determining a third parameter value according to the third group of counts and the fourth group of counts comprises:
multiplying the count of each word of the third group of words in the third group of counts by the count of that word in the fourth group of counts, obtaining multiple calculation results whose number corresponds to the number of elements in the third group of counts;
adding the multiple calculation results, obtaining the third parameter value.
6. The method according to claim 2, characterized in that the determining in turn the similarity value between the first two-dimensional array and each of the multiple second two-dimensional arrays comprises:
determining a third group of words that appear in both the first two-dimensional array and the second two-dimensional array, and determining a third group of counts and a fourth group of counts that the third group of words record in the first two-dimensional array and in the second two-dimensional array, respectively;
determining, based on the Euclidean distance calculation method, the similarity value between the first two-dimensional array and the corresponding second two-dimensional array according to the third group of counts and the fourth group of counts.
7. The method according to claim 1, characterized in that the method further comprises:
if the maximum similarity value among the multiple similarity values is less than the first predetermined threshold, determining whether the maximum similarity value among the multiple similarity values is greater than a second predetermined threshold, wherein the second predetermined threshold is less than the first predetermined threshold;
if the maximum similarity value among the multiple similarity values is greater than the second predetermined threshold, determining that the webpage to be matched is a suspected violated webpage;
adding the webpage to be matched to the sample library;
if the maximum similarity value among the multiple similarity values is less than the second predetermined threshold, determining that the webpage to be matched is a normal webpage.
8. The method according to claim 1, characterized in that the determining a first two-dimensional array corresponding to the webpage text of a webpage to be matched comprises:
segmenting the webpage text of the webpage to be matched, obtaining each word in the webpage text and the number of times each word occurs in the webpage text;
determining the first two-dimensional array corresponding to the webpage text from each word and the count corresponding to each word, the first two-dimensional array being used to represent the webpage content of the webpage to be matched.
9. The method according to claim 1, characterized in that the method for determining the webpage text of the webpage to be matched comprises:
preprocessing the content of the webpage to be matched;
determining the start line and the end line of the preprocessed webpage to be matched;
when the distance between the start line and the end line is greater than a set threshold, determining that the content between the start line and the end line is the webpage text.
10. A device for identifying a violated webpage, characterized in that the device comprises:
a first determining module, configured to determine a first two-dimensional array corresponding to the webpage text of a webpage to be matched, the first two-dimensional array including all the words obtained by segmenting the webpage text and the number of times each word occurs in the webpage text;
an acquisition module, configured to obtain from a sample library multiple second two-dimensional arrays corresponding to multiple violated webpages, each of the multiple second two-dimensional arrays including all the words obtained by segmenting the corresponding violated webpage and the number of times each word occurs in that violated webpage;
a second determining module, configured to determine in turn the similarity value between the first two-dimensional array obtained by the first determining module and each of the multiple second two-dimensional arrays obtained by the acquisition module, obtaining multiple similarity values, one for each of the multiple second two-dimensional arrays;
a third determining module, configured to determine that the webpage to be matched is a violated webpage if the second determining module determines that the maximum similarity value among the multiple similarity values is greater than a first predetermined threshold.
11. The device according to claim 10, characterized in that the counts corresponding to all words of the webpage text are defined as a first group of counts, the counts corresponding to all words of each violated webpage among the multiple violated webpages are defined as a second group of counts, and the multiple second two-dimensional arrays correspond to multiple second groups of counts; the second determining module includes:
A first determining unit, configured to determine a first parameter value according to the first group of counts, and to determine a second parameter value according to the second group of counts;
A second determining unit, configured to determine a third group of words that appear in both the first two-dimensional array and the second two-dimensional array, and to determine a third group of counts and a fourth group of counts recorded for the third group of words in the first two-dimensional array and in the second two-dimensional array, respectively;
A third determining unit, configured to determine a third parameter value according to the third group of counts and the fourth group of counts determined by the second determining unit;
A fourth determining unit, configured to determine, based on a cosine distance calculation method, the similarity value between the first two-dimensional array and the corresponding second two-dimensional array according to the first parameter value and the second parameter value determined by the first determining unit and the third parameter value determined by the third determining unit.
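The claim names three parameter values and a cosine distance method but does not give the formulas. The following Python sketch shows one plausible reading, under the assumption that the first and second parameter values are the Euclidean norms of the two count vectors and that the third parameter value is the dot product over the shared words. The function name `cosine_similarity` and the list-of-pairs array layout are hypothetical and are not taken from the patent.

```python
import math


def cosine_similarity(first_array, second_array):
    """Cosine similarity of two word-count arrays given as (word, count) pairs."""
    first = dict(first_array)
    second = dict(second_array)
    # Assumed first and second parameter values: norms of the two count vectors.
    p1 = math.sqrt(sum(c * c for c in first.values()))
    p2 = math.sqrt(sum(c * c for c in second.values()))
    if p1 == 0 or p2 == 0:
        return 0.0
    # Third group of words: those that appear in both arrays.
    shared = first.keys() & second.keys()
    # Assumed third parameter value: dot product of the shared counts.
    p3 = sum(first[w] * second[w] for w in shared)
    return p3 / (p1 * p2)
```

Words that occur in only one array contribute to the norms but not to the dot product, which is why only the shared words need to be paired up.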
12. The device according to claim 10, characterized in that the second determining module includes:
A first determining unit, configured to determine a third group of words that appear in both the first two-dimensional array and the second two-dimensional array, and to determine a third group of counts and a fourth group of counts recorded for the third group of words in the first two-dimensional array and in the second two-dimensional array, respectively;
A second determining unit, configured to determine, based on a Euclidean distance calculation method, the similarity value between the first two-dimensional array and the corresponding second two-dimensional array according to the third group of counts and the fourth group of counts determined by the first determining unit.
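The Euclidean variant can be sketched in the same way. The claim does not say how a distance is turned into a similarity value, so the mapping `1 / (1 + d)` used here is an assumption chosen only so that identical counts give 1 and larger distances give smaller values. The name `euclidean_similarity` is hypothetical.

```python
import math


def euclidean_similarity(first_array, second_array):
    """Similarity from the Euclidean distance over words shared by both arrays."""
    first = dict(first_array)
    second = dict(second_array)
    # Third group of words: those that appear in both arrays.
    shared = first.keys() & second.keys()
    if not shared:
        return 0.0
    # Distance between the third and fourth groups of counts.
    distance = math.sqrt(sum((first[w] - second[w]) ** 2 for w in shared))
    # Assumed mapping from distance to similarity (not specified in the claim).
    return 1.0 / (1.0 + distance)
```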
13. The device according to claim 10, characterized in that the device further includes:
A fourth determining module, configured to determine, if the largest of the multiple similarity values obtained by the first determining module is less than the first preset threshold, whether the largest of the multiple similarity values is greater than a second preset threshold, wherein the second preset threshold is less than the first preset threshold;
A fifth determining module, configured to determine that the webpage to be matched is a suspected violated webpage if the fourth determining module determines that the largest of the multiple similarity values is greater than the second preset threshold;
An adding module, configured to add the webpage to be matched to the sample library;
A sixth determining module, configured to determine that the webpage to be matched is a normal webpage if the fourth determining module determines that the largest of the multiple similarity values is less than the second preset threshold.
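The two-threshold decision can be sketched as a small function. The labels, the function name `classify_page` and the handling of values exactly equal to a threshold are assumptions, since the claims only cover the strictly greater and strictly smaller cases.

```python
def classify_page(similarities, first_threshold, second_threshold):
    """Label a page from its similarity values against the sample arrays."""
    if second_threshold >= first_threshold:
        raise ValueError("second threshold must be less than first threshold")
    # Only the largest similarity value is compared with the thresholds.
    best = max(similarities, default=0.0)
    if best > first_threshold:
        return "violated"
    if best > second_threshold:
        return "suspected"
    # Assumption: a value equal to the second threshold is treated as normal.
    return "normal"
```

For instance, with hypothetical thresholds of 0.8 and 0.5, a page whose best match scores 0.6 would be labelled as suspected rather than violated.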
14. The device according to claim 10, characterized in that the first determining module includes:
A word segmentation unit, configured to perform word segmentation on the webpage text of the webpage to be matched to obtain each word in the webpage text and the number of times each word occurs in the webpage text;
A fifth determining unit, configured to determine the first two-dimensional array corresponding to the webpage text from each word obtained by the word segmentation unit and the count corresponding to each word, the first two-dimensional array being used to represent the webpage content of the webpage to be matched.
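As an illustration of how such an array might be built, the sketch below counts words and returns one `[word, count]` row per distinct word. The regular-expression tokenizer is only a stand-in for a real word segmenter (Chinese text would need a dedicated segmentation tool), and the name `build_word_count_array` is hypothetical.

```python
import re
from collections import Counter


def build_word_count_array(webpage_text):
    """Return a two-dimensional array with one [word, count] row per word."""
    # Stand-in segmentation: lowercase runs of word characters.
    words = re.findall(r"\w+", webpage_text.lower())
    counts = Counter(words)
    return [[word, count] for word, count in counts.items()]
```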
15. The device according to claim 10, characterized in that the device further includes:
A preprocessing module, configured to preprocess the content of the webpage to be matched;
A seventh determining module, configured to determine a start line and an end line of the webpage to be matched after preprocessing by the preprocessing module;
An eighth determining module, configured to determine that the content between the start line and the end line is the webpage text when the distance between the start line and the end line determined by the seventh determining module is greater than a set threshold.
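The claim leaves both the preprocessing and the way the start and end lines are located unspecified. The sketch below is one possible reading under stated assumptions: preprocessing strips script blocks, style blocks and tags, and the start and end lines are the first and last lines with at least `min_line_chars` characters. The function name, both parameters and their default values are hypothetical.

```python
import re


def extract_webpage_text(html, min_line_chars=20, min_span=3):
    """Return the text between the start and end lines, or None if too short."""
    # Assumed preprocessing: drop script/style blocks, then all remaining tags.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    text = re.sub(r"(?s)<[^>]+>", "", text)
    lines = [line.strip() for line in text.splitlines()]
    # Assumed rule: long lines mark where the body text starts and ends.
    long_rows = [i for i, line in enumerate(lines) if len(line) >= min_line_chars]
    if not long_rows:
        return None
    start, end = long_rows[0], long_rows[-1]
    # Accept the span only when the line distance exceeds the set threshold.
    if end - start > min_span:
        return "\n".join(line for line in lines[start:end + 1] if line)
    return None
```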
16. A server, characterized in that the server includes:
A processor; and a memory for storing instructions executable by the processor;
Wherein the processor is configured to: determine a first two-dimensional array corresponding to the webpage text of a webpage to be matched, the first two-dimensional array including all words obtained by word segmentation of the webpage text and the number of times each word occurs in the webpage text;
Obtain, from a sample library, multiple second two-dimensional arrays corresponding to multiple violated webpages, each of the second two-dimensional arrays including all words obtained by word segmentation of the corresponding violated webpage and the number of times each word occurs in that violated webpage;
Determine, in turn, the similarity value between the first two-dimensional array and each of the multiple second two-dimensional arrays, thereby obtaining multiple similarity values, one for each second two-dimensional array;
If the largest of the multiple similarity values is greater than a first preset threshold, determine that the webpage to be matched is a violated webpage.
17. A method for recognizing a violated webpage, characterized in that the method includes:
Determining a two-dimensional array to be matched that corresponds to the webpage text of a webpage to be matched, the two-dimensional array to be matched including the segmented substrings obtained by word segmentation of the webpage text and the number of times each segmented substring occurs in the webpage text;
Obtaining, from a sample library, multiple sample two-dimensional arrays corresponding to multiple violated webpages, each of the sample two-dimensional arrays including the segmented substrings obtained by word segmentation of the corresponding violated webpage text and the number of times each segmented substring occurs in that violated webpage text;
Determining the similarity value between the two-dimensional array to be matched and at least one sample two-dimensional array, thereby obtaining the similarity value corresponding to the at least one sample two-dimensional array.
18. The method according to claim 17, characterized in that the segmented substrings of the webpage text are defined as a first group of segmented substrings and their respective counts as a first group of counts; the segmented substrings of the text of each violated webpage among the multiple violated webpages are defined as a second group of segmented substrings and their respective counts as a second group of counts; the multiple sample two-dimensional arrays correspond to multiple second groups of counts; and determining the similarity value between the two-dimensional array to be matched and at least one sample two-dimensional array includes:
Determining a first parameter value from the first group of counts using a preset first parameter algorithm, and determining a second parameter value from the second group of counts using a preset second parameter algorithm;
Determining a third group of segmented substrings that appear in both the two-dimensional array to be matched and the sample two-dimensional array, and determining a third group of counts and a fourth group of counts recorded for the third group of segmented substrings in the two-dimensional array to be matched and in the sample two-dimensional array, respectively;
Determining a third parameter value from the third group of counts and the fourth group of counts using a preset third parameter algorithm;
Determining, based on a cosine distance calculation method, the similarity value between the two-dimensional array to be matched and the corresponding sample two-dimensional array according to the first parameter value, the second parameter value and the third parameter value.
19. A device for recognizing a violated webpage, characterized in that the device includes:
A first determining module, configured to determine a two-dimensional array to be matched that corresponds to the webpage text of a webpage to be matched, the two-dimensional array to be matched including the segmented substrings obtained by word segmentation of the webpage text and the number of times each segmented substring occurs in the webpage text;
An acquisition module, configured to obtain, from a sample library, multiple sample two-dimensional arrays corresponding to multiple violated webpages, each of the sample two-dimensional arrays including the segmented substrings obtained by word segmentation of the corresponding violated webpage text and the number of times each segmented substring occurs in that violated webpage text;
A second determining module, configured to determine the similarity value between the two-dimensional array to be matched and at least one sample two-dimensional array, thereby obtaining the similarity value corresponding to the at least one sample two-dimensional array;
A third determining module, configured to determine that the webpage to be matched is a violated webpage if the second determining module determines that at least one similarity value is greater than a first preset threshold.
20. The device according to claim 19, characterized in that the segmented substrings of the webpage text are defined as a first group of segmented substrings and their respective counts as a first group of counts; the segmented substrings of the text of each violated webpage among the multiple violated webpages are defined as a second group of segmented substrings and their respective counts as a second group of counts; the multiple sample two-dimensional arrays correspond to multiple second groups of counts; the second determining module includes:
A first determining unit, configured to determine a first parameter value from the first group of counts using a preset first parameter algorithm, and to determine a second parameter value from the second group of counts using a preset second parameter algorithm;
A second determining unit, configured to determine a third group of segmented substrings that appear in both the two-dimensional array to be matched and the sample two-dimensional array, and to determine a third group of counts and a fourth group of counts recorded for the third group of segmented substrings in the two-dimensional array to be matched and in the sample two-dimensional array, respectively;
A third determining unit, configured to determine a third parameter value, using a preset third parameter algorithm, from the third group of counts and the fourth group of counts determined by the second determining unit;
A fourth determining unit, configured to determine, based on a cosine distance calculation method, the similarity value between the two-dimensional array to be matched and the corresponding sample two-dimensional array according to the first parameter value and the second parameter value determined by the first determining unit and the third parameter value determined by the third determining unit.
CN201610819394.2A 2016-01-28 2016-09-12 Recognize method, device and the server of violated webpage Pending CN107103012A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610061273 2016-01-28
CN2016100612736 2016-01-28

Publications (1)

Publication Number Publication Date
CN107103012A true CN107103012A (en) 2017-08-29

Family

ID=59658750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610819394.2A Pending CN107103012A (en) 2016-01-28 2016-09-12 Recognize method, device and the server of violated webpage

Country Status (1)

Country Link
CN (1) CN107103012A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020107835A1 (en) * 2018-11-26 2020-06-04 平安科技(深圳)有限公司 Sample data processing method and device
CN112199569A (en) * 2020-10-29 2021-01-08 重庆撼地大数据有限公司 Method and system for identifying prohibited website, computer equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1529263A (en) * 2003-09-18 2004-09-15 北京邮电大学 Chinese text auto-segmenting and text plagiarism discrimination device and method
US20050273706A1 (en) * 2000-08-24 2005-12-08 Yahoo! Inc. Systems and methods for identifying and extracting data from HTML pages
CN102208992A (en) * 2010-06-13 2011-10-05 天津海量信息技术有限公司 Internet-facing filtration system of unhealthy information and method thereof
CN102541874A (en) * 2010-12-16 2012-07-04 中国移动通信集团公司 Webpage text content extracting method and device
CN102663093A (en) * 2012-04-10 2012-09-12 中国科学院计算机网络信息中心 Method and device for detecting bad website

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
龚静: "《中文文本聚类研究》", 31 March 2012, 中国传媒大学出版社 *

Similar Documents

Publication Publication Date Title
US20200195667A1 (en) Url attack detection method and apparatus, and electronic device
CN106131071B (en) A kind of Web method for detecting abnormality and device
CN108833409B (en) Webshell detection method and device based on deep learning and semi-supervised learning
CN107204960B (en) Webpage identification method and device and server
CN107943954A (en) Detection method, device and the electronic equipment of webpage sensitive information
CN104899508B (en) A kind of multistage detection method for phishing site and system
CN106709345A (en) Deep learning method-based method and system for deducing malicious code rules and equipment
CN103617213B (en) Method and system for identifying newspage attributive characters
US8335750B1 (en) Associative pattern memory with vertical sensors, amplitude sampling, adjacent hashes and fuzzy hashes
CN107819790A (en) The recognition methods of attack message and device
CN109145030B (en) Abnormal data access detection method and device
CN105528422A (en) Focused crawler processing method and apparatus
CN106126719A (en) Information processing method and device
CN110138794A (en) A kind of counterfeit website identification method, device, equipment and readable storage medium storing program for executing
CN111368061B (en) Short text filtering method, device, medium and computer equipment
CN109784059B (en) Trojan file tracing method, system and equipment
CN107577944A (en) Website malicious code detecting method and device based on code syntax analyzer
EP3893128A1 (en) Crawler data recognition method, system and device
CN112200196A (en) Phishing website detection method, device, equipment and computer readable storage medium
CN107798080A (en) A kind of similar sample set construction method towards fishing URL detections
CN110909361A (en) Vulnerability detection method and device and computer equipment
CN110036367A (en) A kind of verification method and Related product of AI operation result
CN112199569A (en) Method and system for identifying prohibited website, computer equipment and storage medium
CN107992402A (en) Blog management method and log management apparatus
CN107103012A (en) Recognize method, device and the server of violated webpage

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170829