CN107103012A - Method, device and server for recognizing violated webpages
- Publication number
- CN107103012A (application CN201610819394.2A)
- Authority
- CN
- China
- Prior art keywords
- dimensional array
- webpage
- group
- violated
- matched
- Legal status
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Probability & Statistics with Applications (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Fuzzy Systems (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The application provides a method, device and server for recognizing violated webpages. The method includes: determining a first two-dimensional array corresponding to the webpage text of a webpage to be matched, where the first two-dimensional array includes all the words obtained by segmenting the webpage text and the number of times each word occurs in the webpage text; obtaining, from a sample library, multiple second two-dimensional arrays corresponding to multiple violated webpages, where each second two-dimensional array includes all the words obtained by segmenting the corresponding violated webpage and the number of times each word occurs in that violated webpage; determining in turn the similarity value between the first two-dimensional array and each of the second two-dimensional arrays; and, if the largest of the similarity values is greater than a first preset threshold, determining that the webpage to be matched is a violated webpage. This technical scheme avoids the false detection results that prior-art keyword detection produces when keywords are used in variant forms, and improves the accuracy of monitoring webpages to be matched.
Description
Technical field
The application relates to the field of network technology, and in particular to a method, device and server for recognizing violated webpages.
Background
At present, a large number of enterprise users build websites on cloud servers offered by service providers. To ensure that the content of these websites complies with national policies and regulations, a service provider must inspect the content of the webpages and make sure they contain no prohibited content. In the prior art, prohibited content in a webpage is recognized by keyword detection. Because keywords have many variant forms, this detection is easily bypassed by illegitimate users, so the accuracy of recognizing violated webpages is low.
Summary of the invention
In view of this, the application provides a new technical scheme that improves the accuracy of recognizing violated webpages.
To achieve the above object, the application provides the following technical scheme:
According to a first aspect of the application, a method for recognizing violated webpages is provided, including:
determining a first two-dimensional array corresponding to the webpage text of a webpage to be matched, the first two-dimensional array including all the words obtained by segmenting the webpage text and the number of times each word occurs in the webpage text;
obtaining, from a sample library, multiple second two-dimensional arrays corresponding to multiple violated webpages, each of the second two-dimensional arrays including all the words obtained by segmenting the corresponding violated webpage and the number of times each word occurs in that violated webpage;
determining in turn the similarity value between the first two-dimensional array and each of the second two-dimensional arrays, to obtain multiple similarity values, one for each second two-dimensional array;
if the largest of the multiple similarity values is greater than a first preset threshold, determining that the webpage to be matched is a violated webpage.
According to a second aspect of the application, a device for recognizing violated webpages is provided, including:
a first determining module, configured to determine a first two-dimensional array corresponding to the webpage text of a webpage to be matched, the first two-dimensional array including all the words obtained by segmenting the webpage text and the number of times each word occurs in the webpage text;
an acquisition module, configured to obtain, from a sample library, multiple second two-dimensional arrays corresponding to multiple violated webpages, each of the second two-dimensional arrays including all the words obtained by segmenting the corresponding violated webpage and the number of times each word occurs in that violated webpage;
a second determining module, configured to determine in turn the similarity value between the first two-dimensional array obtained by the first determining module and each of the second two-dimensional arrays obtained by the acquisition module, to obtain multiple similarity values, one for each second two-dimensional array;
a third determining module, configured to determine that the webpage to be matched is a violated webpage if the second determining module determines that the largest of the multiple similarity values is greater than a first preset threshold.
According to a third aspect of the application, a server is provided, including:
a processor, and a memory for storing instructions executable by the processor;
wherein the processor is configured to determine a first two-dimensional array corresponding to the webpage text of a webpage to be matched, the first two-dimensional array including all the words obtained by segmenting the webpage text and the number of times each word occurs in the webpage text;
obtain, from a sample library, multiple second two-dimensional arrays corresponding to multiple violated webpages, each of the second two-dimensional arrays including all the words obtained by segmenting the corresponding violated webpage and the number of times each word occurs in that violated webpage;
determine in turn the similarity value between the first two-dimensional array and each of the second two-dimensional arrays, to obtain multiple similarity values, one for each second two-dimensional array;
and, if the largest of the multiple similarity values is greater than a first preset threshold, determine that the webpage to be matched is a violated webpage.
According to a fourth aspect of the application, a method for recognizing violated webpages is provided, including:
determining a two-dimensional array to be matched corresponding to the webpage text of a webpage to be matched, the two-dimensional array to be matched including the segmented substrings obtained by segmenting the webpage text and the number of times each segmented substring occurs in the webpage text;
obtaining, from a sample library, multiple sample two-dimensional arrays corresponding to multiple violated webpages, each sample two-dimensional array including the segmented substrings obtained by segmenting the corresponding violated webpage text and the number of times each segmented substring occurs in that violated webpage text;
determining the similarity value between the two-dimensional array to be matched and at least one sample two-dimensional array, to obtain the similarity value corresponding to the at least one sample two-dimensional array.
According to a fifth aspect of the application, a device for recognizing violated webpages is provided, including:
a first determining module, configured to determine a two-dimensional array to be matched corresponding to the webpage text of a webpage to be matched, the two-dimensional array to be matched including the segmented substrings obtained by segmenting the webpage text and the number of times each segmented substring occurs in the webpage text;
an acquisition module, configured to obtain, from a sample library, multiple sample two-dimensional arrays corresponding to multiple violated webpages, each sample two-dimensional array including the segmented substrings obtained by segmenting the corresponding violated webpage text and the number of times each segmented substring occurs in that violated webpage text;
a second determining module, configured to determine the similarity value between the two-dimensional array to be matched and at least one sample two-dimensional array, to obtain the similarity value corresponding to the at least one sample two-dimensional array;
a third determining module, configured to determine that the webpage to be matched is a violated webpage if the second determining module determines that at least one similarity value is greater than a first preset threshold.
As the above technical scheme shows, the application obtains from the sample library the multiple second two-dimensional arrays corresponding to multiple violated webpages, and determines whether the webpage to be matched is a violated webpage from the similarity values between the first two-dimensional array and the second two-dimensional arrays. This avoids the false detection results that prior-art keyword detection produces when keywords are used in variant forms, and improves the accuracy of monitoring webpages to be matched.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a method for recognizing violated webpages according to an exemplary embodiment of the present invention;
Fig. 2 is a schematic flowchart of a method for recognizing violated webpages according to another exemplary embodiment of the present invention;
Fig. 3A is a schematic flowchart of a method for recognizing violated webpages according to yet another exemplary embodiment of the present invention;
Fig. 3B is an architecture diagram to which the method for recognizing violated webpages according to a further exemplary embodiment of the present invention applies;
Fig. 4 is a schematic flowchart of a method for recognizing violated webpages according to a further exemplary embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a server according to an exemplary embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a device for recognizing violated webpages according to an exemplary embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a device for recognizing violated webpages according to another exemplary embodiment of the present invention;
Fig. 8 is a schematic structural diagram of a device for recognizing violated webpages according to yet another exemplary embodiment of the present invention;
Fig. 9 is a schematic structural diagram of a device for recognizing violated webpages according to a further exemplary embodiment of the present invention.
Detailed description
Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the application. Rather, they are merely examples of devices and methods consistent with some aspects of the application as detailed in the appended claims.
The terms used in the application are for the purpose of describing particular embodiments only and are not intended to limit the application. The singular forms "a", "said" and "the" used in the application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third and so on may be used in the application to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the application, first information may also be called second information, and similarly second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "at the time of" or "in response to determining".
To further describe the application, the following embodiments are provided:
Fig. 1 is a schematic flowchart of a method for recognizing violated webpages according to an exemplary embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step 101: determine the first two-dimensional array (which may also be called the two-dimensional array to be matched) corresponding to the webpage text of the webpage to be matched. The first two-dimensional array includes all the words obtained by segmenting the webpage text and the number of times each word occurs in the webpage text.
Step 102: obtain, from the sample library, multiple second two-dimensional arrays (which may also be called sample two-dimensional arrays) corresponding to multiple violated webpages. Each second two-dimensional array includes all the words obtained by segmenting the corresponding violated webpage and the number of times each word occurs in that violated webpage.
Step 103: determine in turn the similarity value between the first two-dimensional array and each of the second two-dimensional arrays, obtaining one similarity value for each second two-dimensional array.
Step 104: if the largest of the similarity values is greater than the first preset threshold, determine that the webpage to be matched is a violated webpage.
In step 101 above, in one embodiment, the webpage text of the webpage to be matched may be segmented to obtain each word in the webpage text and the number of times each word occurs in it. The first two-dimensional array corresponding to the webpage text is then determined from the words and their counts, and it represents the content of the webpage to be matched. For example, suppose that segmenting the webpage text yields the words "source code", "need", "similarity", "test", "webpage" and "sample", which occur in the webpage text the following numbers of times: source code = 1, need = 1, similarity = 2, test = 1, webpage = 1, sample = 3. The first two-dimensional array is then obtained from "source code = 1, need = 1, similarity = 2, test = 1, webpage = 1, sample = 3", and this array can represent the webpage to be matched.
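The counting in step 101 can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the function name `build_word_count_array` and the input list are hypothetical, and the input is assumed to be the output of some word segmentation step that is not shown.

```python
from collections import Counter


def build_word_count_array(words):
    """Return [word, count] pairs in first-seen order (a 'two-dimensional array')."""
    counts = Counter(words)  # Counter keeps insertion order on Python 3.7+
    return [[word, count] for word, count in counts.items()]


# Hypothetical segmentation result with the same counts as the example in the text.
segmented = [
    "source code", "need", "similarity", "test", "similarity",
    "webpage", "sample", "sample", "sample",
]
first_array = build_word_count_array(segmented)
```

A mapping from word to count carries the same information and makes the later lookups of shared words cheaper, so an implementation may prefer a dictionary over nested lists.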
Note that a "word" obtained by segmentation in the application is not a word in the narrow sense. It refers to a segmented substring produced by word segmentation, which may be a single character, a word or a phrase. The specific segmentation result depends on the segmentation algorithm, which the application does not restrict.
It should also be understood that the "first two-dimensional array (two-dimensional array to be matched)" in the application is information describing the features of the webpage to be matched, and more specifically information describing its webpage text. In some cases, not every word in the webpage text effectively expresses the features of the webpage. Therefore, depending on actual requirements, some auxiliary processing may be performed while the two-dimensional array is obtained through word segmentation, such as removing stop words from the webpage text or removing "sharing ..." text at the end of the webpage text. The application does not restrict the specific form of this auxiliary processing.
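One possible form of this auxiliary processing is sketched below. Everything here is an assumption made for illustration: the stop-word set, the footer marker string and the function names are hypothetical, since the application does not prescribe any of them.

```python
# Hypothetical stop-word list; a real system would use a language-specific list.
STOP_WORDS = {"the", "a", "of", "and"}


def strip_share_footer(text, marker="Share to"):
    """Drop everything from the last occurrence of the footer marker onward."""
    index = text.rfind(marker)
    return text if index == -1 else text[:index].rstrip()


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Keep only the words that are not in the stop-word set."""
    return [word for word in words if word not in stop_words]


cleaned_text = strip_share_footer("Sample webpage body. Share to friends")
cleaned_words = remove_stop_words(["the", "sample", "of", "webpage"])
```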
In step 102 above, the second two-dimensional array of each violated webpage may be obtained in advance in a manner similar to that described for step 101, and the second two-dimensional arrays are stored in the sample library. When a new violated webpage is found, its second two-dimensional array is added to the sample library. For example, if the sample library stores the second two-dimensional arrays of 100 violated webpages, the number of second two-dimensional arrays is 100. Those skilled in the art will understand that the words and counts recorded in each second two-dimensional array may differ. The name "second two-dimensional array" only serves to distinguish these arrays from the first two-dimensional array of the webpage to be matched, so the application does not restrict the content of the second two-dimensional arrays.
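A sample library of this kind could be held in memory as sketched below. The class name `SampleLibrary`, its methods and the page identifiers are hypothetical, and persistent storage is left out; the application does not describe how the library is stored.

```python
from collections import Counter


class SampleLibrary:
    """In-memory store of word-count arrays for known violated webpages."""

    def __init__(self):
        self._arrays = {}  # page identifier -> Counter of word counts

    def add(self, page_id, words):
        """Store (or replace) the word counts of one violated webpage."""
        self._arrays[page_id] = Counter(words)

    def arrays(self):
        """Return all stored word-count arrays."""
        return list(self._arrays.values())

    def __len__(self):
        return len(self._arrays)


library = SampleLibrary()
library.add("page-1", ["need", "sample", "sample"])
library.add("page-2", ["algorithm", "test"])
```

Adding a newly found violated webpage is then a single `add` call, which matches the update described above.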
In step 103 above, in one embodiment, the similarity between the webpage to be matched and the violated webpages recorded in the sample library may be determined by calculating the Euclidean distance or cosine distance between the first two-dimensional array and each of the second two-dimensional arrays. For example, if the sample library holds 100 violated webpages, each corresponding to one second two-dimensional array, there are 100 second two-dimensional arrays in total, and the similarity calculation yields 100 similarity values.
In step 104 above, for example, if the similarity value between the first two-dimensional array of the webpage to be matched and the second two-dimensional array of the 1st violated webpage in the sample library is the largest of the 100 similarity values, and that similarity value is greater than the first preset threshold, the webpage to be matched can be determined to be a violated webpage.
From the description of steps 103 and 104, it can be understood that if the final goal is only to judge whether the webpage to be matched is a violated webpage, it suffices to find any one violated webpage sample to which the webpage to be matched is sufficiently similar (that is, the similarity of the two-dimensional arrays is greater than the first preset threshold). There is no need to determine which violated webpage sample is most similar, so it is unnecessary to calculate the similarity with every sample two-dimensional array or to find the maximum similarity. In practice, the similarity between the two-dimensional array to be matched and each sample two-dimensional array can be calculated one by one; as soon as one calculated similarity is greater than the first preset threshold, the webpage to be matched can be determined to be a violated webpage and the remaining similarity values need not be calculated.
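The early-exit variant can be sketched as a loop that stops at the first sample whose similarity exceeds the threshold. This is an illustrative sketch: the word counts are held in dictionaries, cosine similarity is used as the measure, and the names `cosine_similarity` and `is_violated` as well as the threshold value are assumptions.

```python
import math


def cosine_similarity(first, second):
    """Cosine similarity of two word -> count dictionaries."""
    shared = first.keys() & second.keys()
    dot = sum(first[word] * second[word] for word in shared)
    norm = (math.sqrt(sum(c * c for c in first.values()))
            * math.sqrt(sum(c * c for c in second.values())))
    return dot / norm if norm else 0.0


def is_violated(page_counts, sample_arrays, first_threshold):
    """Return True as soon as one sample is similar enough; skip the rest."""
    for sample_counts in sample_arrays:
        if cosine_similarity(page_counts, sample_counts) > first_threshold:
            return True
    return False


page = {"need": 1, "test": 2}
samples = [{"algorithm": 3}, {"need": 1, "test": 2}, {"sample": 5}]
flagged = is_violated(page, samples, first_threshold=0.9)
```

Here the loop returns at the second sample, so the third sample is never compared.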
As the above description shows, in this embodiment of the present invention, the second two-dimensional arrays of multiple violated webpages are obtained from the sample library, and whether the webpage to be matched is a violated webpage is determined from the similarity values between the first two-dimensional array and the second two-dimensional arrays. This avoids the false detection results that prior-art keyword detection produces when keywords are used in variant forms, and improves the accuracy of monitoring webpages to be matched.
Fig. 2 is a schematic flowchart of a method for recognizing violated webpages according to another exemplary embodiment of the present invention. This embodiment illustrates how to determine the similarity value between the first two-dimensional array of the webpage to be matched and the second two-dimensional array of a violated webpage in the sample library. As shown in Fig. 2, the method includes the following steps:
Step 201: determine a first parameter value from the first group of counts, and a second parameter value from the second group of counts.
Here, the segmented substrings obtained by segmenting the webpage text are defined as the first group of words and their counts as the first group of counts. The segmented substrings of each violated webpage text are defined as a second group of words and their counts as a second group of counts, so the multiple second two-dimensional arrays (sample two-dimensional arrays) correspond to multiple second groups of counts.
Step 202: determine the third group of words, which appear in both the first two-dimensional array and the second two-dimensional array, and determine the third group of counts and the fourth group of counts recorded for those words in the first two-dimensional array and the second two-dimensional array respectively.
Step 203: determine a third parameter value from the third group of counts and the fourth group of counts.
Step 204: from the first, second and third parameter values, determine the similarity value between the first two-dimensional array and the corresponding second two-dimensional array using the Euclidean distance method or the cosine distance method.
For example, suppose the first two-dimensional array of the webpage to be matched is {source code = a1, need = b1, similarity = c1, test = d1, webpage = e1, sample = f1}, and the second two-dimensional array of the 1st violated webpage in the sample library is {need = b2, similarity = c2, test = d2, webpage = e2, algorithm = m2}. This embodiment is illustrated below using these two arrays.
In step 201 above, in one embodiment, the first group of counts and the second group of counts may each be a one-dimensional array. For example, the one-dimensional array for the first group of counts is [a1, b1, c1, d1, e1, f1], and that for the second group of counts is [b2, c2, d2, e2, m2].
In one embodiment, the square of the count of each word in the first group of words may be calculated to obtain multiple first square values, and the sum of these first square values gives the first parameter value. For example, summing the squares of the counts of all words in the webpage to be matched gives $\alpha = a1^2 + b1^2 + c1^2 + d1^2 + e1^2 + f1^2$. Besides the sum of squares, other forms of calculating the first parameter may be used, such as a sum of cubes or a weighted sum of squares.
In one embodiment, the square of the count of each word in the second group of words may be calculated to obtain multiple second square values, and the sum of these second square values gives the second parameter value. For example, summing the squares of the counts of all words in the violated webpage gives $\beta = b2^2 + c2^2 + d2^2 + e2^2 + m2^2$. Besides the sum of squares, other forms of calculating the second parameter may be used, such as a sum of cubes or a weighted sum of squares.
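The first and second parameter values are plain sums of squared counts, which can be sketched as below. The function name `sum_of_squares` is hypothetical. The first list reuses the counts from the step 101 example; the second list is invented for illustration, because the source only gives the symbols b2, c2, d2, e2 and m2.

```python
def sum_of_squares(counts):
    """Sum of the squared occurrence counts of one word-count array."""
    return sum(count * count for count in counts)


# [a1, b1, c1, d1, e1, f1] from the step 101 example.
first_counts = [1, 1, 2, 1, 1, 3]
# [b2, c2, d2, e2, m2]: hypothetical values, not taken from the source.
second_counts = [2, 1, 1, 2, 1]

alpha = sum_of_squares(first_counts)   # 1 + 1 + 4 + 1 + 1 + 9
beta = sum_of_squares(second_counts)   # 4 + 1 + 1 + 4 + 1
```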
In step 202 above, the third group of words, which appear both in the webpage text and in the 1st violated webpage in the sample library, are "need", "similarity", "test" and "webpage". The word "need" occurs b1 times in the webpage to be matched and b2 times in the 1st violated webpage. By analogy, the numbers of times "similarity", "test" and "webpage" occur in the webpage to be matched and in the 1st violated webpage can be counted, giving the third group of counts and the fourth group of counts. In one embodiment, each of these groups may be a one-dimensional array. For example, the one-dimensional array for the third group of counts is [b1, c1, d1, e1], and that for the fourth group of counts is [b2, c2, d2, e2].
In one embodiment, for each word in the third group of words, its count in the third group of counts is multiplied by its count in the fourth group of counts, giving as many products as there are elements in the third group of counts, and the products are added to obtain the third parameter value. For example, the numbers of times "need", "similarity", "test" and "webpage" occur in the webpage to be matched are multiplied by the numbers of times they occur in the violated webpage, giving four products b1*b2, c1*c2, d1*d2 and e1*e2, which are added to give $\theta = b1 \cdot b2 + c1 \cdot c2 + d1 \cdot d2 + e1 \cdot e2$. Besides the sum of products, other forms of calculating the third parameter may be used, such as a weighted sum of products.
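Steps 202 and 203 can be sketched together: find the words shared by both arrays, then sum the products of their counts. The dictionaries below are illustrative; the counts of the violated webpage are hypothetical values standing in for b2, c2, d2, e2 and m2, and the function name `shared_product_sum` is an assumption.

```python
def shared_product_sum(first, second):
    """Sum of count products over the words present in both arrays."""
    shared_words = first.keys() & second.keys()  # the 'third group of words'
    return sum(first[word] * second[word] for word in shared_words)


first_array = {
    "source code": 1, "need": 1, "similarity": 2,
    "test": 1, "webpage": 1, "sample": 3,
}
# Hypothetical counts for the 1st violated webpage.
second_array = {
    "need": 2, "similarity": 1, "test": 1, "webpage": 2, "algorithm": 1,
}

theta = shared_product_sum(first_array, second_array)  # 1*2 + 2*1 + 1*1 + 1*2
```

Words that occur in only one of the two arrays ("source code", "sample", "algorithm") contribute nothing to this sum, although they still count toward the first and second parameter values.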
In step 204 above, when the similarity value is a cosine distance value, the similarity value between the webpage to be matched and the 1st violated webpage is $\theta / (\sqrt{\alpha} \cdot \sqrt{\beta})$, where the square root (sqrt) is taken of the first and second parameter values. When the similarity value is a Euclidean distance value, it may be determined, based on the Euclidean distance calculation method, from the third group of counts and the fourth group of counts that the third group of words obtained in step 202 above have in the first two-dimensional array and the corresponding second two-dimensional array respectively. [The example formula is not reproduced in the source text.]
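The two measures in step 204 can be sketched as follows. The cosine form follows from the three parameter values defined above. The Euclidean form is an assumption: the source does not reproduce its formula, so the sketch uses the ordinary Euclidean distance between the third and fourth groups of counts. All numeric inputs are hypothetical, and the function names are illustrative.

```python
import math


def cosine_from_parameters(alpha, beta, theta):
    """Cosine similarity from sum-of-squares alpha, beta and product sum theta."""
    denominator = math.sqrt(alpha) * math.sqrt(beta)
    return theta / denominator if denominator else 0.0


def euclidean_distance(third_counts, fourth_counts):
    """Assumed form: Euclidean distance between the counts of the shared words."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(third_counts, fourth_counts)))


# Hypothetical parameter values and shared-word counts.
cosine_value = cosine_from_parameters(alpha=17, beta=11, theta=7)
distance_value = euclidean_distance([1, 2, 1, 1], [2, 1, 1, 2])
```

Note that the two measures point in opposite directions: a larger cosine value means more similar pages, while a larger Euclidean distance means less similar pages, so a distance would have to be converted or compared with an inverted threshold before it is used as a similarity value.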
In this embodiment, the similarity value is calculated from all the words obtained by segmenting the webpage to be matched and all the words obtained by segmenting the violated webpage in the sample library, so it fully represents the similarity between the two. When the similarity value reaches a certain level, the webpage to be matched and the violated webpage are very similar, so the similarity value allows an accurate judgment of whether the webpage to be matched is a violated webpage.
Fig. 3 A show the flow signal of the method for the violated webpage of identification according to another exemplary embodiment of the present invention
Figure, Fig. 3 B show the Organization Chart that the method for the violated webpage of identification in accordance with a further exemplary embodiment of the present invention is applicable;
As shown in Figure 3A, comprise the following steps:
Step 301: determine the first two-dimensional array corresponding to the body text of the webpage to be matched. The first two-dimensional array includes all the words obtained by word segmentation of the body text and the number of times each word occurs in the body text.
Step 302: obtain from the sample library multiple second two-dimensional arrays corresponding to multiple violated webpages. Each second two-dimensional array includes all the words obtained by word segmentation of the corresponding violated webpage and the number of times each word occurs in that violated webpage.
Step 303: determine in turn the similarity value between the first two-dimensional array and each of the multiple second two-dimensional arrays, obtaining multiple similarity values, one for each second two-dimensional array.
Step 304: determine whether the maximum similarity value among the multiple similarity values is greater than the first preset threshold. If it is greater than the first preset threshold, perform step 305; if it is less than the first preset threshold, perform step 306.
Step 305: if the maximum similarity value among the multiple similarity values is greater than the first preset threshold, determine that the webpage to be matched is a violated webpage, and the flow ends.
Step 306: if the maximum similarity value among the multiple similarity values is less than the first preset threshold, determine whether the maximum similarity value is greater than the second preset threshold, where the second preset threshold is less than the first preset threshold. If it is greater than the second preset threshold, perform step 307; if it is less than the second preset threshold, perform step 309.
Step 307: if the maximum similarity value among the multiple similarity values is greater than the second preset threshold, determine that the webpage to be matched is a suspected violated webpage.
Step 308: add the webpage to be matched to the sample library, and the flow ends.
Step 309: if the maximum similarity value among the multiple similarity values is less than the second preset threshold, determine that the webpage to be matched is a normal webpage, and the flow ends.
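The branching in steps 304 to 309 can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `Verdict` and `classify` are hypothetical, and because the text does not say how a similarity value exactly equal to a threshold is handled, the sketch assumes that equality counts as not exceeding the threshold.

```python
from enum import Enum


class Verdict(Enum):
    VIOLATED = "violated"      # step 305
    SUSPECTED = "suspected"    # step 307
    NORMAL = "normal"          # step 309


def classify(similarities, first_threshold, second_threshold):
    """Classify a webpage from its similarity values against the sample library."""
    if not second_threshold < first_threshold:
        raise ValueError("the second threshold must be less than the first")
    # Step 304 works on the maximum similarity value; an empty sample
    # library is treated as similarity 0 (an assumption of this sketch).
    best = max(similarities, default=0.0)
    if best > first_threshold:
        return Verdict.VIOLATED
    # Step 306: only reached when the first threshold was not exceeded.
    if best > second_threshold:
        return Verdict.SUSPECTED
    return Verdict.NORMAL
```

For example, with illustrative thresholds 0.9 and 0.6, `classify([0.2, 0.95], 0.9, 0.6)` yields `Verdict.VIOLATED` and `classify([0.2, 0.7], 0.9, 0.6)` yields `Verdict.SUSPECTED`.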
For steps 301 to 303 and step 305, refer to the description of the embodiment shown in Fig. 1 above; details are not repeated here.
In step 304, in one embodiment, the first preset threshold may be given an initial value and later updated according to the accuracy achieved when monitoring webpages to be matched.
In step 306, the second preset threshold may be set in the same way as the first preset threshold; details are not repeated here.
As shown in Fig. 3B, suppose for example that the sample library contains 100 violated webpages, so that the webpage to be matched has 100 similarity values, one for each violated webpage. If the similarity value between the webpage to be matched and the 1st violated webpage is the largest of the 100 similarity values and is greater than the first preset threshold, step 305 determines that the webpage to be matched is a violated webpage.
If the similarity value between the webpage to be matched and the 1st violated webpage is less than the first preset threshold, the similarity value can be further compared with the second preset threshold, so that the webpage to be matched is not misclassified merely because the sample library does not yet contain enough violated webpages. If the similarity value is less than the second preset threshold, the webpage to be matched can be regarded as a normal webpage. If the similarity value lies between the first preset threshold and the second preset threshold, the webpage to be matched is regarded as a suspected violated webpage, and further manual review can determine whether it is a violated webpage. If it is, the webpage to be matched is updated into the sample library, enriching the types of violated webpages in the sample library.
The violated webpages in the sample library may initially be collected manually, with their body text extracted in the manner of the embodiment shown in Fig. 4 of this application; afterwards, the sample library can be updated in the manner of this embodiment.
In addition to the beneficial technical effects of the above embodiment, in this embodiment, if the similarity value lies between the first preset threshold and the second preset threshold, the webpage to be matched is updated into the sample library. This enriches the sample library automatically and makes later monitoring of webpages to be matched more accurate.
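The sample-library update for suspected violated webpages might look like the sketch below. The function name `update_sample_library` and the optional `confirm` callback are hypothetical: the callback stands in for the manual review mentioned above, and leaving it out corresponds to adding the page automatically. The sample library is assumed here to be a plain list of word-count records.

```python
def update_sample_library(sample_library, page_counts, max_similarity,
                          first_threshold, second_threshold, confirm=None):
    """Append page_counts to sample_library if the page is a suspected violation.

    Returns True when the page was added, False otherwise.
    """
    # A page is "suspected" when its best similarity lies between the
    # second (lower) and first (higher) preset thresholds.
    suspected = second_threshold < max_similarity <= first_threshold
    if not suspected:
        return False
    # Optional review hook: a reviewer may reject the suspected page.
    if confirm is not None and not confirm(page_counts):
        return False
    sample_library.append(page_counts)
    return True
```

Keeping the review step as an injectable callback lets the same function serve both the reviewed and the fully automatic variants described in the text.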
Fig. 4 is a flowchart of a method for identifying violated webpages according to a further exemplary embodiment of the present invention. As shown in Fig. 4, the method includes the following steps:
Step 401: preprocess the content of the webpage to be matched.
Step 402: determine the start line and the end line of the preprocessed webpage to be matched.
Step 403: when the distance between the start line and the end line is greater than a set threshold, determine that the content between the start line and the end line is the body text.
In step 401, characters unrelated to the body text, such as js/css/html comments, may be filtered out, while line breaks are retained.
In step 402, the preprocessed webpage content is read line by line, and a suitable threshold is set according to the type of webpage. When the number of text characters in the current line exceeds that threshold and the line after the current line is still text, the current line may be set as the start line. Reading continues from the start line until a line of length 0 is found after which the text length is also 0; this ensures that no other text block follows that line, and the line is marked as the end line.
In step 403, the distance between the start line and the end line is compared with the set threshold. If the distance is greater than the set threshold, the content between the start line and the end line can be determined to be the body text.
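Steps 401 to 403 can be sketched as below. Several points are assumptions of this sketch rather than details from the text: preprocessing is done with simple regular expressions that also strip tags, the end-line condition is read as "an empty line followed by another empty line", and the values `min_line_chars=20` and `min_span=2` are purely illustrative. The function names are hypothetical.

```python
import re

_SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
_TAG = re.compile(r"<[^>]+>")


def preprocess(html):
    """Step 401: drop scripts, styles, comments and tags but keep line breaks."""
    def keep_newlines(match):
        return "\n" * match.group(0).count("\n")
    for pattern in (_SCRIPT_STYLE, _COMMENT, _TAG):
        html = pattern.sub(keep_newlines, html)
    return [line.strip() for line in html.split("\n")]


def extract_body(lines, min_line_chars=20, min_span=2):
    """Steps 402-403: locate the start and end lines, return the body text or None."""
    start = None
    for i in range(len(lines) - 1):
        # Start line: long enough, and the next line still holds text.
        if len(lines[i]) > min_line_chars and lines[i + 1]:
            start = i
            break
    if start is None:
        return None
    end = len(lines)
    for j in range(start + 1, len(lines) - 1):
        # End line: an empty line that is followed by another empty line.
        if not lines[j] and not lines[j + 1]:
            end = j
            break
    # Step 403: accept the block only if it spans more than min_span lines.
    if end - start <= min_span:
        return None
    return "\n".join(line for line in lines[start:end] if line)
```

A production extractor would more likely use an HTML parser than regular expressions; the regex version is used here only to keep the sketch self-contained.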
In addition, the content marked as body text may be denoised. For example, a "Share..." at the end of the body text can be filtered out, ensuring the accuracy of the extracted body text.
Those skilled in the art will appreciate that, for the violated webpages in the sample library in this application, the method of this embodiment can likewise be used to obtain the body text of each violated webpage, improving the accuracy of webpage identification.
In this embodiment, extracting the body text of the webpage to be matched makes the words under comparison more targeted than keyword matching over the whole webpage in the prior art. At the same time, because content unrelated to the body text has been filtered out, interference from irrelevant content in the sidebar of the webpage to be matched is avoided, so the identification of the webpage to be matched achieves a more accurate hit rate.
Corresponding to the above method for identifying violated webpages, this application also provides the server shown in Fig. 5, which is a schematic structural diagram of a server according to an exemplary embodiment of the present invention. Referring to Fig. 5, at the hardware level the server includes a processor, an internal bus, a network interface, memory and non-volatile storage, and may of course also include other hardware required by the business. The processor reads the corresponding computer program from the non-volatile storage into memory and runs it, forming the device for identifying violated webpages at the logical level. Besides a software implementation, this application does not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the following processing flow is not limited to logic units and may also be hardware or a logic device.
Fig. 6 is a schematic structural diagram of a device for identifying violated webpages according to an exemplary embodiment of the present invention. As shown in Fig. 6, the device may include a first determining module 61, an acquisition module 62, a second determining module 63 and a third determining module 64, where:
the first determining module 61 is configured to determine the first two-dimensional array corresponding to the body text of the webpage to be matched, the first two-dimensional array including all the words obtained by word segmentation of the body text and the number of times each word occurs in the body text;
the acquisition module 62 is configured to obtain from the sample library multiple second two-dimensional arrays corresponding to multiple violated webpages, each second two-dimensional array including all the words obtained by word segmentation of the corresponding violated webpage and the number of times each word occurs in that violated webpage;
the second determining module 63 is configured to determine in turn the similarity value between the first two-dimensional array obtained by the first determining module 61 and each of the multiple second two-dimensional arrays obtained by the acquisition module 62, obtaining multiple similarity values, one for each second two-dimensional array;
the third determining module 64 is configured to determine that the webpage to be matched is a violated webpage if the second determining module 63 determines that the maximum similarity value among the multiple similarity values is greater than the first preset threshold.
According to another specific embodiment herein, the first determining module 61, the acquisition module 62, the second determining module 63 and the third determining module 64 may also be configured with other functions, where:
the first determining module 61 is configured to determine the two-dimensional array to be matched that corresponds to the body text of the webpage to be matched, the two-dimensional array to be matched including the segmented substrings obtained by word segmentation of the body text and the number of times each segmented substring occurs in the body text;
the acquisition module 62 is configured to obtain from the sample library multiple sample two-dimensional arrays corresponding to multiple violated webpages, each sample two-dimensional array including the segmented substrings obtained by word segmentation of the body text of the corresponding violated webpage and the number of times each segmented substring occurs in that body text;
the second determining module 63 is configured to determine the similarity value between the two-dimensional array to be matched obtained by the first determining module 61 and at least one sample two-dimensional array obtained by the acquisition module 62, obtaining the similarity value corresponding to the at least one sample two-dimensional array;
the third determining module 64 is configured to determine that the webpage to be matched is a violated webpage if at least one similarity value is greater than the first preset threshold.
Fig. 7 is a schematic structural diagram of a device for identifying violated webpages according to another exemplary embodiment of the present invention. As shown in Fig. 7, on the basis of the embodiment shown in Fig. 6, the counts corresponding to all the words of the body text are defined as a first group of counts, the counts corresponding to all the words of each violated webpage among the multiple violated webpages are defined as a second group of counts, and the multiple second two-dimensional arrays correspond to multiple second groups of counts. The second determining module 63 may include:
a first determining unit 631, configured to determine a first parameter value according to the first group of counts and a second parameter value according to the second group of counts;
a second determining unit 632, configured to determine a third group of words that appear in both the first two-dimensional array and the second two-dimensional array, and to determine the third group of counts and the fourth group of counts recorded for the third group of words in the first two-dimensional array and the second two-dimensional array respectively;
a third determining unit 633, configured to determine a third parameter value according to the third group of counts and the fourth group of counts determined by the second determining unit 632;
a fourth determining unit 634, configured to determine, based on a cosine distance calculation method, the similarity value between the first two-dimensional array and the corresponding second two-dimensional array according to the first parameter value and the second parameter value determined by the first determining unit 631 and the third parameter value determined by the third determining unit 633.
According to another specific embodiment herein, the first determining unit 631, the second determining unit 632, the third determining unit 633 and the fourth determining unit 634 may also be configured with other functions, where:
the first determining unit 631 is configured to determine the first parameter value from the first group of counts using a preset first parameter algorithm, and to determine the second parameter value from the second group of counts using a preset second parameter algorithm;
the second determining unit 632 is configured to determine a third group of segmented substrings that appear in both the two-dimensional array to be matched and the sample two-dimensional array, and to determine the third group of counts and the fourth group of counts recorded for the third group of segmented substrings in the two-dimensional array to be matched and the sample two-dimensional array respectively;
the third determining unit 633 is configured to determine the third parameter value from the third group of counts and the fourth group of counts determined by the second determining unit 632, using a preset third parameter algorithm;
the fourth determining unit 634 is configured to determine, based on a cosine distance calculation method, the similarity value between the two-dimensional array to be matched and the corresponding sample two-dimensional array according to the first parameter value and the second parameter value determined by the first determining unit 631 and the third parameter value determined by the third determining unit 633.
In one embodiment, the first determining unit 631 may include:
a first computation subunit 6311, configured to calculate the square of the count corresponding to each word in the first group of words, obtaining multiple first square values;
a first addition subunit 6312, configured to calculate the sum of the multiple first square values obtained by the first computation subunit 6311, obtaining the first parameter value.
In one embodiment, the second determining unit 632 may include:
a second computation subunit 6321, configured to calculate the square of the count corresponding to each word in the second group of words, obtaining multiple second square values;
a second addition subunit 6322, configured to calculate the sum of the multiple second square values obtained by the second computation subunit 6321, obtaining the second parameter value.
In one embodiment, the third determining unit 633 may include:
a third computation subunit 6331, configured to multiply the count of each word of the third group of words in the third group of counts by the count of that word in the fourth group of counts, obtaining as many calculation results as there are counts in the third group of counts;
a third addition subunit 6332, configured to add up the multiple calculation results obtained by the third computation subunit 6331, obtaining the third parameter value.
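The three parameter values described above map directly onto the standard cosine formula. The sketch below assumes that each two-dimensional array is represented as a dictionary from word to count and that the parameters are combined as third / (sqrt(first) * sqrt(second)), which is the usual cosine similarity; the text itself only says the calculation is based on a cosine distance method. The function name is hypothetical.

```python
import math


def cosine_similarity(page_counts, sample_counts):
    """Cosine similarity between two word-count mappings (word -> count)."""
    # First parameter value: sum of squared counts of the page to be matched.
    first = sum(n * n for n in page_counts.values())
    # Second parameter value: sum of squared counts of the sample page.
    second = sum(n * n for n in sample_counts.values())
    # Third parameter value: sum of count products over the shared words.
    shared = page_counts.keys() & sample_counts.keys()
    third = sum(page_counts[w] * sample_counts[w] for w in shared)
    if first == 0 or second == 0:
        return 0.0  # an empty page has no similarity to anything
    return third / (math.sqrt(first) * math.sqrt(second))
```

Only the shared words contribute to the third parameter value because a word missing from either page has a count of 0 there, so its product is 0.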
In one embodiment, the second determining module 63 may include:
a first determining unit 635, configured to determine a third group of words that appear in both the first two-dimensional array and the second two-dimensional array, and to determine the third group of counts and the fourth group of counts recorded for the third group of words in the first two-dimensional array and the second two-dimensional array respectively;
a second determining unit 636, configured to determine, based on a Euclidean distance calculation method, the similarity value between the first two-dimensional array and the corresponding second two-dimensional array according to the third group of counts and the fourth group of counts determined by the first determining unit 635.
In this embodiment, the second determining unit 632 and the first determining unit 635 may be merged into the same functional module.
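A Euclidean variant over the shared words could be sketched as follows. The text does not say how a distance is turned into a similarity value, so the mapping 1 / (1 + distance) used here is an assumption of this sketch, as are the function name and the dictionary representation of the arrays.

```python
import math


def euclidean_similarity(page_counts, sample_counts):
    """Similarity derived from the Euclidean distance over shared words."""
    # Third group of words: those present in both arrays.
    shared = page_counts.keys() & sample_counts.keys()
    if not shared:
        return 0.0  # nothing in common, treated as no similarity
    # Third and fourth groups of counts are the counts of the shared
    # words in the page to be matched and in the sample page.
    distance = math.sqrt(sum((page_counts[w] - sample_counts[w]) ** 2
                             for w in shared))
    # Assumed mapping: distance 0 gives 1.0, larger distances approach 0.
    return 1.0 / (1.0 + distance)
```

Because a smaller distance means a closer match, some such monotone decreasing mapping is needed before the result can be compared with the preset thresholds in the same "greater than" sense as the cosine value.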
Fig. 8 is a schematic structural diagram of a device for identifying violated webpages according to another exemplary embodiment of the present invention. As shown in Fig. 8, on the basis of the embodiment shown in Fig. 6, in one embodiment the device may further include:
a fourth determining module 64, configured to determine, if the maximum similarity value among the multiple similarity values obtained by the second determining module 63 is less than the first preset threshold, whether the maximum similarity value is greater than the second preset threshold, where the second preset threshold is less than the first preset threshold;
a fifth determining module 65, configured to determine that the webpage to be matched is a suspected violated webpage if the fourth determining module 64 determines that the maximum similarity value among the multiple similarity values is greater than the second preset threshold;
an adding module 66, configured to add the webpage to be matched to the sample library;
a sixth determining module 67, configured to determine that the webpage to be matched is a normal webpage if the fourth determining module 64 determines that the maximum similarity value among the multiple similarity values is less than the second preset threshold.
Fig. 9 is a schematic structural diagram of a device for identifying webpage content according to a further exemplary embodiment of the present invention. As shown in Fig. 9, on the basis of the embodiment shown in Fig. 6, the first determining module 61 may include:
a word segmentation unit 611, configured to perform word segmentation on the body text of the webpage to be matched, obtaining each word in the body text and the number of times each word occurs in the body text;
a fifth determining unit 612, configured to determine the first two-dimensional array corresponding to the body text from each word obtained by the word segmentation unit 611 and the count corresponding to each word, the first two-dimensional array being used to represent the webpage content of the webpage to be matched.
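Building the first two-dimensional array from segmented words can be sketched as below. The row layout `[word, count]` and the function name are assumptions of this sketch. The default segmenter is a simple regular-expression stand-in that works for space-delimited text; Chinese body text would need a real word segmenter passed in through the hypothetical `segment` parameter.

```python
import re
from collections import Counter


def build_word_count_array(body_text, segment=None):
    """Return rows of [word, count] for every word in body_text."""
    if segment is None:
        # Stand-in segmenter: lowercase runs of word characters.
        def segment(text):
            return re.findall(r"\w+", text.lower())
    counts = Counter(segment(body_text))
    # One row per distinct word, in order of first appearance.
    return [[word, n] for word, n in counts.items()]
```

For instance, `build_word_count_array("Buy pills, buy now")` gives `[["buy", 2], ["pills", 1], ["now", 1]]`.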
In one embodiment, the device may further include:
a preprocessing module 68, configured to preprocess the content of the webpage to be matched;
a seventh determining module 69, configured to determine the start line and the end line of the webpage to be matched after preprocessing by the preprocessing module 68;
an eighth determining module 60, configured to determine that the content between the start line and the end line is the body text when the seventh determining module 69 determines that the distance between the start line and the end line is greater than the set threshold; the first determining module 61 then determines the first two-dimensional array corresponding to the body text obtained by the eighth determining module 60.
As can be seen from the above embodiments, this application obtains from the sample library the multiple second two-dimensional arrays corresponding to multiple violated webpages and uses the similarity values between the first two-dimensional array and the multiple second two-dimensional arrays to determine whether the webpage to be matched is a violated webpage. This avoids the erroneous detection results that prior-art keyword detection produces when keywords are deformed, and improves the accuracy of monitoring webpages to be matched.
Other embodiments of this application will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses or adaptations of this application that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed in this application. The specification and embodiments are to be regarded as exemplary only, and the true scope and spirit of this application are indicated by the following claims.
It should also be noted that the terms "comprise", "include" and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, commodity or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, commodity or device that includes the element.
The above are merely preferred embodiments of this application and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of this application shall fall within the scope of protection of this application.
Claims (20)
1. A method for identifying violated webpages, characterized in that the method comprises:
determining a first two-dimensional array corresponding to the body text of a webpage to be matched, the first two-dimensional array including all the words obtained by word segmentation of the body text and the number of times each word occurs in the body text;
obtaining from a sample library multiple second two-dimensional arrays corresponding to multiple violated webpages, each second two-dimensional array including all the words obtained by word segmentation of the corresponding violated webpage and the number of times each word occurs in that violated webpage;
determining in turn the similarity value between the first two-dimensional array and each of the multiple second two-dimensional arrays, obtaining multiple similarity values, one for each second two-dimensional array;
if the maximum similarity value among the multiple similarity values is greater than a first preset threshold, determining that the webpage to be matched is a violated webpage.
2. The method according to claim 1, characterized in that the counts corresponding to all the words of the body text are defined as a first group of counts, the counts corresponding to all the words of each violated webpage among the multiple violated webpages are defined as a second group of counts, and the multiple second two-dimensional arrays correspond to multiple second groups of counts; the determining in turn of the similarity value between the first two-dimensional array and each of the multiple second two-dimensional arrays comprises:
determining a first parameter value according to the first group of counts, and determining a second parameter value according to the second group of counts;
determining a third group of words that appear in both the first two-dimensional array and the second two-dimensional array, and determining the third group of counts and the fourth group of counts recorded for the third group of words in the first two-dimensional array and the second two-dimensional array respectively;
determining a third parameter value according to the third group of counts and the fourth group of counts;
determining, based on a cosine distance calculation method, the similarity value between the first two-dimensional array and the corresponding second two-dimensional array according to the first parameter value, the second parameter value and the third parameter value.
3. The method according to claim 2, characterized in that the determining of the first parameter value according to the first group of counts comprises:
calculating the square of the count corresponding to each word in the first group of words, obtaining multiple first square values;
calculating the sum of the multiple first square values, obtaining the first parameter value.
4. The method according to claim 2, characterized in that the determining of the second parameter value according to the second group of counts comprises:
calculating the square of the count corresponding to each word in the second group of words, obtaining multiple second square values;
calculating the sum of the multiple second square values, obtaining the second parameter value.
5. The method according to claim 2, characterized in that the determining of the third parameter value according to the third group of counts and the fourth group of counts comprises:
multiplying the count of each word of the third group of words in the third group of counts by the count of that word in the fourth group of counts, obtaining as many calculation results as there are elements in the third group of counts;
adding up the multiple calculation results, obtaining the third parameter value.
6. The method according to claim 2, characterized in that the determining in turn of the similarity value between the first two-dimensional array and each of the multiple second two-dimensional arrays comprises:
determining a third group of words that appear in both the first two-dimensional array and the second two-dimensional array, and determining the third group of counts and the fourth group of counts recorded for the third group of words in the first two-dimensional array and the second two-dimensional array respectively;
determining, based on a Euclidean distance calculation method, the similarity value between the first two-dimensional array and the corresponding second two-dimensional array according to the third group of counts and the fourth group of counts.
7. The method according to claim 1, characterized in that the method further comprises:
if the maximum similarity value among the multiple similarity values is less than the first preset threshold, determining whether the maximum similarity value is greater than a second preset threshold, where the second preset threshold is less than the first preset threshold;
if the maximum similarity value among the multiple similarity values is greater than the second preset threshold, determining that the webpage to be matched is a suspected violated webpage;
adding the webpage to be matched to the sample library;
if the maximum similarity value among the multiple similarity values is less than the second preset threshold, determining that the webpage to be matched is a normal webpage.
8. The method according to claim 1, characterized in that the determining of the first two-dimensional array corresponding to the body text of the webpage to be matched comprises:
performing word segmentation on the body text of the webpage to be matched, obtaining each word in the body text and the number of times each word occurs in the body text;
determining the first two-dimensional array corresponding to the body text from each word and the count corresponding to each word, the first two-dimensional array being used to represent the webpage content of the webpage to be matched.
9. The method according to claim 1, characterized in that the method for determining the body text of the webpage to be matched comprises:
preprocessing the content of the webpage to be matched;
determining the start line and the end line of the preprocessed webpage to be matched;
when the distance between the start line and the end line is greater than a set threshold, determining that the content between the start line and the end line is the body text.
10. a kind of device for recognizing violated webpage, it is characterised in that described device includes:
First determining module, corresponding first two-dimensional array of Web page text for determining webpage to be matched, first two dimension
The number of times that the whole words and each word that array is obtained including the Web page text by participle occur in the Web page text;
Acquisition module, it is the multiple for obtaining multiple second two-dimensional arrays corresponding with multiple violated webpages from Sample Storehouse
Each second two-dimensional array in second two-dimensional array includes whole words that correspondingly violated webpage is obtained by participle and every
The number of times that one word occurs in the violated webpage of the correspondence;
Second determining module, for determining that first two-dimensional array that first determining module is obtained is obtained with described successively
Each corresponding Similarity value of the second two-dimensional array in the multiple second two-dimensional array that module is got, obtains described
Multiple each self-corresponding multiple Similarity values of second two-dimensional array;
3rd determining module, if determining Similarity value maximum in the multiple Similarity value for second determining module
More than the first predetermined threshold value, it is violated webpage to determine the webpage to be matched.
11. device according to claim 10, it is characterised in that each self-corresponding number of times of whole words of the Web page text
Each self-corresponding number of times definition of whole words for each the violated webpage being defined as in first group of number of times, the multiple violated webpage
For second group of number of times, the multiple second group of number of times of multiple second two-dimensional array correspondences;Second determining module includes:
First determining unit, for determining the first parameter value according to first group of number of times, and according to second group of number of times
Determine the second parameter value;
Second determining unit, for determining while appearing in the 3rd group of first two-dimensional array and second two-dimensional array
Word, and determine the 3rd group of number of times that the 3rd group of word is recorded respectively in first two-dimensional array and second two-dimensional array
With the 4th group of number of times;
3rd determining unit, it is secondary for the 3rd group of number of times determined according to second determining unit and described 4th group
Number, determines the 3rd parameter value;
4th determining unit, for first parameter value, the second determination list determined according to first determining unit
The 3rd parameter value that second parameter value of member determination, the 3rd determining unit are determined, is calculated based on COS distance
Method, determines the Similarity value of first two-dimensional array and corresponding second two-dimensional array.
12. The device according to claim 10, characterized in that the second determining module comprises:
a first determining unit, configured to determine a third group of words that appear in both the first two-dimensional array and the second two-dimensional array, and to determine a third group of counts and a fourth group of counts recorded for the third group of words in the first two-dimensional array and the second two-dimensional array respectively;
a second determining unit, configured to determine, based on a Euclidean distance calculation method, the similarity value between the first two-dimensional array and the corresponding second two-dimensional array according to the third group of counts and the fourth group of counts determined by the first determining unit.
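For illustration only (not part of the claim): the claim states that the similarity value is based on the Euclidean distance between the third and fourth groups of counts, but not how a distance is turned into a similarity. The sketch below assumes the mapping 1 / (1 + distance) and assumes that two arrays with no shared words score 0; both choices, and all names, are hypothetical.

```python
import math


def euclidean_similarity(page_counts, sample_counts):
    """Similarity derived from the Euclidean distance over shared words.

    Arguments map word -> count and stand in for the two-dimensional
    arrays (an assumption of this sketch).
    """
    shared = sorted(page_counts.keys() & sample_counts.keys())
    if not shared:
        return 0.0  # assumed: no shared words means no similarity
    third = [page_counts[w] for w in shared]     # counts in the page
    fourth = [sample_counts[w] for w in shared]  # counts in the sample
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(third, fourth)))
    # Assumed mapping: distance 0 gives 1.0, large distances approach 0.
    return 1.0 / (1.0 + distance)
```

Because only shared words enter the distance, this variant ignores words that occur in just one of the two pages, which is one reason the empty-overlap case needs explicit handling.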
13. The device according to claim 10, characterized in that the device further comprises:
a fourth determining module, configured to determine, if the largest of the multiple similarity values obtained by the first determining module is less than the first preset threshold, whether the largest of the multiple similarity values is greater than a second preset threshold, wherein the second preset threshold is less than the first preset threshold;
a fifth determining module, configured to determine that the webpage to be matched is a suspected violated webpage if the fourth determining module determines that the largest of the multiple similarity values is greater than the second preset threshold;
an adding module, configured to add the webpage to be matched to the sample library;
a sixth determining module, configured to determine that the webpage to be matched is a normal webpage if the fourth determining module determines that the largest of the multiple similarity values is less than the second preset threshold.
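For illustration only (not part of the claim): the two thresholds split pages into three outcomes. The sketch below uses hypothetical threshold values (0.8 and 0.5) and hypothetical labels. The claims do not say what happens when the largest similarity value equals a threshold, so this sketch assumes equality falls into the lower tier.

```python
def classify(similarities, first_threshold=0.8, second_threshold=0.5):
    """Return 'violated', 'suspected', or 'normal' for a page.

    `similarities` holds one similarity value per sample in the sample
    library. Threshold defaults are placeholders, not values from the
    patent.
    """
    if second_threshold >= first_threshold:
        raise ValueError("second threshold must be below the first")
    best = max(similarities, default=0.0)  # empty library: nothing matches
    if best > first_threshold:
        return "violated"
    if best > second_threshold:
        return "suspected"
    return "normal"
```

A caller could, for example, add the page to the sample library when the result is `"suspected"`, as the adding module of claim 13 describes.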
14. The device according to claim 10, characterized in that the first determining module comprises:
a word segmentation unit, configured to perform word segmentation on the webpage text of the webpage to be matched, to obtain each word in the webpage text and the number of times each word occurs in the webpage text;
a fifth determining unit, configured to determine the first two-dimensional array corresponding to the webpage text from each word obtained by the word segmentation unit and the count corresponding to each word, the first two-dimensional array being used to represent the webpage content of the webpage to be matched.
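For illustration only (not part of the claim): the first two-dimensional array can be pictured as a list of `[word, count]` rows. The sketch below uses a regular expression as a stand-in segmenter, which is an assumption: it splits on non-word characters and would treat an unbroken run of Chinese characters as a single word, so a dedicated Chinese word segmenter would be needed for real Chinese webpage text. The function name is hypothetical.

```python
import re
from collections import Counter


def build_word_count_array(webpage_text):
    """Build a two-dimensional array of [word, count] rows.

    Segmentation here is a simple regex split (a placeholder for a real
    word segmenter); rows are sorted by word for a stable layout.
    """
    words = re.findall(r"\w+", webpage_text.lower())
    counts = Counter(words)
    return [[word, n] for word, n in sorted(counts.items())]
```

For instance, the text "Win cash now, win big" yields four rows, with the row for "win" carrying a count of 2.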
15. The device according to claim 10, characterized in that the device further comprises:
a preprocessing module, configured to preprocess the content of the webpage to be matched;
a seventh determining module, configured to determine a start row and an end row of the webpage to be matched after preprocessing by the preprocessing module;
an eighth determining module, configured to determine that the content between the start row and the end row is the webpage text when the distance between the start row and the end row determined by the seventh determining module is greater than a set threshold.
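For illustration only (not part of the claim): the claim does not say how the preprocessing works or how the start and end rows are chosen. The sketch below assumes that preprocessing strips script and style blocks and HTML tags, and that the start and end rows are the first and last rows holding at least `min_line_chars` characters. These heuristics, the parameter names, and the default values are all hypothetical.

```python
import re


def extract_webpage_text(html, min_line_chars=20, min_row_distance=2):
    """Return the text between the start row and end row, or None."""
    # Pre-processing (assumed): drop script/style blocks, then turn every
    # remaining tag into a line break and keep the non-empty rows.
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", "", html)
    text = re.sub(r"(?s)<[^>]+>", "\n", html)
    rows = [row.strip() for row in text.splitlines() if row.strip()]
    # Start/end rows (assumed): first and last sufficiently long rows.
    long_rows = [i for i, row in enumerate(rows) if len(row) >= min_line_chars]
    if not long_rows:
        return None
    start, end = long_rows[0], long_rows[-1]
    # Only accept the span when the rows are far enough apart.
    if end - start > min_row_distance:
        return "\n".join(rows[start:end + 1])
    return None
```

Short navigation or footer rows before the start row or after the end row are dropped, and a page whose long rows sit too close together yields no webpage text at all.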
16. A server, characterized in that the server comprises:
a processor; and a memory for storing instructions executable by the processor;
wherein the processor is configured to: determine a first two-dimensional array corresponding to the webpage text of a webpage to be matched, the first two-dimensional array comprising all the words obtained from the webpage text by word segmentation and the number of times each word occurs in the webpage text;
obtain, from a sample library, multiple second two-dimensional arrays corresponding to multiple violated webpages, each second two-dimensional array among the multiple second two-dimensional arrays comprising all the words obtained from the corresponding violated webpage by word segmentation and the number of times each word occurs in the corresponding violated webpage;
determine in turn the similarity value between the first two-dimensional array and each second two-dimensional array among the multiple second two-dimensional arrays, to obtain the multiple similarity values corresponding to the multiple second two-dimensional arrays;
if the largest of the multiple similarity values is greater than a first preset threshold, determine that the webpage to be matched is a violated webpage.
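For illustration only (not part of the claim): the processor steps can be strung together as below. The sketch assumes a regex stand-in for word segmentation, a cosine-style similarity, a sample library held as a plain list of texts, and a placeholder threshold of 0.8. None of these details come from the claim, and all names are hypothetical.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Word -> count mapping (stand-in for a two-dimensional array)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine-style similarity between two Counters (assumed measure)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def is_violated(webpage_text, sample_texts, first_threshold=0.8):
    """True when the largest similarity exceeds the first threshold."""
    page = word_counts(webpage_text)
    similarities = [cosine(page, word_counts(s)) for s in sample_texts]
    return max(similarities, default=0.0) > first_threshold
```

In practice the sample arrays would be computed once and stored, rather than rebuilt from text on every call as this short sketch does.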
17. A method for recognizing a violated webpage, characterized in that the method comprises:
determining a two-dimensional array to be matched corresponding to the webpage text of a webpage to be matched, the two-dimensional array to be matched comprising: the segmented substrings obtained from the webpage text by word segmentation and the number of times each segmented substring occurs in the webpage text;
obtaining, from a sample library, multiple sample two-dimensional arrays corresponding to multiple violated webpages, each sample two-dimensional array among the multiple sample two-dimensional arrays comprising: the segmented substrings obtained from the corresponding violated webpage text by word segmentation and the number of times each segmented substring occurs in the corresponding violated webpage text;
determining the similarity value between the two-dimensional array to be matched and at least one sample two-dimensional array, to obtain the similarity value corresponding to the at least one sample two-dimensional array.
18. The method according to claim 17, characterized in that the segmented substrings of the webpage text are defined as a first group of segmented substrings and their corresponding counts are defined as a first group of counts, the segmented substrings of each violated webpage text among the multiple violated webpages are defined as a second group of segmented substrings and their corresponding counts are defined as a second group of counts, and the multiple sample two-dimensional arrays correspond to multiple second groups of counts; determining the similarity value between the two-dimensional array to be matched and at least one sample two-dimensional array comprises:
determining a first parameter value from the first group of counts using a preset first parameter algorithm, and determining a second parameter value from the second group of counts using a preset second parameter algorithm;
determining a third group of segmented substrings that appear in both the two-dimensional array to be matched and the sample two-dimensional array, and determining a third group of counts and a fourth group of counts recorded for the third group of segmented substrings in the two-dimensional array to be matched and the sample two-dimensional array respectively;
determining a third parameter value from the third group of counts and the fourth group of counts using a preset third parameter algorithm;
determining, based on a cosine distance calculation method, the similarity value between the two-dimensional array to be matched and the corresponding sample two-dimensional array according to the first parameter value, the second parameter value, and the third parameter value.
19. A device for recognizing a violated webpage, characterized in that the device comprises:
a first determining module, configured to determine a two-dimensional array to be matched corresponding to the webpage text of a webpage to be matched, the two-dimensional array to be matched comprising the segmented substrings obtained from the webpage text by word segmentation and the number of times each segmented substring occurs in the webpage text;
an acquisition module, configured to obtain, from a sample library, multiple sample two-dimensional arrays corresponding to multiple violated webpages, each sample two-dimensional array among the multiple sample two-dimensional arrays comprising: the segmented substrings obtained from the corresponding violated webpage text by word segmentation and the number of times each segmented substring occurs in the corresponding violated webpage text;
a second determining module, configured to determine the similarity value between the two-dimensional array to be matched and at least one sample two-dimensional array, to obtain the similarity value corresponding to the at least one sample two-dimensional array;
a third determining module, configured to determine that the webpage to be matched is a violated webpage if the second determining module determines that at least one similarity value is greater than a first preset threshold.
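For illustration only (not part of the claim): because this claim only requires that at least one similarity value exceed the first preset threshold, the comparison against the sample library can stop at the first match instead of scoring every sample. The sketch below shows that early exit. The `overlap_ratio` function is a toy similarity invented for the demonstration, and all names are hypothetical.

```python
def first_match_index(page_counts, sample_arrays, similarity, first_threshold):
    """Index of the first sample whose similarity exceeds the threshold.

    Returns None when no sample matches. Samples after the first match
    are never scored.
    """
    for index, sample_counts in enumerate(sample_arrays):
        if similarity(page_counts, sample_counts) > first_threshold:
            return index
    return None


def overlap_ratio(a, b):
    """Toy similarity for this demo only: shared words / all words."""
    union = a.keys() | b.keys()
    return len(a.keys() & b.keys()) / len(union) if union else 0.0
```

The verdict is the same as comparing the largest similarity value with the threshold, since some value exceeds the threshold exactly when the largest one does.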
20. The device according to claim 19, characterized in that the segmented substrings of the webpage text are defined as a first group of segmented substrings and their corresponding counts are defined as a first group of counts, the segmented substrings of each violated webpage text among the multiple violated webpages are defined as a second group of segmented substrings and their corresponding counts are defined as a second group of counts, and the multiple sample two-dimensional arrays correspond to multiple second groups of counts; the second determining module comprises:
a first determining unit, configured to determine a first parameter value from the first group of counts using a preset first parameter algorithm, and to determine a second parameter value from the second group of counts using a preset second parameter algorithm;
a second determining unit, configured to determine a third group of segmented substrings that appear in both the two-dimensional array to be matched and the sample two-dimensional array, and to determine a third group of counts and a fourth group of counts recorded for the third group of segmented substrings in the two-dimensional array to be matched and the sample two-dimensional array respectively;
a third determining unit, configured to determine a third parameter value from the third group of counts and the fourth group of counts determined by the second determining unit, using a preset third parameter algorithm;
a fourth determining unit, configured to determine, based on a cosine distance calculation method, the similarity value between the two-dimensional array to be matched and the corresponding sample two-dimensional array according to the first parameter value determined by the first determining unit, the second parameter value determined by the second determining unit, and the third parameter value determined by the third determining unit.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610061273 | 2016-01-28 | ||
CN2016100612736 | 2016-01-28 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107103012A (en) | 2017-08-29 |
Family ID=59658750
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610819394.2A Pending CN107103012A (en) | 2016-01-28 | 2016-09-12 | Recognize method, device and the server of violated webpage |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107103012A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020107835A1 (en) * | 2018-11-26 | 2020-06-04 | 平安科技(深圳)有限公司 | Sample data processing method and device |
CN112199569A (en) * | 2020-10-29 | 2021-01-08 | 重庆撼地大数据有限公司 | Method and system for identifying prohibited website, computer equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1529263A (en) * | 2003-09-18 | 2004-09-15 | 北京邮电大学 | Chinese text auto-segmenting and text plagiarism discrimination device and method |
US20050273706A1 (en) * | 2000-08-24 | 2005-12-08 | Yahoo! Inc. | Systems and methods for identifying and extracting data from HTML pages |
CN102208992A (en) * | 2010-06-13 | 2011-10-05 | 天津海量信息技术有限公司 | Internet-facing filtration system of unhealthy information and method thereof |
CN102541874A (en) * | 2010-12-16 | 2012-07-04 | 中国移动通信集团公司 | Webpage text content extracting method and device |
CN102663093A (en) * | 2012-04-10 | 2012-09-12 | 中国科学院计算机网络信息中心 | Method and device for detecting bad website |
Non-Patent Citations (1)
Title |
---|
龚静: "《中文文本聚类研究》", 31 March 2012, 中国传媒大学出版社 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20200195667A1 (en) | Url attack detection method and apparatus, and electronic device | |
CN106131071B (en) | A kind of Web method for detecting abnormality and device | |
CN108833409B (en) | Webshell detection method and device based on deep learning and semi-supervised learning | |
CN107204960B (en) | Webpage identification method and device and server | |
CN107943954A (en) | Detection method, device and the electronic equipment of webpage sensitive information | |
CN104899508B (en) | A kind of multistage detection method for phishing site and system | |
CN106709345A (en) | Deep learning method-based method and system for deducing malicious code rules and equipment | |
CN103617213B (en) | Method and system for identifying newspage attributive characters | |
US8335750B1 (en) | Associative pattern memory with vertical sensors, amplitude sampling, adjacent hashes and fuzzy hashes | |
CN107819790A (en) | The recognition methods of attack message and device | |
CN109145030B (en) | Abnormal data access detection method and device | |
CN105528422A (en) | Focused crawler processing method and apparatus | |
CN106126719A (en) | Information processing method and device | |
CN110138794A (en) | A kind of counterfeit website identification method, device, equipment and readable storage medium storing program for executing | |
CN111368061B (en) | Short text filtering method, device, medium and computer equipment | |
CN109784059B (en) | Trojan file tracing method, system and equipment | |
CN107577944A (en) | Website malicious code detecting method and device based on code syntax analyzer | |
EP3893128A1 (en) | Crawler data recognition method, system and device | |
CN112200196A (en) | Phishing website detection method, device, equipment and computer readable storage medium | |
CN107798080A (en) | A kind of similar sample set construction method towards fishing URL detections | |
CN110909361A (en) | Vulnerability detection method and device and computer equipment | |
CN110036367A (en) | A kind of verification method and Related product of AI operation result | |
CN112199569A (en) | Method and system for identifying prohibited website, computer equipment and storage medium | |
CN107992402A (en) | Blog management method and log management apparatus | |
CN107103012A (en) | Recognize method, device and the server of violated webpage |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170829 |