
CN110516710A - Web page classification method, apparatus, computer device and computer-readable storage medium - Google Patents


Info

Publication number
CN110516710A
Authority
CN
China
Prior art keywords
label
feature
web page
webpage
training set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910677072.2A
Other languages
Chinese (zh)
Inventor
林鹏
吴潇
黄九鸣
张圣栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Xinghan Shuzhi Technology Co Ltd
Original Assignee
Hunan Xinghan Shuzhi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan Xinghan Shuzhi Technology Co Ltd
Priority to CN201910677072.2A
Publication of CN110516710A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention belongs to the field of Internet technology and provides a web page classification method, an apparatus, a computer device and a computer-readable storage medium. The method comprises: taking the URL links and HTML source code of acquired topic-type web pages, together with the URL links and HTML source code of acquired list-type web pages, as a training set; extracting web page URL features from the URL links of the training set, extracting tag features from the HTML source code of the training set, and taking as page features the number of URL links in the HTML source code whose similarity to the associated URL link is greater than a preset threshold, together with the size of the HTML source code; vectorizing the web page URL features, tag features and page features of the training set and inputting them into a random forest model for training to obtain a classifier; and obtaining the web page URL features, tag features and page features of a test set and inputting them into the classifier to obtain a web page classification result. The web page classification method provided by the invention can improve the accuracy of web page classification.

Description

Web page classification method, apparatus, computer device and computer-readable storage medium
Technical field
The invention belongs to the field of Internet technology, and in particular relates to a web page classification method, an apparatus, a computer device and a computer-readable storage medium.
Background
With the rapid development of the Internet, the number and variety of web pages have grown quickly, making it harder for people to obtain valuable information from them. To make full use of web page content, web pages need to be classified. Current web page classification technology classifies pages mainly in a semi-automatic manner, in which a classification algorithm and manual review work together. In the algorithm stage, a traditional classification algorithm such as naive Bayes, a decision tree or a support vector machine makes a preliminary judgment of the page type; in the manual review stage, a person then checks the result. Approaches that use a classical decision tree or a similar algorithm in the algorithm stage classify pages by analyzing the HTML structure features of different pages, and have two main shortcomings: (1) a decision tree cannot learn online, tends to overfit, and easily falls into a locally optimal solution during classification; (2) few HTML structure features are selected and those considered are not comprehensive, which readily degrades the final classification result and leads to poor accuracy. When a large volume of web data needs to be classified, the semi-automatic approach cannot meet the demand, and relying on manual review makes the technique scale poorly and incurs a high time cost. Existing web page classification technology therefore suffers from low accuracy, poor scalability and high time cost.
Summary of the invention
Embodiments of the present invention provide a web page classification method intended to solve the problems of low accuracy, poor scalability and high time cost in existing web page classification technology.
The invention is realized as a web page classification method, comprising:
taking the URL links and HTML source code of N acquired topic-type web pages and the URL links and HTML source code of N acquired list-type web pages as a training set;
extracting the web page URL features of the training set from the URL links of the training set, extracting the tag features of the training set from the HTML source code of the training set, and taking as the page features of the training set the number of URL links in the HTML source code of the training set whose similarity to the associated URL link is greater than a preset threshold, together with the size of the HTML source code;
vectorizing the web page URL features, the tag features and the page features of the training set, and inputting the vectorized web page URL features, vectorized tag features and vectorized page features into a random forest model for training to obtain a classifier;
taking the URL links and HTML source code of M acquired topic-type web pages and the URL links and HTML source code of M acquired list-type web pages as a test set;
obtaining the web page URL features, tag features and page features of the test set, and inputting them into the classifier to obtain a web page classification result.
Optionally, extracting the web page URL features of the training set from the URL links of the training set includes the following process:
determining whether each URL link of the training set contains a time feature, a domain name feature and a negative feature, and taking the results as the web page URL features of the training set.
Optionally, extracting the tag features of the training set from the HTML source code of the training set includes the following process:
deleting the noise tags in the HTML source code of the training set, together with the content corresponding to the noise tags, to obtain effective tags;
obtaining the tag features of the effective tags, the tag features including: tag number, tag text length, left tag length, right tag length, number of punctuation marks in the tag text, tag level, leaf tag merge count, non-leaf tag merge count and total tag merge count;
ranking the effective tags using a recursive elimination algorithm, and selecting the R highest-ranked tags as retained tags;
where the retained tags contain no tags of the same type, taking the tag features of the retained tags as the tag features of the training set;
where the retained tags contain tags of the same type, merging the tags of the same type among the retained tags, determining the tag features of the merged same-type tags, and taking the tag features of the retained tags other than the same-type tags, together with the determined tag features, as the tag features of the training set.
Optionally, obtaining the web page URL features, tag features and page features of the test set includes the following process:
extracting the web page URL features of the test set from the URL links of the test set, extracting the tag features of the test set from the HTML source code of the test set, and taking as the page features of the test set the number of URL links in the HTML source code of the test set whose similarity to the associated URL link is greater than a preset threshold, together with the size of the HTML source code.
Optionally, after obtaining the web page URL features, tag features and page features of the test set and inputting them into the classifier to obtain the web page classification result, the web page classification method further includes the following process:
determining whether the classification precision and recall of the web page classification result are greater than a preset threshold;
where the classification precision and recall of the web page classification result are greater than the preset threshold, taking the web page classification result as the final result; where the classification precision and recall of the web page classification result are less than or equal to the preset threshold, adjusting the configuration parameters of the classifier until a web page classification result whose classification precision and recall are greater than the preset threshold is obtained.
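The acceptance check described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `precision_recall` and `accept_result`, the choice of a single positive class, and the use of one shared threshold for both metrics are assumptions made for the example, and "classification precision" is read here as the usual precision metric.

```python
def precision_recall(y_true, y_pred, positive):
    """Compute precision and recall for one class (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def accept_result(y_true, y_pred, positive, threshold):
    """Return True only if both metrics are strictly greater than the threshold."""
    precision, recall = precision_recall(y_true, y_pred, positive)
    return precision > threshold and recall > threshold
```

If `accept_result` returns False, the text calls for adjusting the classifier's configuration parameters and classifying again; which parameters are adjusted is not specified in this passage, so that loop is left out of the sketch.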
The present invention also provides a web page classification apparatus, comprising:
a first acquisition module, configured to take the URL links and HTML source code of N acquired topic-type web pages and the URL links and HTML source code of N acquired list-type web pages as a training set;
a processing module, configured to extract the web page URL features of the training set from the URL links of the training set, extract the tag features of the training set from the HTML source code of the training set, and take as the page features of the training set the number of URL links in the HTML source code of the training set whose similarity to the associated URL link is greater than a preset threshold, together with the size of the HTML source code;
a training module, configured to vectorize the web page URL features, the tag features and the page features of the training set, and to input the vectorized web page URL features, vectorized tag features and vectorized page features into a random forest model for training to obtain a classifier;
a second acquisition module, configured to take the URL links and HTML source code of M acquired topic-type web pages and the URL links and HTML source code of M acquired list-type web pages as a test set;
a classification module, configured to obtain the web page URL features, tag features and page features of the test set and input them into the classifier to obtain a web page classification result.
Optionally, the processing module is further configured to determine whether each URL link of the training set contains a time feature, a domain name feature and a negative feature, and to take the results as the web page URL features of the training set.
Optionally, the processing module further comprises:
a deletion submodule, configured to delete the noise tags in the HTML source code of the training set, together with the content corresponding to the noise tags, to obtain effective tags;
an acquisition submodule, configured to obtain the tag features of the effective tags, the tag features including: tag number, tag text length, left tag length, right tag length, number of punctuation marks in the tag text, tag level, leaf tag merge count, non-leaf tag merge count and total tag merge count;
a ranking submodule, configured to rank the effective tags using a recursive elimination algorithm and select the R highest-ranked tags as retained tags;
a first processing submodule, configured to take the tag features of the retained tags as the tag features of the training set where the retained tags contain no tags of the same type;
a second processing submodule, configured, where the retained tags contain tags of the same type, to merge the tags of the same type among the retained tags, determine the tag features of the merged same-type tags, and take the tag features of the retained tags other than the same-type tags, together with the determined tag features, as the tag features of the training set.
Optionally, the web page classification apparatus further comprises:
a judgment module, configured to determine whether the classification precision and recall of the web page classification result are greater than a preset threshold;
an adjustment module, configured to take the web page classification result as the final result where its classification precision and recall are greater than the preset threshold, and, where its classification precision and recall are less than or equal to the preset threshold, to adjust the configuration parameters of the classifier until a web page classification result whose classification precision and recall are greater than the preset threshold is obtained.
The present invention also provides a computer device comprising a processor, the processor being configured to implement the steps of the web page classification method described above when executing a computer program stored in a memory.
The present invention also provides a computer-readable storage medium on which a computer program is stored, the computer program implementing the steps of the web page classification method described above when executed by a processor.
The web page classification method provided by the invention obtains the web page URL features, tag features and page features of the training set, vectorizes them, and inputs them into a random forest model for training to obtain a classifier; the web page URL features, tag features and page features of the test set are then input into the classifier to obtain the web page classification result for the test set. The classification process is fully automated and requires no large manual effort. Because the model is trained on a rich set of web page structure features, a classifier with a more reasonable parameter configuration can be obtained, which improves the accuracy of web page classification, saves a large amount of manual time, and scales well. Since the classification process is fully automated, the type of a large number of web pages across the whole Internet can be distinguished quickly and effectively, which improves the efficiency with which people obtain key information from web pages.
Brief description of the drawings
Fig. 1 is a flowchart of the web page classification method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of extracting the tag features of the training set from the HTML source code of the training set, provided by an embodiment of the present invention;
Fig. 3 is an example diagram of a DOM tree provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of nested <div> and <p> tags provided by an embodiment of the present invention;
Fig. 5 is a flowchart of the web page classification method after the web page classification result is obtained, provided by an embodiment of the present invention;
Fig. 6 is a structural schematic diagram of a web page classification apparatus provided by an embodiment of the present invention;
Fig. 7 is a structural schematic diagram of the processing module provided by an embodiment of the present invention;
Fig. 8 is another structural schematic diagram of a web page classification apparatus provided by an embodiment of the present invention.
Detailed description of embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
Fig. 1 shows a flowchart of the web page classification method provided by an embodiment of the present invention. The web page classification method includes the following process:
Step S101: take the URL links and HTML source code of N acquired topic-type web pages and the URL links and HTML source code of N acquired list-type web pages as a training set.
In this embodiment, N is a positive integer; the larger N is, the larger the training set. For example, N may be 1000 or 2000. A topic-type web page is a web page that contains a large amount of content and has a clear topic. Topic-type web pages usually describe an event or a piece of information in detail; common examples are news pages, blog pages and forum pages. A list-type web page is a web page that contains many, densely packed hyperlinks. The hyperlinks point to other pages of the same website, and the text they carry is a brief summary of the page they point to. List-type web pages are typically the topic navigation pages and home pages of websites.
Note that URL is the abbreviation of Uniform Resource Locator, and HTML is the abbreviation of HyperText Markup Language.
Step S102: extract the web page URL features of the training set from the URL links of the training set, extract the tag features of the training set from the HTML source code of the training set, and take as the page features of the training set the number of URL links in the HTML source code of the training set whose similarity to the associated URL link is greater than a preset threshold, together with the size of the HTML source code.
Optionally, in step S102, extracting the web page URL features of the training set from the URL links of the training set includes the following process:
determining whether each URL link of the training set contains a time feature, a domain name feature and a negative feature, and taking the results as the web page URL features of the training set.
In this embodiment, the time feature may be a time string matched in the URL link by a time regular expression. Table 1 gives sample time-feature regular expressions; the time features they match include 2019-01-28, 2019-1-28, 01-28-2019, 2019-0128 and 20190128. The time feature may also be any other value obtained by matching a specific URL link against a time-feature regular expression, which is not limited here.
Table 1: Sample time-feature regular expressions
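The regular expressions of Table 1 are not reproduced in this text, so the following is only a sketch of what time-feature matching could look like. The patterns in `TIME_PATTERNS` and the function name `has_time_feature` are assumptions written to cover the five sample formats listed above; they are not the patent's own expressions.

```python
import re

# Assumed patterns covering the sample formats quoted in the text;
# the patent's actual Table 1 expressions are not available here.
TIME_PATTERNS = [
    # 2019-01-28, 2019-1-28
    re.compile(r"(?<!\d)(?:19|20)\d{2}-\d{1,2}-\d{1,2}(?!\d)"),
    # 01-28-2019
    re.compile(r"(?<!\d)\d{1,2}-\d{1,2}-(?:19|20)\d{2}(?!\d)"),
    # 2019-0128, 20190128
    re.compile(r"(?<!\d)(?:19|20)\d{2}-?(?:0[1-9]|1[0-2])(?:0[1-9]|[12]\d|3[01])(?!\d)"),
]


def has_time_feature(url: str) -> bool:
    """Return True if any assumed time pattern occurs in the URL."""
    return any(pattern.search(url) for pattern in TIME_PATTERNS)
```

The digit lookarounds keep the patterns from firing inside longer digit runs such as article IDs; whether the original expressions do the same is not stated.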
In this embodiment, a domain name feature may be a word with a specific indicative function, and the domain name features may be obtained by statistics over a large number of website URL links. The domain name is contained in the URL link; each domain name is unique and has no Chinese counterpart. For example, the domain name features may include the following words: news, tech, stock1, ent, sports, auto, finance, book, edu, comic, games, baby, astro, laby, change, www, mil, bj, eladies, business, money, it, digi, teamchina, yule, house, cul, learning, health, travel, women, nba, golf, weiqi, music, mobile, war, discover, history, jiankang, view, caozi, renjian, home, mobile.
In this embodiment, a negative feature is a feature that may adversely affect the classification result. For example, suffixes in a URL link that carry no particular meaning, such as "list", "tv", "video", "index" and "/", can serve as negative features. In the URL links www.xxxx.tv and www.xxxx.com/list, for instance, "tv" and "list" are negative features.
Optionally, determining whether each URL link of the training set contains a time feature, a domain name feature and a negative feature, and taking the results as the web page URL features of the training set, includes the following process:
checking whether each web page URL link of the training set contains a time feature, a domain name feature and a negative feature. If a time feature is present, the time feature is recorded as true; otherwise it is recorded as false. If a domain name feature is present, the domain name feature is recorded as true; otherwise it is recorded as false. If a negative feature is present, the negative feature is recorded as true; otherwise it is recorded as false. The recorded results are taken as the web page URL features of the training set.
In this way, the web page URL features can be obtained more accurately.
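The three true/false records can be sketched as one small function. This is an illustration under stated assumptions: `DOMAIN_WORDS` is only a subset of the word list above, `TIME_RE` is a simplified stand-in for the Table 1 expressions, and treating a trailing "/" or a top-level domain such as "tv" as a negative feature is one reading of the examples in the text. The name `url_features` is hypothetical.

```python
import re
from urllib.parse import urlparse

DOMAIN_WORDS = {"news", "tech", "sports", "finance", "edu"}  # subset, for illustration
NEGATIVE_WORDS = {"list", "tv", "video", "index"}
# Simplified, assumed time pattern (not the patent's Table 1 expressions).
TIME_RE = re.compile(r"(?<!\d)(?:19|20)\d{2}-?\d{1,2}-?\d{1,2}(?!\d)")


def url_features(url: str) -> dict:
    """Record the presence of time, domain name and negative features as booleans."""
    # urlparse only fills netloc when a scheme or leading '//' is present.
    parsed = urlparse(url if "://" in url else "//" + url)
    host_labels = parsed.netloc.lower().split(".")
    path = parsed.path.lower()
    last_segment = path.rstrip("/").rsplit("/", 1)[-1]
    has_negative = (
        path.endswith("/")
        or last_segment.split(".")[0] in NEGATIVE_WORDS
        or host_labels[-1] in NEGATIVE_WORDS
    )
    return {
        "time": bool(TIME_RE.search(url)),
        "domain": any(label in DOMAIN_WORDS for label in host_labels),
        "negative": has_negative,
    }
```

Note that the full word list in the text includes "www", so with the complete list most URLs would record the domain name feature as true; the subset here avoids that only for the sake of a readable example.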
Optionally, referring to Fig. 2, in step S102, extracting the tag features of the training set from the HTML source code of the training set includes the following process:
Step S1021: delete the noise tags in the HTML source code of the training set, together with the content corresponding to the noise tags, to obtain effective tags.
In this embodiment, a noise tag is a tag that does not help web page classification, such as the <head> tag or the <font> tag. The tags that need to be cleaned up are listed in Table 2, which shows common noise tags.
Table 2: Sample noise tags
<head>      <font>      <em>
<img>       <link>      <script>
<style>     <li>        <b>
<strong>    <noscript>  <ul>
<span>      <iframe>    <a>
<br>        <select>    <i>
<wbr>       <ins>
<nbsp>      <input>
An effective tag is a tag that helps web page classification, such as the <div>, <html>, <body>, <title>, <h1> to <h6> and <p> tags. Specifically: <div> is a layout control tag, mainly used for styling the page. <html> is the opening tag of an HTML page structure; every HTML document starts with it. <body> is the most important tag of the HTML page structure and is the body content tag; page content is generally placed between this pair of tags. <title> displays the unique title of the HTML page. <h1> to <h6> indicate headings of different importance in the HTML page structure. <p> is the paragraph tag and contains a large amount of text.
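Step S1021 can be sketched with Python's standard `html.parser` module. This is a minimal illustration that assumes reasonably well-formed HTML; the class name `NoiseStripper`, the function `strip_noise` and the `VOID_TAGS` set are assumptions for the example. Taken literally, removing <head> together with its content also removes <title>, which the text lists as an effective tag; the sketch follows the literal rule and does not resolve that point.

```python
from html.parser import HTMLParser

# Noise tags as listed in Table 2.
NOISE_TAGS = {"head", "font", "em", "img", "link", "script", "style", "li", "b",
              "strong", "noscript", "ul", "span", "iframe", "a", "br", "select",
              "i", "wbr", "ins", "nbsp", "input"}
# Elements without an end tag (assumed list, used to keep nesting depth correct).
VOID_TAGS = {"img", "link", "br", "wbr", "input", "meta", "hr"}


class NoiseStripper(HTMLParser):
    """Re-emit the document while skipping noise elements and their content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside a noise element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if tag not in NOISE_TAGS and self.skip_depth == 0:
                self.out.append(self.get_starttag_text())
            return
        if self.skip_depth or tag in NOISE_TAGS:
            self.skip_depth += 1
            return
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)


def strip_noise(html: str) -> str:
    parser = NoiseStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Pages with unclosed noise elements (for example <li> without </li>) would throw the depth counter off; a tolerant DOM parser would be a better base for real pages.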
Step S1022: obtain the tag features of the effective tags, the tag features including: tag number, tag text length, left tag length, right tag length, number of punctuation marks in the tag text, tag level, leaf tag merge count, non-leaf tag merge count and total tag merge count.
In this embodiment, before the tag features of the effective tags are obtained, the feature attributes of each tag are saved in preparation for extracting the tag features. Ten attributes are saved for each tag: tag name (tag_name), tag text content (tag_content), tag attributes (tag_attributes), preorder traversal number (tag_id), tag text length (tag_id), left tag length (tag_left_len), right tag length (tag_right_len), number of punctuation marks in the tag (tag_punct_num), level of the tag in the DOM tree (tag_tree_level), and whether the tag is a leaf node (leaf). These attributes can be stored in a list of the feature attributes to be saved for each tag; the form of the list is shown in Table 3, which is a sample of how the feature attributes of effective tags are saved. In this embodiment, the tag features of the effective tags can be determined from the saved feature attributes of each effective tag.
Table 3: Sample of saved feature attributes of effective tags
In this embodiment, the tag number is the number a tag receives when the DOM tree is searched from the root node using a preorder traversal; the root node is numbered 0.
In this embodiment, DOM is the abbreviation of Document Object Model, and the HTML DOM is the document object model dedicated to HTML/XHTML. The DOM represents HTML as a tree structure of tags, commonly called the DOM tree; its structure is shown in Fig. 3.
The tag text length is the total number of characters in the text nodes of the tag. The left tag length is the number of characters inside the angle brackets of the start tag. The right tag length is the number of characters inside the angle brackets of the end tag, not counting the slash. For example, for <div></div> the left and right tag lengths are both 3; for <div id="menu"></div> the left tag length is 13 and the right tag length is 3.
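The two worked examples above can be reproduced with a short sketch. The function name `tag_lengths` and the two regular expressions are assumptions for illustration; the sketch expects a single element whose markup starts with its start tag and takes the last end tag in the string as the matching one.

```python
import re

START_TAG = re.compile(r"<([A-Za-z][^<>]*)>")   # captures what is inside <...>
END_TAG = re.compile(r"</([A-Za-z][^<>]*)>")    # captures what follows the slash


def tag_lengths(element: str):
    """Return (left tag length, right tag length) for one element's markup."""
    start = START_TAG.match(element.strip())
    ends = END_TAG.findall(element)
    left = len(start.group(1)) if start else 0
    right = len(ends[-1]) if ends else 0
    return left, right
```

For `<div id="menu"></div>` the captured start-tag text is `div id="menu"`, which is 13 characters, matching the figure given in the text.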
The tag level is the depth of the tag node in the DOM tree and can be obtained by a level-order traversal of the tree. Referring to Fig. 3, if <html> is taken as level 0, the <title> tag in the figure is at level 2, and the <a> and <h1> tags, which are at the same depth as <title>, are also at level 2.
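The preorder tag number and the tag level can both be read off while parsing, because start tags occur in the document in preorder. The sketch below uses the standard `html.parser` module; the class name `TagIndexer`, the function `index_tags` and the `VOID_TAGS` set are assumptions, and the input is assumed to be well-formed.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "input", "link", "meta", "hr", "wbr"}  # no end tag


class TagIndexer(HTMLParser):
    """Record (tag number, tag name, level) for each start tag, root = 0."""

    def __init__(self):
        super().__init__()
        self.records = []
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        # The order of start tags in the source is the preorder of the DOM tree.
        self.records.append((len(self.records), tag, self.depth))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def index_tags(html: str):
    parser = TagIndexer()
    parser.feed(html)
    parser.close()
    return parser.records
```

On a page shaped like the tree in Fig. 3, <title> sits under <head> and <h1> under <body>, so both come out at level 2, in line with the description above.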
The leaf tag merge count is the number of merged nodes that are leaf nodes during the merging of identical nodes in the tree. The non-leaf tag merge count is the number of merged nodes that are not leaf nodes. The total tag merge count is the sum of the leaf tag merge count and the non-leaf tag merge count.
Merging tags: when the DOM tree represents a web page it is composed of many nodes, and these tag nodes may be repeated or nested. Referring to Fig. 4, the <div> and <p> tags are repeated. Such identical tag nodes serve the same function and share the same characteristics; they are copied in a regular way and their levels are clear. Therefore, after the tags and their attributes have been selected, the tags are merged, which strengthens the tag features and has a positive effect on the classification result.
Merge procedure: merging is a loop that ends once the attribute values of all selected tags have been extracted. One iteration proceeds as follows. First determine whether the current tag is a feature tag; if not, continue with the next iteration. If it is, there are two cases. If the tag appears for the first time, its 9 attributes are assigned. If it does not appear for the first time, there are again two cases. In the first, the current tag and the existing tag of the same name are not on the same level of the tree; the current tag is then treated as a new tag appearing for the first time and its 9 attributes are assigned. To distinguish such same-name tags in the array, this embodiment appends the level number to the name of same-name tags on different levels; for example, a p tag on level 10 is written p_10. In the second, the current tag and the existing tag of the same name are on the same level of the tree; the two are regarded as the same tag and merged, that is, the attribute values of the current tag are added to those of the existing tag.
Leaf tag: a node that contains only text content and no other tags, for example: <div>It is raining today</div>.
Non-leaf tag: for example:
<div>
<p>It is raining today</p>
</div>
In this example, <div> is not a leaf tag, whereas <p> is a leaf tag.
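The merge procedure can be sketched as a dictionary keyed by tag name and level, in the spirit of the p_10 example. This is a simplified illustration: the function name `merge_tags` and the record fields (`name`, `level`, `text_len`, `punct_num`, `leaf`) are assumptions, only two numeric attributes are accumulated rather than all nine, and every input tag is assumed to have already passed the feature-tag check.

```python
def merge_tags(tags):
    """Merge same-name tags on the same tree level by summing their attributes."""
    merged = {}
    for tag in tags:
        key = f"{tag['name']}_{tag['level']}"  # e.g. "p_10"
        if key not in merged:
            # First occurrence on this level: assign the attributes.
            merged[key] = {
                "text_len": tag["text_len"],
                "punct_num": tag["punct_num"],
                "leaf_merges": 0,
                "nonleaf_merges": 0,
            }
            continue
        # Same name and same level: add the current values to the existing entry.
        entry = merged[key]
        entry["text_len"] += tag["text_len"]
        entry["punct_num"] += tag["punct_num"]
        if tag["leaf"]:
            entry["leaf_merges"] += 1
        else:
            entry["nonleaf_merges"] += 1
    for entry in merged.values():
        entry["total_merges"] = entry["leaf_merges"] + entry["nonleaf_merges"]
    return merged
```

Keying every entry by name and level, including first occurrences, is a simplification of the text, which appends the level only when same-name tags appear on different levels.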
Step S1023: rank the effective tags using a recursive elimination algorithm, and select the R highest-ranked tags as retained tags.
In this step, recursive feature elimination is used to further filter the tags retained in S1022: the tags with the least influence on the classification result are discarded, and the 11 highest-ranked tags are selected as the finally retained tags, namely <div>, <html>, <body>, <title>, <h1>, <h2>, <h3>, <h4>, <h5>, <h6> and <p>.
Step S1024: where the retained tags contain no tags of the same type, take the tag features of the retained tags as the tag features of the training set.
Step S1025: where the retained tags contain tags of the same type, merge the tags of the same type among the retained tags, determine the tag features of the merged same-type tags, and take the tag features of the retained tags other than the same-type tags, together with the determined tag features, as the tag features of the training set.
In this embodiment, merging the tags of the same type among the retained tags may include the following step: using the tag text length of the same-type tags, find the tag with the larger amount of text and take the tag features of that tag as the tag features of the same-type tags. For example, if two <p> tags remain, the <p> tag with the larger amount of text, as determined by tag text length, is taken as the final <p> feature tag.
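The selection rule in this paragraph can be sketched in a few lines. The function name `merge_same_type` and the record fields `name` and `text_len` are assumptions for illustration; ties are resolved in favor of the tag seen first, which the text does not specify.

```python
def merge_same_type(retained):
    """Keep, for each tag name, the retained tag with the largest text length."""
    best = {}
    for tag in retained:
        current = best.get(tag["name"])
        if current is None or tag["text_len"] > current["text_len"]:
            best[tag["name"]] = tag
    return list(best.values())
```

With two <p> records of text length 10 and 40, only the second survives and its features stand in for the <p> type as a whole.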
In step S102, taking as the page features of the training set the number of URL links in the HTML source code of the training set whose similarity to the associated URL link is greater than a preset threshold, together with the size of the HTML source code, includes the following process:
counting the hyperlinks in the HTML source code of the current page whose URL link similarity to the URL of the current page is greater than the preset threshold, and taking this count together with the size of the page's HTML source code as the page features.
The preset threshold may be a default value or a user-defined value; for example, the preset threshold is 0.5.
In this embodiment, the URL link similarity is calculated as follows: the two URLs are matched character by character from left to right, and the number of identical characters matched is divided by the length of the longer of the two URLs.
It is understood that the html source code of theme type webpage such as news web page is bigger, list type webpage such as theme The html source code of navigation page is smaller.URL in the html source code of theme type webpage such as news web page with Present News webpage Link similarity is greater than the hyperlink small number of preset threshold, in the html source code of list type webpage such as theme navigation page with The hyperlink quantity that the URL link similarity of current topic navigation page is greater than preset threshold is relatively more.In the present embodiment, pass through Count the hyperlink quantity and net for being greater than preset threshold in current web page html source code with the URL link similarity of current web page Webpage can be improved using page feature as an assessment parameter of Web page classifying as page feature in page html source code size The accuracy of classification.
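By way of example only, the page features can be sketched as below. Several points are assumptions rather than statements of the embodiment: the similarity is read as the length of the shared leading run of characters (a position-by-position count over the whole URL is another possible reading), the source code size is measured in UTF-8 bytes, relative links are not resolved, and all function and class names are hypothetical.

```python
from html.parser import HTMLParser


def url_similarity(a, b):
    """Shared leading characters divided by the length of the longer URL."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    matched = 0
    for x, y in zip(a, b):
        if x != y:
            break  # assumption: matching stops at the first difference
        matched += 1
    return matched / longest


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def page_features(page_url, html, threshold=0.5):
    """Return (number of similar hyperlinks, HTML size in bytes)."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    similar = sum(1 for link in collector.links
                  if url_similarity(page_url, link) > threshold)
    return similar, len(html.encode("utf-8"))
```

With the example threshold of 0.5, a link such as http://a.com/x1 on the page http://a.com/x0 is counted, whereas a link to a different host is not.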
Step S103: vectorize the web page URL features, the tag features and the page features of the training set, and input the vectorized web page URL features, vectorized tag features and vectorized page features into a random forest algorithm model for training, to obtain a classifier.
In the present embodiment, random forest is an algorithm that integrates multiple trees following the idea of ensemble learning. Its basic unit is the decision tree, and it belongs to ensemble learning (Ensemble Learning), a major branch of machine learning.
In the present embodiment, the web page URL features comprise the time, domain name and negative features. The page features comprise: the number of hyperlinks in the web page's HTML source code whose URL link similarity to the web page is greater than the preset threshold, and the size of the web page's HTML source code. The tag features comprise the 11 tags extracted here, such as <div> and <html>, with 9 attributes per tag. All feature attributes are expressed numerically, so the vectorization process can use the relevant feature attributes directly.
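One possible layout of the resulting feature vector is sketched below: 3 URL flags, 2 page features, and 11 tags with 9 attributes each, giving 104 values. The ordering, the function name vectorize, and the zero-filling of tags absent from a page are assumptions made for illustration.

```python
RETAINED_TAGS = ["div", "html", "body", "title",
                 "h1", "h2", "h3", "h4", "h5", "h6", "p"]
ATTRS_PER_TAG = 9


def vectorize(url_flags, page_feats, tag_feats):
    """Concatenate URL flags, page features and per-tag attributes."""
    vector = [1.0 if flag else 0.0 for flag in url_flags]
    vector.extend(float(value) for value in page_feats)
    for name in RETAINED_TAGS:
        # Assumption: a tag missing from the page contributes zeros.
        attrs = tag_feats.get(name, [0] * ATTRS_PER_TAG)
        if len(attrs) != ATTRS_PER_TAG:
            raise ValueError(f"{name}: expected {ATTRS_PER_TAG} attributes")
        vector.extend(float(value) for value in attrs)
    return vector


example = vectorize(
    (True, False, True),                      # time, domain name, negative
    (12, 34567),                              # similar links, HTML size
    {"p": [6, 120, 1, 1, 8, 3, 2, 0, 2]},     # hypothetical attribute values
)
```

A list of such vectors, one per web page, is the kind of input a random forest implementation would typically accept for training.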
Step S104: take the URL links and HTML source code of M acquired theme-type web pages and the URL links and HTML source code of M acquired list-type web pages as the test set.
In the present embodiment, M is a positive integer; the larger M is, the more data the test set contains. For example, M may take values such as 500 or 300.
Step S105: obtain the web page URL features, tag features and page features of the test set, and input them into the classifier to obtain the web page classification result.
Optionally, step S105 may comprise steps of:
Extract the web page URL features of the test set according to the URL links of the test set, extract the tag features of the test set according to the HTML source code of the test set, and take the number of URL links in the HTML source code of the test set whose similarity to the associated URL link is greater than the preset threshold, together with the size of the HTML source code, as the page features of the test set.
Optionally, extracting the web page URL features of the test set according to the URL links of the test set comprises the following process:
Judge whether the URL links of the test set contain time features, domain name features and negative features to obtain a first judgment result, and take the first judgment result as the web page URL features of the test set.
Optionally, judging whether the URL links of the test set contain time features, domain name features and negative features to obtain the first judgment result, and taking the first judgment result as the web page URL features of the test set, comprises the following process:
Check whether the web page URL links of the test set contain time features, domain name features and other negative features. If a time feature is present, record the time feature as true; if not, record it as false. If a domain name feature is present, record the domain name feature as true; if not, record it as false. If a negative feature is present, record the negative feature as true; if not, record it as false. Take the recorded results as the web page URL features of the test set.
Optionally, extracting the tag features of the test set according to the HTML source code of the test set comprises the following process:
Delete the noise tags in the HTML source code of the test set together with the content corresponding to the noise tags, to obtain the effective tags;
Obtain the tag features of the effective tags, the tag features comprising: tag sequence number, tag text length, left tag length, right tag length, number of punctuation marks in the tag text, tag level, leaf tag merge count, non-leaf tag merge count and total tag merge count;
Rank the effective tags using a recursive elimination algorithm, and select the R highest-ranked tags as the retained tags;
Where the retained tags contain no tags of the same type, use the tag features of the retained tags as the tag features of the test set;
Where the retained tags contain tags of the same type, merge the tags of the same type, determine the tag features of the merged tags, and use the tag features of the remaining retained tags together with the determined tag features as the tag features of the test set.
Optionally, referring to Fig. 5, after step S105 the web page classification method further comprises the following process:
Step S106: judge whether the classification precision and the recall of the web page classification result are greater than the preset thresholds.
In the present embodiment, a preset threshold may be a system default value or a user-defined value. The preset threshold for classification precision and the preset threshold for recall may be different values, which is not limited here.
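The check in step S106 can be illustrated as follows. The source does not define how classification precision is computed, so this sketch assumes the conventional per-class precision and recall; the function names and the example labels are hypothetical.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def meets_thresholds(precision, recall, min_precision, min_recall):
    """Both values must be strictly greater than their own threshold."""
    return precision > min_precision and recall > min_recall


truth = ["theme", "theme", "list", "list"]
predicted = ["theme", "list", "list", "list"]
scores = precision_recall(truth, predicted, positive="theme")
```

Separate thresholds are passed for precision and recall because the embodiment allows the two preset thresholds to differ.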
Step S107: where the classification precision and the recall of the web page classification result are greater than the preset thresholds, take the web page classification result as the final result; where the classification precision and the recall of the web page classification result are less than or equal to the preset thresholds, adjust the configuration parameters of the classifier until a web page classification result whose classification precision and recall are greater than the preset thresholds is obtained.
In the present embodiment, adjusting the configuration parameters of the classifier essentially means repeatedly changing the parameters of the random forest algorithm model and retraining the classifier until a certain set of parameters gives a good classification effect. The random forest algorithm model then tests other test sets with the same set of parameters. If a good classification effect is obtained again, that is, a web page classification result whose classification precision and recall are greater than the preset thresholds, the test ends, the relevant parameters are recorded, and the random forest algorithm model configured with the recorded parameters serves as the classifier for subsequent web page classification. Otherwise, the preceding model training process is repeated until a relatively good classification effect is obtained. In the present embodiment, the random forest parameters of the random forest algorithm model include: the maximum number of features considered when a decision tree splits, max_features; the maximum depth of a decision tree, max_depth; the minimum number of samples required to split an internal node, min_samples_split; the minimum number of samples at a leaf node, min_samples_leaf; the minimum sum of sample weights at a leaf node, min_weight_fraction_leaf; the maximum number of leaf nodes, max_leaf_nodes; the minimum impurity for splitting a node, min_impurity_split; and so on. Because the classifier is essentially the optimized model obtained by adjusting the parameters of the random forest algorithm model, further optimizing the parameters according to whether the classification precision and recall are less than or equal to the preset thresholds can effectively improve the classification effect of the classifier.
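The adjustment loop described above may be sketched as a simple search over candidate parameter sets. The source does not prescribe a search strategy, so the exhaustive grid here is an assumption, and evaluate is a hypothetical callable that would train the random forest with the given parameters and return the precision and recall measured on a test set; fake_evaluate merely stands in for it.

```python
from itertools import product


def tune(param_grid, evaluate, min_precision, min_recall):
    """Return the first parameter set whose scores exceed both thresholds."""
    names = list(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        precision, recall = evaluate(params)
        if precision > min_precision and recall > min_recall:
            return params, precision, recall
    return None  # no candidate was good enough


def fake_evaluate(params):
    # Stand-in for training and testing; deeper trees score better here.
    return (0.95, 0.93) if params["max_depth"] >= 10 else (0.80, 0.70)


grid = {"max_depth": [5, 10], "min_samples_leaf": [1, 2]}
result = tune(grid, fake_evaluate, min_precision=0.9, min_recall=0.9)
```

The recorded parameter set would then be checked against further test sets before being kept, as the paragraph above describes.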
The web page classification method provided by the present invention obtains the web page URL features, tag features and page features of a training set, inputs the vectorized web page URL features, tag features and page features of the training set into a random forest algorithm model for training to obtain a classifier, and inputs the web page URL features, tag features and page features of a test set into the classifier to obtain the web page classification result of the test set. No large manual investment is required, and a fully automated web page classification process is realized. Training the random forest algorithm model on the vectorized web page URL features, tag features and page features of the training set allows a classifier with a more reasonable parameter configuration to be obtained from rich web page structure features, which in turn improves the accuracy of web page classification, saves a large amount of manual time, and offers high scalability. Because the web page classification process is fully automated, the types of the large number of web pages across the whole network can be distinguished quickly and effectively, improving the efficiency with which people obtain key information from web pages.
Fig. 6 shows a schematic structural diagram of a web page classification device 600 provided by an embodiment of the present invention. For ease of description, only the parts relevant to the implementation of the present invention are shown. The web page classification device 600 comprises:
A first processing module 601, configured to take the URL links and HTML source code of N acquired theme-type web pages and the URL links and HTML source code of N acquired list-type web pages as the training set.
In the present embodiment, N is a positive integer; the larger N is, the larger the training set. For example, N may be 2000 or 1000. A theme-type web page is a web page that contains a relatively large amount of content and has a definite subject. Theme-type web pages usually describe an event or a piece of information in detail; common examples are news pages, blog pages and forum pages. A list-type web page is a web page that contains many, densely arranged hyperlinks. The hyperlinks in such a page point to other web pages of the same website, and the text they carry is a brief summary of the page pointed to. List-type web pages are mainly the theme navigation pages and home pages of websites. As a supplementary note, URL is the abbreviation of Uniform Resource Locator, and HTML is the abbreviation of Hyper Text Markup Language.
A processing module 602, configured to extract the web page URL features of the training set according to the URL links of the training set, extract the tag features of the training set according to the HTML source code of the training set, and take the number of URL links in the HTML source code of the training set whose similarity to the associated URL link is greater than the preset threshold, together with the size of the HTML source code, as the page features of the training set.
Optionally, the processing module 602 is further configured to judge whether the URL links of the training set contain time features, domain name features and negative features, and to take the judgment result as the web page URL features of the training set.
In the present embodiment, the time feature may be a time feature matched from the URL link by a time regular expression. Referring to Table 1, the time features matched by the time feature expression include 2019-01-28, 2019-1-28, 01-28-2019, 2019-0128 and 20190128. Matching the time feature regular expression against a specific URL link may also yield other values, which is not limited here.
In the present embodiment, a domain name feature may be a word with a specific indicative function, and the domain name features may be obtained by statistics over a large number of website URL links. The domain name is contained in the URL link; each domain name is unique and has no corresponding Chinese form. For example, the domain name features may include the following words: news, tech, stock1, ent, sports, auto, finance, book, edu, comic, games, baby, astro, laby, change, www, mil, bj, eladies, business, money, it, digi, teamchina, yule, house, cul, learning, health, travel, women, nba, golf, weiqi, music, mobile, war, discover, history, jiankang, view, caozi, renjian, home, mobile.
In the present embodiment, a negative feature may be a feature that adversely affects the classification result. For example, suffixes in a URL link that carry no particular meaning, such as "list", "tv", "video", "index" and "/", can serve as negative features. For instance, in the URL link www.xxxx.tv or www.xxxx.com/list, "tv" and "list" are negative features.
Optionally, the processing module 602 is further configured to check whether the web page URL links of the training set contain time features, domain name features and other negative features. If a time feature is present, the time feature is recorded as true; if not, it is recorded as false. If a domain name feature is present, the domain name feature is recorded as true; if not, it is recorded as false. If a negative feature is present, the negative feature is recorded as true; if not, it is recorded as false. The recorded results are taken as the web page URL features of the training set.
In this way, the web page URL features can be obtained more accurately.
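The three true/false records can be sketched as follows. The regular expression is a hypothetical pattern written to cover the five date forms listed above (the actual expression of Table 1 is not reproduced here), DOMAIN_WORDS is only a subset of the listed words, and treating the last dot- or slash-separated token as the suffix is an assumption.

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern for 2019-01-28, 2019-1-28, 01-28-2019, 2019-0128, 20190128.
TIME_RE = re.compile(
    r"(?<!\d)(?:(?:19|20)\d{2}-?\d{1,2}-?\d{1,2}"
    r"|\d{1,2}-\d{1,2}-(?:19|20)\d{2})(?!\d)"
)
DOMAIN_WORDS = {"news", "tech", "sports", "finance", "edu"}  # subset only
NEGATIVE_WORDS = {"list", "tv", "video", "index"}


def url_features(url):
    """Return (has_time, has_domain_word, has_negative) for one URL link."""
    # urlparse only fills in the host when a scheme or leading // is present.
    parsed = urlparse(url if "://" in url else "//" + url)
    host = parsed.hostname or ""
    has_time = TIME_RE.search(url) is not None
    has_domain = any(label in DOMAIN_WORDS for label in host.split("."))
    tail = host + parsed.path
    # Assumption: the suffix is the last token after splitting on "/" and ".".
    has_negative = (tail.endswith("/")
                    or re.split(r"[/.]", tail)[-1] in NEGATIVE_WORDS)
    return has_time, has_domain, has_negative
```

For the two links given in the text, www.xxxx.tv and www.xxxx.com/list, only the negative feature is recorded as true.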
Optionally, the processing module 602 further comprises:
A deletion submodule 6021, configured to delete the noise tags in the HTML source code of the training set together with the content corresponding to the noise tags, to obtain the effective tags.
In the present embodiment, noise tags are tags that cannot contribute positively to web page classification, such as the <head> tag and the <font> tag. The tags to be cleaned up may specifically include the noise tags in Table 2.
Effective tags are tags that contribute positively to web page classification, such as <div>, <html>, <body>, <title>, <h1> to <h6>, and <p>. Among them, <div> is a layout control tag, mainly used to beautify the web page. <html> is the start tag of an HTML page structure; every HTML document begins with this tag. <body> is the most important tag of the HTML page structure and is also the body content tag; the page content is generally placed inside this tag pair. <title> displays the unique title of the HTML page structure. <h1> to <h6> denote headings of different importance in the HTML page structure. <p> is the paragraph tag and contains a large amount of text.
An acquisition submodule 6022, configured to obtain the tag features of the effective tags, the tag features comprising: tag sequence number, tag text length, left tag length, right tag length, number of punctuation marks in the tag text, tag level, leaf tag merge count, non-leaf tag merge count and total tag merge count.
In the present embodiment, before the tag features of the effective tags are obtained, the positive feature attributes of the standard tags are saved in preparation for extracting the tag features. Each tag has 10 positive feature attributes to be saved: the tag name (tag_name), the tag text content (tag_content), the attributes of the tag (tag_attributes), the preorder traversal sequence number (tag_id), the tag text length (tag_id), the left tag length (tag_left_len), the right tag length (tag_right_len), the number of punctuation marks in the tag (tag_punct_num), the level of the tag in the DOM tree (tag_tree_level), and whether the tag is a leaf node (leaf). The positive feature attributes may be stored in a list of the feature attributes to be saved for each tag; the form of the list is shown in Table 3. In the present embodiment, the tag features of each effective tag can be determined from its positive feature attributes.
In the present embodiment, the tag sequence number is the number assigned to the tag when the DOM tree is searched from the root node using a preorder traversal strategy; the root node is numbered 0.
In the present embodiment, DOM is the abbreviation of Document Object Model, and the HTML DOM is the document object model specifically for HTML/XHTML. The DOM represents HTML as a tree structure of tags, which is the commonly mentioned DOM tree; the specific structure of a DOM tree is shown in Fig. 3.
The tag text length is the length of all characters in the text nodes inside the tag. The left tag length is the length of all characters contained in the angle brackets of the start tag. The right tag length is the length of all characters contained in the angle brackets of the end tag. For example, for <div></div>, the left and right tag lengths are both 3; for <div id="menu"></div>, the left tag length is 13 and the right tag length is 3.
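The two worked examples can be reproduced with the short sketch below. Since the stated right tag length of </div> is 3, the sketch assumes the slash of the end tag is not counted; it also assumes no attribute value contains an angle bracket. The function name tag_lengths is hypothetical.

```python
import re


def tag_lengths(element):
    """Return (left, right) tag lengths for one paired element string."""
    start = re.match(r"\s*<([^<>/][^<>]*)>", element)
    end = re.search(r"</([^<>]*)>\s*$", element)
    if start is None or end is None:
        raise ValueError("expected a paired element such as <div></div>")
    # Characters inside the angle brackets; the end-tag slash is excluded.
    return len(start.group(1)), len(end.group(1))
```

Calling tag_lengths('<div id="menu"></div>') yields (13, 3), matching the example above.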
The tag level is the depth of the tag node in the DOM tree and can be obtained by a level-order traversal of the tree. Referring to Fig. 3, if <html> is taken as layer 0, the <title> tag in the figure is at layer 2, and the <a> tag and the <h1> tag, which are at the same depth as <title>, are also at layer 2.
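As an illustration of the sequence number and the level together, the sketch below numbers start tags in the order they are encountered, which for well-nested HTML is the preorder of the DOM tree, and tracks depth with a counter instead of a separate level-order pass. The names DomIndexer and index_tags are hypothetical, and void elements written without a closing slash (such as a bare <br>) are not handled.

```python
from html.parser import HTMLParser


class DomIndexer(HTMLParser):
    """Assign each start tag a preorder number and a depth level."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.records = []  # (tag name, preorder number, level)

    def handle_starttag(self, tag, attrs):
        # Document order of start tags equals preorder; the root gets 0.
        self.records.append((tag, len(self.records), self.depth))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


def index_tags(html):
    indexer = DomIndexer()
    indexer.feed(html)
    indexer.close()
    return indexer.records


page = ("<html><head><title>t</title></head>"
        "<body><h1>x</h1><div><p>today rains</p></div></body></html>")
records = index_tags(page)
```

With <html> at layer 0, both <title> and <h1> come out at layer 2, in line with the description of Fig. 3.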
The leaf tag merge count is the number of leaf nodes involved while identical nodes in the tree are merged. The non-leaf tag merge count is the number of non-leaf nodes involved in that merging. The total tag merge count is the sum of the leaf tag merge count and the non-leaf tag merge count.
Merging tags: when representing a web page, the DOM tree is composed of multiple nodes, and these tag nodes may be repeated or nested. Referring to Fig. 4, the <div> tag and the <p> tag are repeated. Such identical tag nodes perform the same function and share the same characteristics; their duplication is regular and their levels are distinct. Therefore, after the tags and their corresponding attributes have been selected, the tags are merged to strengthen the tag features, which has a positive influence on the classification result.
Merge operation process: the merge operation is a loop that ends once the attribute values of all selected tags have been extracted. One iteration is as follows. First, judge whether the current tag is a feature tag; if not, continue with the next iteration. If so, there are two cases. If the tag appears for the first time, assign values to the 9 attributes of the tag. If it does not appear for the first time, there are again two cases. In the first, the current tag and the existing identical tag are not on the same layer of the tree; the current tag is then treated as a new tag appearing for the first time, and values are assigned to its 9 attributes. Because identical tags in this situation must be distinguished in the array, the present embodiment appends the layer number to identical tags on different layers; for example, a p tag on the 10th layer is denoted p_10, so that the tags can be effectively distinguished. In the second, the current tag and the existing identical tag are on the same layer of the tree; the two are regarded as the same tag and are merged by adding the attribute values of the current tag to those of the existing tag.
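The loop described above can be sketched as follows. Several details are assumptions, not statements of the embodiment: every key carries the layer suffix (not only tags that occur on several layers), only the length and punctuation values are summed while the sequence number of the first occurrence is kept, the first occurrence is not itself counted as a merge, and the dictionary keys are hypothetical.

```python
SUMMED = ("text_len", "left_len", "right_len", "punct_num")


def merge_tags(tags, feature_names):
    """Merge same-name tags on the same layer by summing their values."""
    merged = {}
    for tag in tags:
        if tag["name"] not in feature_names:
            continue  # not a feature tag: go on to the next iteration
        key = f'{tag["name"]}_{tag["level"]}'  # e.g. p_10 for <p> on layer 10
        entry = merged.get(key)
        if entry is None:
            # First occurrence on this layer: assign its 9 attributes.
            entry = {field: tag[field] for field in SUMMED}
            entry.update(tag_id=tag["tag_id"], level=tag["level"],
                         leaf_merges=0, nonleaf_merges=0, total_merges=0)
            merged[key] = entry
            continue
        for field in SUMMED:
            entry[field] += tag[field]
        entry["leaf_merges" if tag["leaf"] else "nonleaf_merges"] += 1
        entry["total_merges"] += 1
    return merged


def make_tag(name, tag_id, level, text_len, leaf=True):
    # Helper for the example only; lengths of 1 are placeholder values.
    return {"name": name, "tag_id": tag_id, "level": level, "leaf": leaf,
            "text_len": text_len, "left_len": 1, "right_len": 1, "punct_num": 0}


merged = merge_tags(
    [make_tag("p", 6, 3, 11), make_tag("p", 7, 3, 5),
     make_tag("p", 9, 4, 2), make_tag("span", 10, 4, 8)],
    feature_names={"p"},
)
```

Here the two <p> tags on layer 3 collapse into one entry p_3, while the <p> on layer 4 is kept apart as p_4 and the <span> is skipped.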
Leaf tag: a tag whose node contains only text content and no other tags, for example: <div>today rains</div>.
Non-leaf tag: a tag that contains other tags, for example:
<div>
<p>today rains</p>
</div>
In this example, <div> is not a leaf tag, whereas <p> is a leaf tag.
A sorting submodule 6023, configured to rank the effective tags using a recursive elimination algorithm and select the R highest-ranked tags as the retained tags.
Here, a recursive feature elimination algorithm further filters the tags retained by the acquisition submodule 6022: the tags with the least influence on the classification result are rejected, and the 11 highest-ranked tags are selected as the finally retained tags, namely the <div>, <html>, <body>, <title>, <h1>, <h2>, <h3>, <h4>, <h5>, <h6> and <p> tags.
A first processing submodule 6024, configured to use the tag features of the retained tags as the tag features of the training set where the retained tags contain no tags of the same type.
A second processing submodule 6025, configured to, where the retained tags contain tags of the same type, merge the tags of the same type, determine the tag features of the merged tags, and use the tag features of the remaining retained tags together with the determined tag features as the tag features of the training set.
In the present embodiment, merging the tags of the same type among the retained tags may comprise the following step: according to the tag text length of the tags of the same type, find the tag with the larger amount of text, and select it as the feature tag whose tag features represent that type. For example, if two <p> tags exist among the remaining tags, the <p> tag with the larger amount of text, as measured by tag text length, is taken as the final <p> feature tag.
In the present embodiment, the processing module 602 counts the number of hyperlinks in the HTML source code of the current web page whose URL link similarity to the current web page is greater than the preset threshold, and takes this number together with the size of the web page's HTML source code as the page features.
The preset threshold may be a default value or a user-defined value; for example, the preset threshold is 0.5.
In the present embodiment, the URL link similarity is calculated as follows: the two URLs are matched character by character from left to right, and the number of identical characters is divided by the length of the longer of the two URLs.
It can be understood that the HTML source code of a theme-type web page such as a news page is relatively large, while that of a list-type web page such as a theme navigation page is relatively small. In the HTML source code of a theme-type web page such as a news page, relatively few hyperlinks have a URL link similarity to the current news page above the preset threshold; in the HTML source code of a list-type web page such as a theme navigation page, relatively many hyperlinks have a URL link similarity to the current theme navigation page above the preset threshold. In the present embodiment, the number of such hyperlinks and the size of the web page's HTML source code are counted as the page features, and using the page features as an evaluation parameter for web page classification can improve the accuracy of the classification.
A training module 603, configured to vectorize the web page URL features, the tag features and the page features of the training set, and to input the vectorized web page URL features, vectorized tag features and vectorized page features into a random forest algorithm model for training, to obtain a classifier.
In the present embodiment, random forest is an algorithm that integrates multiple trees following the idea of ensemble learning. Its basic unit is the decision tree, and it belongs to ensemble learning (Ensemble Learning), a major branch of machine learning.
In the present embodiment, the web page URL features comprise the time, domain name and negative features. The page features comprise: the number of hyperlinks in the web page's HTML source code whose URL link similarity to the web page is greater than the preset threshold, and the size of the web page's HTML source code. The tag features comprise the 11 tags extracted here, such as <div> and <html>, with 9 attributes per tag. All feature attributes are expressed numerically, so the vectorization process can use the relevant feature attributes directly.
A second acquisition module 604, configured to take the URL links and HTML source code of M acquired theme-type web pages and the URL links and HTML source code of M acquired list-type web pages as the test set.
In the present embodiment, M is a positive integer; the larger M is, the more data the test set contains. For example, M may take values such as 500 or 300.
The second acquisition module 604 is configured to obtain the web page URL features, tag features and page features of the test set, and to input the web page URL features, tag features and page features of the test set into the classifier to obtain the web page classification result.
Optionally, the second acquisition module 604 is further configured to extract the web page URL features of the test set according to the URL links of the test set, extract the tag features of the test set according to the HTML source code of the test set, and take the number of URL links in the HTML source code of the test set whose similarity to the associated URL link is greater than the preset threshold, together with the size of the HTML source code, as the page features of the test set.
Optionally, the second acquisition module 604 is further configured to judge whether the URL links of the test set contain time features, domain name features and negative features to obtain a first judgment result, and to take the first judgment result as the web page URL features of the test set.
Optionally, the second acquisition module 604 is further configured to check whether the web page URL links of the test set contain time features, domain name features and other negative features. If a time feature is present, the time feature is recorded as true; if not, it is recorded as false. If a domain name feature is present, the domain name feature is recorded as true; if not, it is recorded as false. If a negative feature is present, the negative feature is recorded as true; if not, it is recorded as false. The recorded results are taken as the web page URL features of the test set.
Optionally, the second acquisition module 604 is further configured to delete the noise tags in the HTML source code of the test set together with the content corresponding to the noise tags, to obtain the effective tags;
Obtain the tag features of the effective tags, the tag features comprising: tag sequence number, tag text length, left tag length, right tag length, number of punctuation marks in the tag text, tag level, leaf tag merge count, non-leaf tag merge count and total tag merge count;
Rank the effective tags using a recursive elimination algorithm, and select the R highest-ranked tags as the retained tags;
Where the retained tags contain no tags of the same type, use the tag features of the retained tags as the tag features of the test set;
Where the retained tags contain tags of the same type, merge the tags of the same type, determine the tag features of the merged tags, and use the tag features of the remaining retained tags together with the determined tag features as the tag features of the test set.
Optionally, referring to Fig. 7, the web page classification device further comprises:
A judgment module 606, configured to judge whether the classification precision and the recall of the web page classification result are greater than the preset thresholds.
In the present embodiment, a preset threshold may be a system default value or a user-defined value. The preset threshold for classification precision and the preset threshold for recall may be different values, which is not limited here.
An adjustment module 607, configured to take the web page classification result as the final result where the classification precision and the recall of the web page classification result are greater than the preset thresholds, and, where the classification precision and the recall of the web page classification result are less than or equal to the preset thresholds, to adjust the configuration parameters of the classifier until a web page classification result whose classification precision and recall are greater than the preset thresholds is obtained.
In the present embodiment, adjusting the configuration parameters of the classifier essentially means repeatedly changing the parameters of the random forest algorithm model and retraining the classifier until a certain set of parameters gives a good classification effect. The random forest algorithm model then tests other test sets with the same set of parameters. If a good classification effect is obtained again, that is, a web page classification result whose classification precision and recall are greater than the preset thresholds, the test ends, the relevant parameters are recorded, and the random forest algorithm model configured with the recorded parameters serves as the classifier for subsequent web page classification. Otherwise, the preceding model training process is repeated until a relatively good classification effect is obtained. In the present embodiment, the random forest parameters of the random forest algorithm model include: the maximum number of features considered when a decision tree splits, max_features; the maximum depth of a decision tree, max_depth; the minimum number of samples required to split an internal node, min_samples_split; the minimum number of samples at a leaf node, min_samples_leaf; the minimum sum of sample weights at a leaf node, min_weight_fraction_leaf; the maximum number of leaf nodes, max_leaf_nodes; the minimum impurity for splitting a node, min_impurity_split; and so on. Because the classifier is essentially the optimized model obtained by adjusting the parameters of the random forest algorithm model, further optimizing the parameters according to whether the classification precision and recall are less than or equal to the preset thresholds can effectively improve the classification effect of the classifier.
The web page classification device provided by the present invention obtains the web page URL features, tag features and page features of a training set, inputs the vectorized web page URL features, tag features and page features of the training set into a random forest algorithm model for training to obtain a classifier, and inputs the web page URL features, tag features and page features of a test set into the classifier to obtain the web page classification result of the test set. No large manual investment is required, and a fully automated web page classification process is realized. Training the random forest algorithm model on the vectorized web page URL features, tag features and page features of the training set allows a classifier with a more reasonable parameter configuration to be obtained from rich web page structure features, which in turn improves the accuracy of web page classification, saves a large amount of manual time, and offers high scalability. Because the web page classification process is fully automated, the types of the large number of web pages across the whole network can be distinguished quickly and effectively, improving the efficiency with which people obtain key information from web pages.
An embodiment of the present invention provides a computer installation. The computer installation includes a processor, and the processor is configured to implement the steps of the web page classification method provided by the above method embodiments when executing a computer program in a memory.
Illustratively, the computer program can be divided into one or more modules, which are stored in the memory and executed by the processor to carry out the present invention. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments describe the execution of the computer program in the computer installation. For example, the computer program can be divided into the steps of the web page classification method provided by the above method embodiments.
Those skilled in the art will understand that the above description of the computer installation is only an example and does not limit the computer installation, which may include more or fewer components than described above, combine certain components, or use different components. For example, it may also include input/output devices, network access devices, a bus, and the like.
The processor may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the computer installation and connects the various parts of the entire computer installation through various interfaces and lines.
The memory can be used to store the computer program and/or modules. The processor realizes the various functions of the computer installation by running or executing the computer program and/or modules stored in the memory and by calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area. The program storage area can store the operating system and the application programs required by at least one function (such as a sound playback function or an image playback function). The data storage area can store data created through use of the mobile phone (such as audio data or a phone book). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices.
If the integrated modules/units of the computer installation are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer readable storage medium. On this understanding, all or part of the processes in the methods of the above embodiments can also be completed by a computer program instructing the relevant hardware. The computer program can be stored in a computer readable storage medium and, when executed by a processor, can implement the steps of each of the above web page classification method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disc, computer memory, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), an electrical carrier signal, an electrical signal, a software distribution medium, and so on.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (12)

1. A web page classification method, characterized in that the web page classification method comprises:
taking the URL links and HTML source code of N acquired theme-type web pages and the URL links and HTML source code of N acquired list-type web pages as a training set;
extracting the web page URL features of the training set from the URL links of the training set, extracting the label features of the training set from the HTML source code of the training set, and taking, as the page features of the training set, the number of URL links in the HTML source code of the training set whose similarity to the associated URL link is greater than a preset threshold, together with the size of the HTML source code;
vectorizing the web page URL features, the label features and the page features of the training set, and inputting the vectorized web page URL features, vectorized label features and vectorized page features into a random forest algorithm model for training to obtain a classifier;
taking the URL links and HTML source code of M acquired theme-type web pages and the URL links and HTML source code of M acquired list-type web pages as a test set;
obtaining the web page URL features, label features and page features of the test set, and inputting the web page URL features, label features and page features of the test set into the classifier to obtain a web page classification result.
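The page features named in claim 1 (the count of links similar to the page's own URL, and the size of the HTML source) can be illustrated with a short sketch. The claim does not say how URL similarity is measured, so the use of `difflib.SequenceMatcher`, the 0.8 threshold, measuring size in UTF-8 bytes, and the `example.com` URLs are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser
from typing import List, Tuple


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> element in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def page_features(page_url: str, html: str, threshold: float) -> Tuple[int, int]:
    """Return (number of links similar to page_url, HTML size in bytes)."""
    collector = LinkCollector()
    collector.feed(html)
    similar = sum(
        1
        for link in collector.links
        # Assumed similarity measure: character-level matching ratio.
        if SequenceMatcher(None, page_url, link).ratio() > threshold
    )
    return similar, len(html.encode("utf-8"))


html = (
    "<html><body>"
    '<a href="http://example.com/news/2019/0725/1001.html">a</a>'
    '<a href="http://example.com/news/2019/0725/1002.html">b</a>'
    '<a href="mailto:x@example.org">c</a>'
    "</body></html>"
)
features = page_features(
    "http://example.com/news/2019/0725/index.html", html, threshold=0.8
)
```

In this example the two article links share a long prefix with the page URL and are counted, while the `mailto:` link is not.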
2. The web page classification method according to claim 1, characterized in that extracting the web page URL features of the training set from the URL links of the training set comprises the following process:
judging whether the URL links of the training set include a time feature, a domain name feature and a passive feature, and taking the judgment results as the web page URL features of the training set.
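The judgment in claim 2 yields one yes/no result per feature, which maps naturally onto a small 0/1 vector. The sketch below is hedged heavily: the claim does not define the three features, so the date pattern used for the time feature and the "bare domain" reading of the domain name feature are assumptions, and the passive feature is left to a caller-supplied hypothetical predicate because nothing in the claim indicates how it is detected.

```python
import re
from typing import Callable, List
from urllib.parse import urlparse

# Assumption: a "time feature" is a date such as 2019/07/25, 2019-07-25
# or 20190725 appearing in the URL path.
DATE_PATTERN = re.compile(
    r"(19|20)\d{2}[/_-]?(0[1-9]|1[0-2])[/_-]?(0[1-9]|[12]\d|3[01])"
)


def has_time_feature(url: str) -> bool:
    return DATE_PATTERN.search(urlparse(url).path) is not None


def has_domain_feature(url: str) -> bool:
    # Assumption: the URL points at the bare domain, with no path beyond "/".
    return urlparse(url).path in ("", "/")


def url_features(url: str, has_passive_feature: Callable[[str], bool]) -> List[int]:
    """Return the three judgment results as a 0/1 vector."""
    return [
        int(has_time_feature(url)),
        int(has_domain_feature(url)),
        int(has_passive_feature(url)),  # hypothetical, supplied by the caller
    ]


article = url_features(
    "http://example.com/news/2019/07/25/1001.html", lambda url: False
)
home = url_features("http://example.com/", lambda url: False)
```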
3. The web page classification method according to claim 1, characterized in that extracting the label features of the training set from the HTML source code of the training set comprises the following process:
deleting the noise labels in the HTML source code of the training set together with the content corresponding to the noise labels, to obtain effective labels;
obtaining the label features of the effective labels, the label features including: label sequence number, label text length, left label length, right label length, number of punctuation marks in the label text, label level, number of merged leaf labels, number of merged non-leaf labels, and total number of merged labels;
ranking the effective labels using a recursive elimination algorithm, and selecting the R highest-ranked labels as retained labels;
in the case where no labels of the same type exist among the retained labels, taking the label features of the retained labels as the label features of the training set;
in the case where labels of the same type exist among the retained labels, merging the labels of the same type among the retained labels, determining the label features of the labels of the same type, and taking the label features of the retained labels other than the labels of the same type, together with the determined label features, as the label features of the training set.
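The first two steps of claim 3 (dropping noise labels, then recording per-label features) can be sketched with the standard-library HTML parser. This reads "label" as an HTML tag, which is an assumption about the translation. The choice of `script` and `style` as noise, the punctuation set, and the restriction to four of the listed features (sequence number, text length, punctuation count, level) are also assumptions. Ranking by recursive elimination and merging of same-type labels are not shown, and the sketch assumes well-formed HTML.

```python
from html.parser import HTMLParser
from typing import Dict, List

NOISE_TAGS = {"script", "style"}  # assumption: which tags count as noise
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # tags with no end tag
PUNCTUATION = set(",.;:!?，。；：！？、")


class LabelFeatureParser(HTMLParser):
    """Record simple features for each non-noise tag in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.stack: List[Dict] = []    # currently open tags
        self.records: List[Dict] = []  # one record per effective tag
        self.in_noise = False

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.in_noise = True
            return
        if tag in VOID_TAGS:
            return
        record = {
            "seq": len(self.records),      # label sequence number
            "tag": tag,
            "level": len(self.stack) + 1,  # depth in the tag tree
            "text_len": 0,
            "punct": 0,
        }
        self.records.append(record)
        self.stack.append(record)

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self.in_noise = False
            return
        if tag in VOID_TAGS:
            return
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.in_noise or not self.stack:
            return  # skip noise content and text outside any tag
        text = data.strip()
        record = self.stack[-1]
        record["text_len"] += len(text)
        record["punct"] += sum(ch in PUNCTUATION for ch in text)


def label_features(html: str) -> List[Dict]:
    parser = LabelFeatureParser()
    parser.feed(html)
    return parser.records


sample = (
    '<html><body><script>var a = "x, y";</script>'
    "<div><p>Hello, world.</p><span>ok</span></div></body></html>"
)
records = label_features(sample)
```

Each record attributes text only to the innermost open tag, so a container such as `div` gets a text length of zero unless it holds text directly.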
4. The web page classification method according to claim 1, characterized in that obtaining the web page URL features, label features and page features of the test set comprises the following process:
extracting the web page URL features of the test set from the URL links of the test set, extracting the label features of the test set from the HTML source code of the test set, and taking, as the page features of the test set, the number of URL links in the HTML source code of the test set whose similarity to the associated URL link is greater than a preset threshold, together with the size of the HTML source code.
5. The web page classification method according to any one of claims 1-4, characterized in that, after obtaining the web page classification result, the web page classification method further comprises the following process:
judging whether the classification precision and recall rate of the web page classification result are greater than a preset threshold;
in the case where the classification precision and recall rate of the web page classification result are greater than the preset threshold, taking the web page classification result as the final result; in the case where the classification precision and recall rate of the web page classification result are less than or equal to the preset threshold, adjusting the configuration parameters of the classifier until a web page classification result whose classification precision and recall rate are greater than the preset threshold is obtained.
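The check in claim 5 can be illustrated with a small calculation. The sketch below computes precision and recall for one class from true and predicted labels and then applies the threshold test. It assumes that "classification precision" means precision in the usual sense, that a result is accepted only when both metrics exceed the threshold, and that a single threshold applies to both; the class names and the 0.9 threshold are illustrative.

```python
from typing import Sequence, Tuple


def precision_recall(
    y_true: Sequence[str], y_pred: Sequence[str], positive: str
) -> Tuple[float, float]:
    """Compute precision and recall for the class named by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def needs_tuning(precision: float, recall: float, threshold: float) -> bool:
    """True when the result is not accepted and parameters must be adjusted."""
    return precision <= threshold or recall <= threshold


y_true = ["theme", "theme", "list", "list", "theme"]
y_pred = ["theme", "list", "list", "theme", "theme"]
precision, recall = precision_recall(y_true, y_pred, positive="theme")
retune = needs_tuning(precision, recall, threshold=0.9)
```

Here two of the three pages predicted as theme-type are correct and two of the three true theme-type pages are found, so both metrics equal 2/3 and the classifier would be sent back for parameter adjustment.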
6. A web page classification device, characterized in that the web page classification device comprises:
a first obtaining module, configured to take the URL links and HTML source code of N acquired theme-type web pages and the URL links and HTML source code of N acquired list-type web pages as a training set;
a processing module, configured to extract the web page URL features of the training set from the URL links of the training set, extract the label features of the training set from the HTML source code of the training set, and take, as the page features of the training set, the number of URL links in the HTML source code of the training set whose similarity to the associated URL link is greater than a preset threshold, together with the size of the HTML source code;
a training module, configured to vectorize the web page URL features, the label features and the page features of the training set, and input the vectorized web page URL features, vectorized label features and vectorized page features into a random forest algorithm model for training to obtain a classifier;
a second obtaining module, configured to take the URL links and HTML source code of M acquired theme-type web pages and the URL links and HTML source code of M acquired list-type web pages as a test set;
a classification module, configured to obtain the web page URL features, label features and page features of the test set, and input the web page URL features, label features and page features of the test set into the classifier to obtain a web page classification result.
7. The web page classification device according to claim 6, characterized in that the processing module is further configured to judge whether the URL links of the training set include a time feature, a domain name feature and a passive feature, and to take the judgment results as the web page URL features of the training set.
8. The web page classification device according to claim 6, characterized in that the processing module further comprises:
a deletion submodule, configured to delete the noise labels in the HTML source code of the training set together with the content corresponding to the noise labels, to obtain effective labels;
an acquisition submodule, configured to obtain the label features of the effective labels, the label features including: label sequence number, label text length, left label length, right label length, number of punctuation marks in the label text, label level, number of merged leaf labels, number of merged non-leaf labels, and total number of merged labels;
a sorting submodule, configured to rank the effective labels using a recursive elimination algorithm and select the R highest-ranked labels as retained labels;
a first processing submodule, configured to take the label features of the retained labels as the label features of the training set in the case where no labels of the same type exist among the retained labels;
a second processing submodule, configured to, in the case where labels of the same type exist among the retained labels, merge the labels of the same type among the retained labels, determine the label features of the labels of the same type, and take the label features of the retained labels other than the labels of the same type, together with the determined label features, as the label features of the training set.
9. The web page classification device according to claim 6, characterized in that the second obtaining module is further configured to extract the web page URL features of the test set from the URL links of the test set, extract the label features of the test set from the HTML source code of the test set, and take, as the page features of the test set, the number of URL links in the HTML source code of the test set whose similarity to the associated URL link is greater than a preset threshold, together with the size of the HTML source code.
10. The web page classification device according to any one of claims 6-9, characterized by further comprising:
a judgment module, configured to judge whether the classification precision and recall rate of the web page classification result are greater than a preset threshold;
an adjustment module, configured to take the web page classification result as the final result in the case where the classification precision and recall rate of the web page classification result are greater than the preset threshold, and, in the case where the classification precision and recall rate of the web page classification result are less than or equal to the preset threshold, to adjust the configuration parameters of the classifier until a web page classification result whose classification precision and recall rate are greater than the preset threshold is obtained.
11. A computer installation, characterized in that the computer installation includes a processor, and the processor is configured to implement the steps of the web page classification method according to any one of claims 1-5 when executing a computer program in a memory.
12. A computer readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the web page classification method according to any one of claims 1-6.
CN201910677072.2A 2019-07-25 2019-07-25 Web page classification method, device, computer installation and computer readable storage medium Pending CN110516710A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910677072.2A CN110516710A (en) 2019-07-25 2019-07-25 Web page classification method, device, computer installation and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN110516710A true CN110516710A (en) 2019-11-29

Family

ID=68623561

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910677072.2A Pending CN110516710A (en) 2019-07-25 2019-07-25 Web page classification method, device, computer installation and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN110516710A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143642A (en) * 2019-12-30 2020-05-12 北京天融信网络安全技术有限公司 Webpage classification method and device, electronic equipment and computer readable storage medium
CN112507186A (en) * 2020-11-27 2021-03-16 北京数立得科技有限公司 Webpage element classification method
WO2023282848A1 (en) * 2021-07-07 2023-01-12 脸萌有限公司 Web page classification method and apparatus, storage medium, and electronic device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101251855A (en) * 2008-03-27 2008-08-27 腾讯科技(深圳)有限公司 Equipment, system and method for cleaning internet web page
CN102332028A (en) * 2011-10-15 2012-01-25 西安交通大学 Webpage-oriented unhealthy Web content identifying method
US20120158724A1 (en) * 2010-12-21 2012-06-21 Tata Consultancy Services Limited Automated web page classification
CN106897821A (en) * 2017-01-24 2017-06-27 中国电力科学研究院 A kind of transient state assesses feature selection approach and device
CN107577783A (en) * 2017-09-15 2018-01-12 电子科技大学 The type of webpage automatic identifying method excavated based on Web architectural features
CN108777674A (en) * 2018-04-24 2018-11-09 东南大学 A kind of detection method for phishing site based on multi-feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Lin Peng; Wu Xiao
Inventor before: Lin Peng; Wu Xiao; Huang Jiuming; Zhang Shengdong
WD01 Invention patent application deemed withdrawn after publication
Application publication date: 20191129