CN111125483B - Webpage data extraction template generation method and device, computer device and storage medium - Google Patents
Webpage data extraction template generation method and device, computer device and storage medium
- Publication number
- CN111125483B (application number CN201911302343.2A)
- Authority
- CN
- China
- Prior art keywords
- field
- extracted
- template
- attribute
- webpage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/957—Browsing optimisation, e.g. caching or content distillation
- G06F16/9577—Optimising the visualization of content, e.g. distillation of HTML documents
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention is applicable to the technical field of Internet, and provides a webpage data extraction template generation method, a device, a computer device and a storage medium, wherein the method comprises the following steps: labeling a field to be extracted from a sample webpage, setting field attributes of the field to be extracted, and storing the field to be extracted and the field attributes as a webpage extraction sample into an extraction sample set; when the number of the webpage extraction samples is greater than or equal to 2, traversing the fields to be extracted of the webpage extraction samples, and generating a field sample set; selecting a candidate node set of a current field to be extracted from the converted DOM tree, comparing the candidate node set corresponding to each field to be extracted, and taking the node which is unchanged in the candidate node set as a target node of the current field to be extracted; and determining the position information of the attribute value relative to the attribute name, and generating a webpage data extraction template according to the attribute name and the position information. The method for generating the webpage data extraction template can improve the generation efficiency of the webpage data extraction template.
Description
Technical Field
The invention belongs to the technical field of internet, and particularly relates to a webpage data extraction template generation method, device, computer device and storage medium.
Background
With the rapid development of the Internet, network information resources are growing exponentially. Massive amounts of valuable data from various industries, such as personal resumes, enterprise information, intellectual property, and commodity information, are gathered in web pages and can greatly help knowledge discovery, information retrieval, data mining, and the like. How to parse pages more conveniently and accurately and extract the valuable data has become an important research problem.
Existing website content extraction falls mainly into two categories. The first category covers news websites, whose pages generally contain a title, a time, an author, and a large body of text; for such pages, existing tools can extract the content by calculating the text density of the webpage. The second category covers websites with complex formats, such as product detail pages of shopping websites, personal resume pages, and enterprise information pages, whose page content differs by type; for such websites, existing tools generate extraction patterns from manually written templates. The manual-template approach requires a different template for each kind of webpage, and multiple extraction patterns must be written even for repeated content, so the existing webpage template generation process suffers from high labor cost, much repetitive work, and poor template generalization.
Disclosure of Invention
The embodiment of the invention provides a webpage data extraction template generation method, which aims to solve the problems of high labor cost, much repetitive work, and poor template generalization in the existing webpage template generation process.
The invention is realized in such a way that a webpage data extraction template generation method comprises the following steps:
marking a field to be extracted from a sample webpage, setting a field attribute for the field to be extracted, storing the field to be extracted and the field attribute as a webpage extraction sample into an extraction sample set, wherein the field attribute comprises an attribute name, an attribute value, an attribute type, whether the field repeatedly appears on the webpage or not and a CSS path based on an HTML attribute label;
traversing all fields to be extracted in the webpage extraction samples under the condition that the number of webpage extraction samples in the extraction sample set is greater than or equal to 2, and generating a field sample set, wherein the field sample set comprises at least two fields to be extracted with the same unique identification, and the unique identification is formed according to an attribute name and a template name in the field attribute;
traversing each field to be extracted in a field sample set, converting an original HTML webpage of each field to be extracted into a corresponding DOM tree, describing the position characteristics of the DOM tree based on HTML attribute tags according to JQuery specifications, generating a corresponding CSS path, selecting a candidate node set of a current field to be extracted from the converted DOM tree according to the CSS path, comparing the candidate node set corresponding to each field to be extracted, and taking a node which is unchanged all the time in the candidate node set as a target node of the current field to be extracted;
comparing the field to be extracted with the target node, and determining relative position information of the attribute value in the field to be extracted relative to the target node, wherein the attribute value is the part of the field to be extracted other than the attribute name; and generating a webpage data extraction template according to the attribute name, the attribute value type and the determined relative position information of the field to be extracted.
Optionally, before the field to be extracted is marked from the sample web page, the generating method of the web page data extraction template further includes:
setting sample parameters for a sample web page, wherein the sample parameters comprise: the name of the database to save to, the name of the web page type, and the template name.
Optionally, traversing all the fields to be extracted in the webpage extraction sample, and generating the field sample set includes the following steps:
and generating unique identifiers for all the fields to be extracted in the webpage extraction samples according to the attribute names and the template names, and aggregating the same fields to be extracted in different webpage extraction samples through the unique identifiers to generate the field sample set.
Optionally, the relative position information includes a target node, a search direction, a search step number, and a neighboring node.
Optionally, the generating the web page data extraction template according to the attribute name and the position information of the attribute value relative to the attribute name includes the following steps:
counting the occurrence times of each target node, selecting the target node with the highest occurrence times as a template target node, determining the relative position information of the attribute values in all the fields to be extracted relative to the template target node, and forming a relative position information set by the relative position information of the attribute values in all the fields to be extracted relative to the template target node;
and generating a webpage data extraction template according to the attribute name, the attribute value type and the relative position information set of the field to be extracted.
The invention also provides a webpage data extraction template generation device, which comprises:
the labeling module is used for labeling a field to be extracted from a sample webpage, setting a field attribute for the field to be extracted, storing the field to be extracted and the field attribute as a webpage extraction sample into an extraction sample set, wherein the field attribute comprises an attribute name, an attribute value, an attribute type, whether the field repeatedly appears on the webpage or not and a CSS path based on an HTML attribute label;
the first generation module is used for traversing all fields to be extracted in the webpage extraction samples to generate a field sample set under the condition that the number of webpage extraction samples in the extraction sample set is greater than or equal to 2, wherein the field sample set comprises at least two fields to be extracted with the same unique identification, and the unique identification is formed according to an attribute name and a template name in the field attribute;
the processing module is used for traversing each field to be extracted in the field sample set, converting an original HTML webpage of each field to be extracted into a corresponding DOM tree, describing the position characteristics of the DOM tree based on HTML attribute tags according to JQuery specifications, generating a corresponding CSS path, selecting a candidate node set of a current field to be extracted from the converted DOM tree according to the CSS path, comparing the candidate node set corresponding to each field to be extracted, and taking a node which is unchanged all the time in the candidate node set as a target node of the current field to be extracted;
the second generation module is used for comparing the field to be extracted with the target node and determining relative position information of an attribute value in the field to be extracted relative to the target node, wherein the attribute value is a field except an attribute name in the field to be extracted; and generating a webpage data extraction template according to the attribute name, the attribute value type and the determined relative position information.
Optionally, the webpage data extraction template generating device further includes:
the setting module is used for setting sample parameters for the sample web page, wherein the sample parameters comprise a database name to be saved, a web page type name and a template name.
Optionally, the first generating module is further configured to generate unique identifiers for all fields to be extracted in the webpage extraction samples according to the attribute names and the template names, and aggregate the same field to be extracted existing in different webpage extraction samples by using the unique identifiers to generate the field sample set.
Optionally, the relative position information includes a target node, a search direction, a search step number, and a neighboring node.
Optionally, the second generating module includes:
the processing sub-module is used for counting the occurrence times of all target nodes, selecting the target node with the highest occurrence times as a template target node, determining the relative position information of the attribute values in all the fields to be extracted relative to the template target node, and forming a relative position information set by the relative position information of the attribute values in all the fields to be extracted relative to the template target node;
the generation sub-module is used for generating a webpage data extraction template according to the attribute name, the attribute value type and the relative position information set of the field to be extracted.
The invention also provides a computer device comprising a processor for implementing the steps of the method for generating a webpage data extraction template as described above when executing a computer program in a memory.
The present invention also provides a computer storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of generating a web page data extraction template as described above.
According to the webpage data extraction template generation method, under the condition that the number of webpage extraction samples is greater than or equal to 2, a field sample set is generated for the to-be-extracted field of the webpage extraction samples, a target node of the to-be-extracted field is determined based on the field sample set, the to-be-extracted field is compared with the target node, the position information of an attribute value relative to the target node is determined, and the webpage data extraction template is generated according to the attribute name, the attribute value type and the determined relative position information of the to-be-extracted field.
Drawings
Fig. 1 is a flowchart of an implementation of a method for generating a web page data extraction template according to an embodiment of the present invention;
Fig. 2 is a flowchart of the steps performed after a node that remains unchanged in a candidate node set is taken as the target node of the current field to be extracted, according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a device for generating a web page data extraction template according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a second generating module according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of another device for generating a web page data extraction template according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Fig. 1 is a flowchart illustrating an implementation of a method for generating a web page data extraction template according to an embodiment of the present invention. The webpage data extraction template generation method comprises the following steps:
Step S101, a field to be extracted is labeled from a sample webpage, field attributes are set for the field to be extracted, and the field to be extracted and the field attributes are stored as a webpage extraction sample into an extraction sample set, wherein the field attributes comprise an attribute name, an attribute value, an attribute type, whether the field appears repeatedly on the webpage, and a CSS path based on HTML attribute tags.
In this embodiment, a labeling tool may be installed in the browser in advance; a sample webpage is opened in the browser, the labeling mode is turned on, and the sample webpage is labeled with the labeling tool. The labeling tool automatically identifies the uniform resource locator (URL) of the current sample page and receives information entered by the user, such as the template name and the background server address. The field to be extracted is selected from the current sample page with a selection tool in the labeling tool. There are various ways to select the field to be extracted. For example, when the cursor is moved over the data to be extracted with the mouse, the selection tool may automatically highlight and frame that data content. The sample webpage may be a news page, a shopping page, a search results page, or the like. For example, the fields to be extracted may be "company name: A", "number of company employees: 123", and so on. In this embodiment, the field type may include plain text, pictures, links, or the like.
The extraction sample set may be stored on a background server, and the field to be extracted and the field attributes may be saved into the extraction sample set as a webpage extraction sample via the background server address. It should be noted that a webpage extraction sample may include several different fields and their corresponding attributes. For example, a sample may include the fields "company name: A" and "number of company employees: 123", and corresponding field attributes can be set for each of them.
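The data recorded in step S101 can be pictured with a minimal Python sketch. The class names `FieldAttribute` and `ExtractionSample`, the in-memory list standing in for the background server, and the example URL and CSS path are all assumptions made for illustration; the patent does not prescribe a storage format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FieldAttribute:
    """Attributes set for one labeled field (names are illustrative)."""
    name: str        # attribute name, e.g. "company name"
    value: str       # attribute value, e.g. "A"
    value_type: str  # attribute type: "text", "picture", or "link"
    repeated: bool   # whether the field appears repeatedly on the page
    css_path: str    # CSS path based on HTML attribute tags


@dataclass
class ExtractionSample:
    """One webpage extraction sample: a labeled page and its fields."""
    url: str
    template_name: str
    fields: List[FieldAttribute] = field(default_factory=list)


# Hypothetical stand-in for the extraction sample set on the background server.
sample_set: List[ExtractionSample] = []


def save_sample(sample: ExtractionSample) -> int:
    """Store a sample and return the current size of the sample set."""
    sample_set.append(sample)
    return len(sample_set)


sample = ExtractionSample(
    url="https://example.com/company/a",
    template_name="company abstract template",
)
sample.fields.append(
    FieldAttribute("company name", "A", "text", False, "div.company > span.name")
)
save_sample(sample)
```

The size returned by `save_sample` is what a caller would check against the threshold of 2 samples before moving on to step S102.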
Optionally, before step S101, the method for generating a web page data extraction template further includes the following step:
setting sample parameters for a sample web page, wherein the sample parameters comprise: the name of the database to save to, the name of the web page type, and the template name.
In this embodiment, the template name may be determined by user definition, where the database name to be saved designates a database storing fields to be annotated, and the web page type name is one of the types of news page, shopping page, search result page, and the like.
Step S102, under the condition that the number of webpage extraction samples in the extraction sample set is greater than or equal to 2, traversing all fields to be extracted in the webpage extraction samples to generate a field sample set, wherein the field sample set comprises at least two fields to be extracted with the same unique identification, and the unique identification is formed according to an attribute name and a template name in the field attribute.
In this embodiment, the unique identifier is formed by concatenating the attribute name and the template name. For example, for the field to be extracted "company name: A" in the first webpage extraction sample, the attribute name is "company name" and the template name is "company abstract template", so the unique identifier of the field is "company abstract template company name". For the field to be extracted "number of company employees: 10 persons" in the first webpage extraction sample, the attribute name is "number of company employees" and the template name is "company abstract template", so the unique identifier of the field is "company abstract template number of company employees". The unique identifier therefore uniquely identifies a field within the first webpage extraction sample and does not collide with the other fields of that sample.
As another example, for the field to be extracted "company name: B" in the second webpage extraction sample, the attribute name is "company name" and the template name is "company abstract template", so its unique identifier is also "company abstract template company name". Fields with the same unique identifier are aggregated and classified together, so the field sample set comprises the fields to be extracted "company name: A" and "company name: B".
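The concatenation rule can be sketched in a few lines of Python. The function name `unique_id` and the single-space separator are assumptions; the text only states that the identifier is formed from the template name and the attribute name.

```python
def unique_id(template_name: str, attribute_name: str) -> str:
    """Concatenate template name and attribute name (separator is an assumption)."""
    return f"{template_name} {attribute_name}"


# Same field labeled in two different samples: identifiers match.
first = unique_id("company abstract template", "company name")
second = unique_id("company abstract template", "company name")

# A different field of the same sample: identifier differs.
other = unique_id("company abstract template", "number of company employees")
```

Because the identifier depends only on the template name and the attribute name, the same field labeled on different pages maps to the same key, which is what makes the later aggregation possible.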
In the step S102, traversing all the fields to be extracted in the web page extraction sample, and generating a field sample set includes the following steps:
and generating unique identifiers for all the fields to be extracted in the webpage extraction samples according to the attribute names and the template names, and aggregating the same fields to be extracted in different webpage extraction samples through the unique identifiers to generate the field sample set.
If the first webpage extraction sample has two fields to be extracted, "company name: company 1" and "number of company employees: 10 persons", the labeling results for the two fields are stored as sample1 = {company_name = company 1, company_num = 10}, and the unique identifiers of the two fields are "company abstract template company name" and "company abstract template number of company employees" respectively.
If the second webpage extraction sample has two fields to be extracted, "company name: company 2" and "number of company employees: 20 persons", the labeling results for the two fields are stored as sample2 = {company_name = company 2, company_num = 20}, and the unique identifiers of the two fields are likewise "company abstract template company name" and "company abstract template number of company employees".
In this embodiment, the fields may first be processed sample by sample, and the generation of the field sample sets then regroups them field by field: the same field occurring in each webpage extraction sample is aggregated and classified by its unique identifier. For example, applying field aggregation to "sample1 = {company_name = company 1, company_num = 10}" and "sample2 = {company_name = company 2, company_num = 20}" yields:
field1 = {sample1.company_name = company 1, sample2.company_name = company 2}
field2 = {sample1.company_num = 10, sample2.company_num = 20}
where field1 denotes field sample set 1 and field2 denotes field sample set 2.
Therefore, the field to be extracted can be accurately classified, and a field sample set can be quickly obtained.
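The regrouping from per-sample records to per-field sample sets can be sketched as follows. The dictionary layout, the function name `build_field_sample_sets`, and the `:` separator in the identifier are illustrative assumptions; the values mirror the sample1 and sample2 example above.

```python
from collections import defaultdict
from typing import Dict

Samples = Dict[str, Dict[str, str]]

samples: Samples = {
    "sample1": {"company_name": "company 1", "company_num": "10"},
    "sample2": {"company_name": "company 2", "company_num": "20"},
}


def build_field_sample_sets(samples: Samples, template_name: str) -> Dict[str, Dict[str, str]]:
    """Group the same field across samples, keyed by its unique identifier."""
    if len(samples) < 2:
        # Step S102 only runs once at least two extraction samples exist.
        return {}
    field_sets: Dict[str, Dict[str, str]] = defaultdict(dict)
    for sample_id, fields in samples.items():
        for attr_name, attr_value in fields.items():
            uid = f"{template_name}:{attr_name}"  # separator is an assumption
            field_sets[uid][sample_id] = attr_value
    return dict(field_sets)


field_sets = build_field_sample_sets(samples, "company abstract template")
```

Each value of `field_sets` corresponds to one field sample set (field1, field2) and holds one entry per sample in which the field was labeled.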
Step S103, traversing each field to be extracted in a field sample set, converting an original HTML webpage of each field to be extracted into a corresponding DOM tree, describing the position characteristics of the DOM tree based on HTML attribute tags according to JQuery specifications, generating a corresponding CSS path, selecting a candidate node set of a current field to be extracted from the converted DOM tree according to the CSS path, comparing the candidate node set corresponding to each field to be extracted, and taking a node which is unchanged all the time in the candidate node set as a target node of the current field to be extracted.
It should be added that the DOM tree has corresponding features, generally called DOM tree features, which here means the features of the field to be extracted within the DOM tree, such as its position feature and the features of its adjacent nodes (above, below, left, and right). The various DOM tree features can be understood as context information: the position feature can be described with a CSS path, and the adjacent-node features may include information about parent nodes, sibling nodes, descendant nodes, and the like.
In this embodiment, a node path recorded with array subscripts under the CSS specification is re-described, following the JQuery specification, in terms of position features based on HTML attribute tags to generate the corresponding CSS path, and the generated CSS path is saved in the field attributes. For convenience, the node path recorded with array subscripts under the CSS specification is called CSS_OLD_FORMAT, and the CSS path generated by describing the position features of the DOM tree based on HTML attribute tags according to the JQuery specification is called CSS_NEW_FORMAT. For a field that appears repeatedly on the page, path conversion is performed only for its first occurrence. It should also be added that the CSS path based on HTML attribute tags may be set in the field attributes, which may include the attribute name, the attribute value, the attribute type, whether the field appears repeatedly on the page, and the CSS path based on HTML attribute tags.
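The difference between the two path formats can be illustrated with a small Python sketch. The patent does not give the exact syntax of either format, so the `:nth-child(n)` notation for CSS_OLD_FORMAT, the `#id` and `.class` notation for CSS_NEW_FORMAT, and the fallback to the index when a node has no usable attribute are all assumptions.

```python
from typing import Dict, List, Tuple

# One step on the path from an ancestor to the labeled node:
# (tag name, 1-based child index, HTML attributes of the node)
Step = Tuple[str, int, Dict[str, str]]


def old_format(path: List[Step]) -> str:
    """Index-based path (assumed shape of CSS_OLD_FORMAT)."""
    return " > ".join(f"{tag}:nth-child({idx})" for tag, idx, _ in path)


def new_format(path: List[Step]) -> str:
    """Attribute-based path (assumed shape of CSS_NEW_FORMAT)."""
    parts = []
    for tag, idx, attrs in path:
        if "id" in attrs:
            parts.append(f"{tag}#{attrs['id']}")
        elif "class" in attrs:
            classes = "".join("." + c for c in attrs["class"].split())
            parts.append(tag + classes)
        else:
            # No usable attribute: keep the positional step.
            parts.append(f"{tag}:nth-child({idx})")
    return " > ".join(parts)


path: List[Step] = [
    ("div", 2, {"class": "product info"}),
    ("span", 1, {"id": "prod-date"}),
]
old_path = old_format(path)
new_path = new_format(path)
```

An attribute-based path tends to survive small layout shifts between pages of the same type, whereas a purely index-based path breaks as soon as a sibling element is added or removed.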
As an example of target node selection, suppose the fields to be extracted labeled in the first page and the second page of similar commodities are "date of production: July 1, 2018" and "date of production: July 2, 2018" respectively. The node that remains unchanged across the candidate node sets of the two fields is "date of production", so "date of production" is taken as the target node of the current fields to be extracted "date of production: July 1, 2018" and "date of production: July 2, 2018".
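Selecting the invariant node can be sketched as a set intersection over the candidate node sets. Representing each candidate node by its text and returning the first common text are simplifying assumptions; the function name `find_target_node` is hypothetical.

```python
from typing import List, Optional


def find_target_node(candidate_sets: List[List[str]]) -> Optional[str]:
    """Return the node text that is present, unchanged, in every candidate set."""
    if len(candidate_sets) < 2:
        # At least two samples are needed to tell fixed labels from values.
        return None
    common = set(candidate_sets[0])
    for candidates in candidate_sets[1:]:
        common &= set(candidates)
    # Keep the document order of the first page for a deterministic result.
    for text in candidate_sets[0]:
        if text in common:
            return text
    return None


page1_candidates = ["date of production", "July 1, 2018"]
page2_candidates = ["date of production", "July 2, 2018"]
target = find_target_node([page1_candidates, page2_candidates])
```

The values differ between the two pages and drop out of the intersection, so only the label "date of production" remains as the target node.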
Step S104, comparing the field to be extracted with the target node, and determining the relative position information of the attribute value in the field to be extracted relative to the target node, wherein the attribute value is a field except for an attribute name in the field to be extracted; and generating a webpage data extraction template according to the attribute name, the attribute value type and the determined relative position information.
In this embodiment, the relative position information includes the search direction, the search step count, and the neighboring nodes. The target node is the reference object for locating the attribute value: the position of the attribute value in the webpage can be located from the target node, the search direction, the search step count, and the neighboring nodes. Data in other webpages can then be extracted with the webpage data extraction template. For example, if the fields extracted from a third webpage with the template, such as "date of production: July 3, 2018", "company name: XX company", and "number of company employees: XX", meet the user's requirements, the template is satisfactory; if they do not, the template is corrected until the requirements are met.
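How a target node, a search direction, and a step count locate an attribute value can be sketched in Python. The sketch flattens the page into a list of node texts in document order and omits the neighboring-node information; both simplifications, and the names `RelativePosition`, `learn_position`, and `extract`, are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RelativePosition:
    target: str     # text of the target node (the fixed label)
    direction: str  # "next" or "previous"
    steps: int      # number of nodes to move from the target


def learn_position(nodes: List[str], target: str, value: str) -> RelativePosition:
    """Derive the relative position of a labeled value from one sample page."""
    offset = nodes.index(value) - nodes.index(target)
    direction = "next" if offset >= 0 else "previous"
    return RelativePosition(target, direction, abs(offset))


def extract(nodes: List[str], pos: RelativePosition) -> Optional[str]:
    """Apply a learned relative position to a new page."""
    if pos.target not in nodes:
        return None
    start = nodes.index(pos.target)
    index = start + pos.steps if pos.direction == "next" else start - pos.steps
    return nodes[index] if 0 <= index < len(nodes) else None


page1 = ["company name", "XX company", "date of production", "July 1, 2018"]
position = learn_position(page1, "date of production", "July 1, 2018")

page3 = ["date of production", "July 3, 2018", "company name", "YY company"]
value = extract(page3, position)
```

The learned position is independent of where the label sits on the page, so the same template entry still finds the value on the third page even though the field order differs.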
Referring to fig. 2, after step S103, the method for generating the web page data extraction template includes the following steps:
step S105, counting the occurrence times of each target node, selecting the target node with the highest occurrence times as a template target node, determining the relative position information of the attribute values in all the fields to be extracted relative to the template target node, and forming a relative position information set by the relative position information of the attribute values in all the fields to be extracted relative to the template target node.
Step S106, generating a webpage data extraction template according to the attribute name, the attribute value type and the relative position information set of the field to be extracted.
In this embodiment, since multiple fields can be extracted from a webpage sample, multiple target nodes are generated. For example, the first webpage extraction sample has three fields to be extracted, "company name: company 1", "trade name: cup A", and "trade name: bottle B", and the second webpage extraction sample has three fields to be extracted, "company name: company 2", "trade name: cup C", and "trade name: bottle D". The target nodes determined from these fields are then one "company name" and two "trade name", and the target node "trade name" is taken as the template target node. The relative position information of "company 1", "cup A", "bottle B", "company 2", "cup C", and "bottle D" with respect to the target node "trade name" is determined, and the determined relative position information forms a relative position information set. A webpage data extraction template based on the fields to be extracted of the first and second webpages is then generated from the attribute names, the attribute value types, and the relative position information set of the fields "company name: company 1", "trade name: cup A", "trade name: bottle B", "company name: company 2", "trade name: cup C", and "trade name: bottle D".
In this way, the occurrence times of the target nodes are counted, the target node with the highest occurrence times is selected as the template target node, system resources can be saved, and the accuracy of extracting sample parameters is improved.
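The selection in step S105 amounts to a frequency count. A minimal sketch using the standard library follows; the function name `pick_template_target` is hypothetical, and resolving ties in favor of the first node seen is an assumption.

```python
from collections import Counter
from typing import List, Optional


def pick_template_target(target_nodes: List[str]) -> Optional[str]:
    """Return the most frequent target node, or None if there are none."""
    if not target_nodes:
        return None
    counts = Counter(target_nodes)
    # most_common orders equal counts by first occurrence.
    node, _count = counts.most_common(1)[0]
    return node


# One "company name" and two "trade name" target nodes, as in the example above.
targets = ["company name", "trade name", "trade name"]
template_target = pick_template_target(targets)
```

All attribute values are then located relative to this single template target node, which is what produces one relative position information set for the whole template.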
It should be added that the webpage data extraction template can be adjusted after it is generated. Adjustment may proceed as follows: the labeling tool displays the webpage data extraction template; the user opens a new page, fills in the template name in the labeling tool, and clicks Test. The extraction result obtained by extracting fields according to the template is displayed in the labeling tool page for the user to check. If the test shows no problem, labeling of this type of page is complete; if the test fails, steps S101, S102, S103, and S104 (or steps S101, S102, S103, S105, and S106) are repeated for the incorrectly extracted fields only, and the previously generated template is corrected until the test result is correct.
According to the webpage data extraction template generation method, under the condition that the number of webpage extraction samples is greater than or equal to 2, a field sample set is generated for the to-be-extracted field of the webpage extraction samples, a target node of the to-be-extracted field is determined based on the field sample set, the to-be-extracted field is compared with the target node, the position information of an attribute value relative to the target node is determined, and the webpage data extraction template is generated according to the attribute name, the attribute value type and the determined relative position information of the to-be-extracted field.
Fig. 3 is a schematic structural diagram of a device 300 for generating a web page data extraction template according to an embodiment of the present invention, and for convenience of explanation, only relevant portions of the implementation of the present invention are shown. The web page data extraction template generating apparatus 300 includes:
the labeling module 301 is configured to label a field to be extracted from a sample web page, set a field attribute for the field to be extracted, store the field to be extracted and the field attribute as a web page extraction sample in an extraction sample set, where the field attribute includes an attribute name, an attribute value, an attribute type, whether the field appears repeatedly on the page, and a CSS path based on an HTML attribute tag.
In this embodiment, a labeling tool may be installed in the browser in advance; a sample web page is opened in the browser, the labeling mode is turned on, and the sample web page is labeled with the labeling tool. The labeling tool automatically identifies the uniform resource locator (URL) of the current sample page and receives information entered by the user, such as the template name and the background server address. A field to be extracted is selected from the current sample page with the selection tool of the labeling tool. There are various ways to select the field to be extracted. For example, when the mouse cursor is moved over the data to be extracted, the selection tool may automatically frame and highlight that data content. The sample web page may be a news page, a shopping page, a search results page, or the like. For example, the fields to be extracted may be "company name: A", "number of companies: 123", and so on. In this embodiment, the field type may include plain text, pictures, links, or the like.
The extraction sample set can be stored on a background server, and the field to be extracted and its field attributes can be stored in the extraction sample set as a web page extraction sample via the background server address. It should be noted that one web page extraction sample may include several different fields and their corresponding attributes. For example, a sample may include the fields "company name: A" and "number of companies: 123", and the corresponding field attributes can be set for "company name: A" and "number of companies: 123" separately.
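As an illustrative sketch only, a web page extraction sample and its field attributes could be represented as follows. The class names `FieldAttribute` and `ExtractionSample`, the example URL, and the type strings are hypothetical and are not taken from the embodiment; only the five attribute items (attribute name, attribute value, attribute type, whether the field repeats, CSS path) come from the description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FieldAttribute:
    """Attributes set for one labeled field (names are hypothetical)."""
    attribute_name: str       # e.g. "company name"
    attribute_value: str      # e.g. "A"
    attribute_type: str       # assumed values: "text", "image" or "link"
    repeated: bool = False    # whether the field appears repeatedly on the page
    css_path: str = ""        # CSS path based on HTML attribute tags, filled in later


@dataclass
class ExtractionSample:
    """One web page extraction sample: a labeled page and its fields."""
    url: str
    template_name: str
    fields: List[FieldAttribute] = field(default_factory=list)


# One sample may hold several fields, each with its own attributes.
sample = ExtractionSample(
    url="https://example.com/company/1",  # placeholder URL
    template_name="company abstract template",
)
sample.fields.append(FieldAttribute("company name", "A", "text"))
sample.fields.append(FieldAttribute("number of companies", "123", "text"))
```

In such a layout the extraction sample set would simply be a list of `ExtractionSample` objects kept on the background server.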
The first generating module 302 is configured to traverse all the fields to be extracted in the web page extraction samples in the extraction sample set if the number of web page extraction samples in the extraction sample set is greater than or equal to 2, and generate a field sample set, where the field sample set includes at least two fields to be extracted with the same unique identifier, and the unique identifier is formed according to an attribute name and a template name in the field attribute.
In this embodiment, the unique identifier is formed by concatenating the attribute name and the template name. For example, for the field to be extracted "company name: A" in the first web page extraction sample, the attribute name is "company name", the template name is "company abstract template", and the unique identifier of the field to be extracted is "company abstract template company name". For the field to be extracted "number of company persons: 10 persons" in the first web page extraction sample, the attribute name is "number of company persons", the template name is "company abstract template", and the unique identifier of the field to be extracted is "company abstract template number of company persons". The unique identifier therefore uniquely identifies a field to be extracted within the first web page extraction sample and does not collide with the other fields of that sample.
As another example, for the field to be extracted "company name: B" in the second web page extraction sample, the attribute name is "company name", the template name is "company abstract template", and the unique identifier of the field to be extracted is "company abstract template company name". Fields with the same unique identifier are aggregated and classified together, so the field sample set includes the fields to be extracted "company name: A" and "company name: B".
The first generating module is further configured to generate unique identifiers for all fields to be extracted in the webpage extraction samples according to the attribute names and the template names, and aggregate the same field to be extracted existing in different webpage extraction samples according to the unique identifiers to generate the field sample set.
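The construction of the unique identifier can be sketched as a simple concatenation. The function name `unique_id` and the single space used as separator are assumptions made for illustration; the description only states that the template name and the attribute name are concatenated.

```python
def unique_id(template_name: str, attribute_name: str) -> str:
    """Concatenate template name and attribute name (separator is an assumption)."""
    return f"{template_name} {attribute_name}"


# The same attribute under the same template yields the same identifier
# in every sample, which is what allows later aggregation.
id_sample_1 = unique_id("company abstract template", "company name")
id_sample_2 = unique_id("company abstract template", "company name")
id_other = unique_id("company abstract template", "number of company persons")
```

Because the identifier depends only on the template name and the attribute name, two fields labeled on different pages map to the same identifier exactly when they describe the same attribute under the same template.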
If the first web page extraction sample has two fields to be extracted, "company name: company 1" and "company staff: 10 persons", the labeling results corresponding to the two fields are stored as sample1 = {company_name=company 1, company_num=10}, and the unique identifiers of the two fields to be extracted are "company abstract template company name" and "company abstract template number" respectively.
If the second web page extraction sample has two fields to be extracted, "company name: company 2" and "company staff: 20 persons", the labeling results corresponding to the two fields are stored as sample2 = {company_name=company 2, company_num=20}, and the unique identifiers of the two fields to be extracted are "company abstract template company name" and "company abstract template number" respectively.
In this embodiment, the fields may first be handled sample by sample, and the field sample set is then generated by reorganizing them field by field: the same field present in each web page extraction sample is aggregated and classified according to its unique identifier. For example, aggregating "sample1 = {company_name=company 1, company_num=10}" and "sample2 = {company_name=company 2, company_num=20}" by field gives:
field1 = {sample1.company_name=company 1, sample2.company_name=company 2}
field2 = {sample1.company_num=10, sample2.company_num=20}
where field1 denotes field sample set 1 and field2 denotes field sample set 2.
Therefore, the field to be extracted can be accurately classified, and a field sample set can be quickly obtained.
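The regrouping from per-sample storage to per-field storage can be sketched as below. The function name `build_field_sample_sets`, the dictionary layout, and the space-separated identifier are assumptions for illustration; the sample values mirror the sample1/sample2 example above.

```python
from collections import defaultdict
from typing import Dict


def build_field_sample_sets(
    samples: Dict[str, Dict[str, str]], template_name: str
) -> Dict[str, Dict[str, str]]:
    """Group the same field across samples by its unique identifier."""
    if len(samples) < 2:
        raise ValueError("at least two web page extraction samples are required")
    field_sets: Dict[str, Dict[str, str]] = defaultdict(dict)
    for sample_name, fields in samples.items():
        for attribute_name, value in fields.items():
            key = f"{template_name} {attribute_name}"  # unique identifier (assumed format)
            field_sets[key][sample_name] = value
    # A field sample set needs at least two fields sharing the identifier.
    return {k: v for k, v in field_sets.items() if len(v) >= 2}


samples = {
    "sample1": {"company_name": "company 1", "company_num": "10"},
    "sample2": {"company_name": "company 2", "company_num": "20"},
}
field_sets = build_field_sample_sets(samples, "company abstract template")
```

Here `field_sets` holds one entry per unique identifier, each mapping a sample name to the value labeled in that sample, which corresponds to field1 and field2 in the example.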
The processing module 303 is configured to traverse each field to be extracted in the field sample set, convert an original HTML web page of each field to be extracted into a corresponding DOM tree, describe position features of the DOM tree based on HTML attribute tags according to JQuery specifications, generate a corresponding CSS path, select a candidate node set of a current field to be extracted from the converted DOM tree according to the CSS path, compare the candidate node set corresponding to each field to be extracted, and use a node which is unchanged all the time in the candidate node set as a target node of the current field to be extracted.
It should be added that the DOM tree has corresponding features, generally called DOM tree features, that is, the features of the field to be extracted among the various features of the DOM tree, such as its position features and the features of the adjacent nodes above, below, to the left and to the right of it. The various features of the DOM tree may be understood as context information: the position features may be described with a CSS path, and the features of the adjacent nodes may include information about parent nodes, sibling nodes, descendant nodes, and the like.
In this embodiment, the node path that is recorded with array subscripts under the CSS specification is re-described, according to the JQuery specification, in terms of position features based on HTML attribute tags, so as to generate the corresponding CSS path, and the generated CSS path is saved in the field attribute. For ease of writing, the path that records nodes by array subscript under the CSS specification is called CSS_OLD_FORMAT, and the CSS path generated by describing the position features of the DOM tree based on HTML attribute tags according to the JQuery specification is called CSS_NEW_FORMAT. For a field that appears repeatedly on the page, only its first occurrence undergoes this path conversion. It should further be added that the CSS path based on HTML attribute tags may be set into the field attribute, so that the field attribute may include an attribute name, an attribute value, an attribute type, whether the field appears repeatedly on the page, and the CSS path based on HTML attribute tags.
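The difference between the two path formats can be illustrated with a small sketch. The exact syntax of CSS_OLD_FORMAT and CSS_NEW_FORMAT is not given in the description, so the `:nth-child(...)` form for the subscript-based path and the `tag#id` / `tag.class` form for the attribute-based path are assumptions, as are the function names and the node dictionaries.

```python
from typing import Any, Dict, List


def old_format_path(nodes: List[Dict[str, Any]]) -> str:
    """Subscript-based path: each node is located by its index under its parent."""
    return " > ".join(f"{n['tag']}:nth-child({n['index']})" for n in nodes)


def new_format_path(nodes: List[Dict[str, Any]]) -> str:
    """Attribute-based path: each node is located by its id or class attributes."""
    parts = []
    for n in nodes:
        if n.get("id"):
            parts.append(f"{n['tag']}#{n['id']}")
        elif n.get("classes"):
            parts.append(n["tag"] + "".join(f".{c}" for c in n["classes"]))
        else:
            parts.append(n["tag"])  # no usable attribute: fall back to the bare tag
    return " > ".join(parts)


# Path from an ancestor down to the labeled node (hypothetical page structure).
nodes = [
    {"tag": "div", "index": 3, "id": "product"},
    {"tag": "ul", "index": 1, "classes": ["specs"]},
    {"tag": "li", "index": 2, "classes": ["spec", "date"]},
]
css_old = old_format_path(nodes)
css_new = new_format_path(nodes)
```

A path built from attributes tends to survive the insertion or removal of sibling elements, whereas a subscript-based path changes whenever the node order changes; this is one plausible reason for preferring the attribute-based form when comparing pages of the same type.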
For example, the fields to be extracted marked in the first page and the second page of similar commodities are "date of production: July 1, 2018" and "date of production: July 2, 2018" respectively. The node that stays unchanged across the candidate node set of "date of production: July 1, 2018" and the candidate node set of "date of production: July 2, 2018" is "date of production", so "date of production" is taken as the target node of the current fields to be extracted "date of production: July 1, 2018" and "date of production: July 2, 2018".
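Selecting the node that never changes can be sketched as an intersection over the candidate node sets. In this simplified sketch each candidate node is represented only by its text, which is an assumption; the function name `find_target_nodes` is likewise hypothetical.

```python
from typing import List


def find_target_nodes(candidate_sets: List[List[str]]) -> List[str]:
    """Return the candidate nodes whose text is identical in every sample."""
    if len(candidate_sets) < 2:
        raise ValueError("at least two candidate node sets are required")
    first, *rest = candidate_sets
    # Keep the order of the first set; a node qualifies only if all sets contain it.
    return [text for text in first if all(text in other for other in rest)]


# Candidate node sets selected by the CSS path on two pages of the same type.
page1_candidates = ["date of production", "2018-07-01"]
page2_candidates = ["date of production", "2018-07-02"]
targets = find_target_nodes([page1_candidates, page2_candidates])
```

The label text is the same on both pages while the date differs, so only the label survives the comparison and can serve as the reference point for locating the value.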
A second generating module 304, configured to compare the field to be extracted with the target node, and determine relative position information of an attribute value in the field to be extracted relative to the target node, where the attribute value is the part of the field to be extracted other than the attribute name; and generate a webpage data extraction template according to the attribute name, the attribute value type and the determined relative position information.
In this embodiment, the relative position information includes a search direction, a number of search steps, and adjacent nodes. The target node is the reference object for determining the position of the attribute value, and the position of the attribute value in the web page can be located according to the target node, the search direction, the number of search steps and the adjacent nodes. Data in other web pages can be extracted through the webpage data extraction template. For example, if applying the webpage data extraction template to a third web page extracts fields such as "date of production: July 3, 2018", "company name: XX company" and "company number: XX" that meet the user's requirements, the webpage data extraction template is satisfactory; if the user's requirements are not met, the webpage data extraction template is corrected until they are met.
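How a target node, a search direction and a number of search steps locate an attribute value can be sketched as follows. The page is modeled as a flat, ordered list of node texts, which is a strong simplification of a DOM tree; the adjacent-node component of the relative position information is omitted, and all names (`RelativePosition`, `learn_position`, `extract_value`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RelativePosition:
    direction: str  # "next" or "previous" (assumed encoding of the search direction)
    steps: int      # number of search steps from the target node


def learn_position(nodes: List[str], target: str, value: str) -> RelativePosition:
    """Derive direction and step count of the value relative to the target node."""
    offset = nodes.index(value) - nodes.index(target)
    return RelativePosition("next" if offset >= 0 else "previous", abs(offset))


def extract_value(nodes: List[str], target: str, pos: RelativePosition) -> Optional[str]:
    """Apply a learned relative position to another page of the same type."""
    start = nodes.index(target)
    index = start + pos.steps if pos.direction == "next" else start - pos.steps
    return nodes[index] if 0 <= index < len(nodes) else None


labeled_page = ["company name", "XX company", "date of production", "2018-07-01"]
third_page = ["company name", "YY company", "date of production", "2018-07-03"]

position = learn_position(labeled_page, "date of production", "2018-07-01")
value = extract_value(third_page, "date of production", position)
```

The relative position is learned once from a labeled page and then reused on an unlabeled page, which is the role the template plays when it is tested against a third web page.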
Referring to fig. 4, the second generating module 304 includes:
the processing sub-module 3041 is configured to count the occurrence times of each target node, select the target node with the highest occurrence times as a template target node, determine relative position information of attribute values in all fields to be extracted relative to the template target node, and form a relative position information set from the relative position information of the attribute values in all fields to be extracted relative to the template target node.
The generating submodule 3042 is configured to generate a webpage data extraction template according to the attribute name, the attribute value type and the relative position information set of the field to be extracted.
In this embodiment, since multiple fields can be extracted from a web page sample, multiple target nodes are generated. For example, the first web page extraction sample has three fields to be extracted, "company name: company 1", "trade name: cup A" and "trade name: bottle B", and the second web page extraction sample has three fields to be extracted, "company name: company 2", "trade name: cup C" and "trade name: bottle D". The target nodes determined from the fields to be extracted are then one "company name" node and two "trade name" nodes, and the target node "trade name" is taken as the template target node. The relative position information of "company 1", "cup A", "bottle B", "company 2", "cup C" and "bottle D" with respect to the target node "trade name" is determined respectively, and the determined relative position information forms a relative position information set. A webpage data extraction template based on the fields to be extracted of the first and second web pages is then generated according to the attribute names, attribute value types and relative position information set of the fields "company name: company 1", "trade name: cup A", "trade name: bottle B", "company name: company 2", "trade name: cup C" and "trade name: bottle D".
In this way, the occurrence times of the target nodes are counted, the target node with the highest occurrence times is selected as the template target node, system resources can be saved, and the accuracy of extracting sample parameters is improved.
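Choosing the template target node by occurrence count can be sketched with a counter. The list below uses the counts from the example (one "company name" and two "trade name" target nodes); the variable names are hypothetical.

```python
from collections import Counter

# Target nodes determined from the fields to be extracted, as in the example.
target_nodes = ["company name", "trade name", "trade name"]

counts = Counter(target_nodes)
# most_common(1) returns a one-element list holding the (node, count) pair
# with the highest count.
template_target_node, occurrences = counts.most_common(1)[0]
```

With these counts, "trade name" is selected, and the positions of all attribute values are then expressed relative to that single node.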
Optionally, referring to fig. 5, the apparatus for generating a webpage data extraction template further includes:
a setting module 305, configured to set sample parameters for a sample web page, where the sample parameters include: database names to be saved, the name of the type of the web page to which the database belongs and the name of the template.
In this embodiment, the template name may be determined by user definition, where the database name to be saved designates a database storing fields to be annotated, and the web page type name is one of the types of news page, shopping page, search result page, and the like.
It should be added that the webpage data extraction template can be adjusted after it is generated. Adjusting the template may proceed as follows: the labeling tool displays the webpage data extraction template; the user opens a new page, fills in the template name in the labeling tool, and clicks Test. The labeling tool page then displays the extraction result obtained by extracting fields according to the webpage data extraction template, so that the user can check it. If the test reveals no problem, labeling for this type of page is finished; if the test fails, only the incorrectly extracted field is corrected by the labeling module 301, the first generating module 302, the processing module 303 and the second generating module 304 until the test result is correct.
According to the webpage data extraction template generation device, when the number of webpage extraction samples is greater than or equal to 2, a field sample set is generated for the fields to be extracted in the webpage extraction samples, and a target node of each field to be extracted is determined based on the field sample set. The field to be extracted is then compared with the target node to determine the position of the attribute value relative to the target node, and the webpage data extraction template is generated according to the attribute name and attribute value type of the field to be extracted and the determined relative position information.
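The test step can be sketched as a comparison that singles out only the incorrectly extracted fields. In the embodiment the user inspects the result in the labeling tool; the sketch assumes instead that the values the user expects are available as a dictionary, and the function name `incorrectly_extracted` and the sample values are hypothetical.

```python
from typing import Dict, List


def incorrectly_extracted(extracted: Dict[str, str], expected: Dict[str, str]) -> List[str]:
    """Return the names of fields whose extracted value is missing or wrong."""
    return sorted(
        name for name, value in expected.items() if extracted.get(name) != value
    )


# Values the user expects on the test page versus what the template produced.
expected = {"company name": "XX company", "date of production": "2018-07-03"}
extracted = {"company name": "XX company", "date of production": "date of production"}

fields_to_relabel = incorrectly_extracted(extracted, expected)
```

Only the fields in `fields_to_relabel` would need to go through labeling and template generation again; an empty list corresponds to a test that passes.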
The embodiment of the invention provides a computer device, which comprises a processor, wherein the processor is used for realizing the steps of the webpage data extraction template generation method provided by the method embodiments when executing a computer program in a memory.
For example, a computer program may be split into one or more modules, one or more modules stored in memory and executed by a processor to perform the present invention. One or more modules may be a series of computer program instruction segments capable of performing particular functions to describe the execution of a computer program in a computer device. For example, the computer program may be divided into the steps of the web page data extraction template generation method provided by the above-described respective method embodiments.
It will be appreciated by those skilled in the art that the foregoing description of computer apparatus is merely an example and is not intended to be limiting, and that more or fewer components than the foregoing description may be included, or certain components may be combined, or different components may be included, for example, input-output devices, network access devices, buses, etc.
The processor may be a central processing unit (CPU), another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the computer device and connects the various parts of the overall computer device using various interfaces and lines.
The memory may be used to store the computer program and/or modules, and the processor implements the various functions of the computer device by running or executing the computer program and/or modules stored in the memory and by invoking data stored in the memory. The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, phonebook, etc.) created according to the use of the handset, and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, memory, plug-in hard disk, Smart Media Card (SMC), Secure Digital (SD) card, flash card, at least one disk storage device, flash memory device, or other volatile solid-state storage device.
The modules/units integrated in the computer device may be stored in a computer readable storage medium if they are implemented in the form of software functional units and sold or used as a stand-alone product. Based on this understanding, the present invention may implement all or part of the procedures in the methods of the above embodiments, which may also be accomplished by instructing related hardware through a computer program. The computer program may be stored in a computer readable storage medium, and when executed by a processor it implements the steps of the above embodiments of the webpage data extraction template generation method. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier wave signal, an electrical signal, a software distribution medium, and so forth.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.
Claims (12)
1. The webpage data extraction template generation method is characterized by comprising the following steps:
labeling a field to be extracted from a sample webpage, setting field attributes for the field to be extracted, and storing the field to be extracted and the field attributes as webpage extraction samples into an extraction sample set, wherein the field attributes comprise attribute names, attribute values and attribute types;
traversing all fields to be extracted in the webpage extraction samples under the condition that the number of webpage extraction samples in the extraction sample set is greater than or equal to 2, and generating a field sample set, wherein the field sample set comprises at least two fields to be extracted with the same unique identification, and the unique identification is formed according to an attribute name and a template name in the field attribute;
traversing each field to be extracted in a field sample set, converting an original HTML webpage of each field to be extracted into a corresponding DOM tree, describing the position characteristics of the DOM tree based on HTML attribute tags according to JQuery specifications, generating a corresponding CSS path, selecting a candidate node set of a current field to be extracted from the converted DOM tree according to the CSS path, comparing the candidate node set corresponding to each field to be extracted, and taking a node which is unchanged all the time in the candidate node set as a target node of the current field to be extracted;
Comparing the field to be extracted with the target node, and determining relative position information of an attribute value in the field to be extracted relative to the target node, wherein the attribute value is a field except the attribute name in the field to be extracted; and generating a webpage data extraction template according to the attribute name, the attribute value type and the determined relative position information of the field to be extracted.
2. The method for generating a template for extracting web page data according to claim 1, further comprising, before labeling the field to be extracted from the sample web page:
setting sample parameters for a sample web page, wherein the sample parameters comprise: database names to be saved, the name of the type of the web page to which the database belongs and the name of the template.
3. The method for generating a template for extracting web page data according to claim 1, wherein said traversing all the fields to be extracted in the sample for extracting web page, generating a field sample set comprises the following steps:
and generating unique identifiers for all the fields to be extracted in the webpage extraction samples according to the attribute names and the template names, and aggregating the same fields to be extracted in different webpage extraction samples through the unique identifiers to generate the field sample set.
4. The web page data extraction template generation method according to claim 1, wherein the relative position information includes: searching direction, searching step number and adjacent nodes.
5. The method for generating a web page data extraction template according to claim 1, wherein after the node which is unchanged all the time in the candidate node set is used as the target node of the field to be extracted currently, the method for generating a web page data extraction template comprises the following steps:
counting the occurrence times of each target node, selecting the target node with the highest occurrence times as a template target node, determining the relative position information of the attribute values in all the fields to be extracted relative to the template target node, and forming a relative position information set by the relative position information of the attribute values in all the fields to be extracted relative to the template target node;
and generating a webpage data extraction template according to the attribute name, the attribute value type and the relative position information set of the field to be extracted.
6. A web page data extraction template generation apparatus, characterized in that the web page data extraction template generation apparatus includes:
The labeling module is used for labeling a field to be extracted from a sample webpage, setting field attributes for the field to be extracted, storing the field to be extracted and the field attributes as webpage extraction samples into an extraction sample set, wherein the field attributes comprise attribute names, attribute values and attribute types;
the first generation module is used for traversing all fields to be extracted in the webpage extraction samples to generate a field sample set under the condition that the number of webpage extraction samples in the extraction sample set is greater than or equal to 2, wherein the field sample set comprises at least two fields to be extracted with the same unique identification, and the unique identification is formed according to an attribute name and a template name in the field attribute;
the processing module is used for traversing each field to be extracted in the field sample set, converting an original HTML webpage of each field to be extracted into a corresponding DOM tree, describing the position characteristics of the DOM tree based on HTML attribute tags according to JQuery specifications, generating a corresponding CSS path, selecting a candidate node set of a current field to be extracted from the converted DOM tree according to the CSS path, comparing the candidate node set corresponding to each field to be extracted, and taking a node which is unchanged all the time in the candidate node set as a target node of the current field to be extracted;
The second generation module is used for comparing the field to be extracted with the target node and determining relative position information of an attribute value in the field to be extracted relative to the target node, wherein the attribute value is a field except the attribute name in the field to be extracted; and generating a webpage data extraction template according to the attribute name, the attribute value type and the determined relative position information of the field to be extracted.
7. The web page data extraction template generation apparatus according to claim 6, further comprising:
the setting module is used for setting sample parameters for the sample web page, wherein the sample parameters comprise a database name to be saved, a web page type name and a template name.
8. The apparatus for generating a web page data extraction template according to claim 6, wherein the first generating module is further configured to generate unique identifiers for all fields to be extracted in a web page extraction sample according to the attribute name and the template name, and aggregate the same fields to be extracted existing in different web page extraction samples by the unique identifiers to generate the field sample set.
9. The web page data extraction template generation apparatus of claim 6, wherein the relative position information includes: searching direction, searching step number and adjacent nodes.
10. The web page data extraction template generation apparatus of claim 6, wherein the second generation module comprises:
the processing sub-module is used for counting the occurrence times of all target nodes, selecting the target node with the highest occurrence times as a template target node, determining the relative position information of the attribute values in all the fields to be extracted relative to the template target node, and forming a relative position information set by the relative position information of the attribute values in all the fields to be extracted relative to the template target node;
the generation sub-module is used for generating a webpage data extraction template according to the attribute name, the attribute value type and the relative position information set of the field to be extracted.
11. A computer device comprising a processor for implementing the steps of the method of generating a web page data extraction template according to any one of claims 1-5 when executing a computer program in memory.
12. A computer-readable storage medium having stored thereon a computer program, characterized by: the computer program, when executed by a processor, implements a method for generating a web page data extraction template according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911302343.2A CN111125483B (en) | 2019-12-17 | 2019-12-17 | Webpage data extraction template generation method and device, computer device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911302343.2A CN111125483B (en) | 2019-12-17 | 2019-12-17 | Webpage data extraction template generation method and device, computer device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111125483A CN111125483A (en) | 2020-05-08 |
CN111125483B CN111125483B (en) | 2023-06-27 |
Family
ID=70500045
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911302343.2A Active CN111125483B (en) | 2019-12-17 | 2019-12-17 | Webpage data extraction template generation method and device, computer device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111125483B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113434748A (en) * | 2021-07-19 | 2021-09-24 | 湖南四方天箭信息科技有限公司 | Template annotation based distributed crawler method and device, computer device and computer readable storage medium |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7680858B2 (en) * | 2006-07-05 | 2010-03-16 | Yahoo! Inc. | Techniques for clustering structurally similar web pages |
US20100169311A1 (en) * | 2008-12-30 | 2010-07-01 | Ashwin Tengli | Approaches for the unsupervised creation of structural templates for electronic documents |
CN101833554B (en) * | 2009-03-09 | 2012-09-26 | 富士通株式会社 | Method and equipment for producing extraction template and method and equipment for extracting content on web pages |
JP2011053912A (en) * | 2009-09-02 | 2011-03-17 | Nec Corp | Page similarity determination apparatus, page similarity determination method and page similarity determination program |
CN102254009B (en) * | 2011-07-15 | 2013-05-01 | 福建星网锐捷通讯股份有限公司 | Method for extracting data of webpage table |
CN102955796B (en) * | 2011-08-16 | 2017-06-27 | 微软技术许可有限责任公司 | Based on frequent subtree come the method for derived record template |
CN103544176B (en) * | 2012-07-13 | 2018-08-10 | 百度在线网络技术(北京)有限公司 | Method and apparatus for generating the page structure template corresponding to multiple pages |
CN104050281A (en) * | 2014-06-26 | 2014-09-17 | 北京思特奇信息技术股份有限公司 | Webpage information extraction method and device based on http protocol |
US20160012147A1 (en) * | 2014-07-10 | 2016-01-14 | MyMojo Corporation | Asynchronous Initialization of Document Object Model (DOM) Modules |
Also Published As
Publication number | Publication date |
---|---|
CN111125483A (en) | 2020-05-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112749284B (en) | Knowledge graph construction method, device, equipment and storage medium | |
CN101464905B (en) | Web page information extraction system and method | |
US8046681B2 (en) | Techniques for inducing high quality structural templates for electronic documents | |
US20090125529A1 (en) | Extracting information based on document structure and characteristics of attributes | |
US20090063500A1 (en) | Extracting data content items using template matching | |
US20090248707A1 (en) | Site-specific information-type detection methods and systems | |
US20100169311A1 (en) | Approaches for the unsupervised creation of structural templates for electronic documents | |
US20150287047A1 (en) | Extracting Information from Chain-Store Websites | |
US10572566B2 (en) | Image quality independent searching of screenshots of web content | |
US8359307B2 (en) | Method and apparatus for building sales tools by mining data from websites | |
CN109002425B (en) | Method for acquiring upstream and downstream relations of enterprise, terminal device and medium | |
CN116127105B (en) | Data collection method and device for big data platform | |
US9767086B2 (en) | System and method for enablement of data masking for web documents | |
US10755091B2 (en) | Method and apparatus for retrieving image-text block from web page | |
US20100185684A1 (en) | High precision multi entity extraction | |
CN112417338B (en) | Page adaptation method, system and equipment | |
CN107870915B (en) | Indication of search results | |
JP2008090404A (en) | Document retrieval apparatus, method and program | |
CN101763424B (en) | Method for determining characteristic words and searching according to file content | |
CN112347324B (en) | Document query method and device, electronic equipment and storage medium | |
CN111125483B (en) | Webpage data extraction template generation method and device, computer device and storage medium | |
US20130031474A1 (en) | Method for managing discovery documents on a mobile computing device | |
CN113918713A (en) | Data annotation method and device, computer equipment and storage medium | |
CN117390329A (en) | Webpage labeling method, device and equipment | |
CN115297042B (en) | Method for detecting consistency of webpages under different networks and related equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||