CN110020236A - Web analysis method, apparatus, storage medium, processor and equipment - Google Patents

Publication number
CN110020236A
Authority
CN
China
Prior art keywords
template
url
web analysis
webpage
business scenario
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710758003.5A
Other languages
Chinese (zh)
Other versions
CN110020236B (en)
Inventor
袁园
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201710758003.5A
Publication of CN110020236A
Application granted
Publication of CN110020236B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566: URL specific, e.g. using aliases, detecting broken or misspelled links

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a web page parsing method, apparatus, storage medium, processor and device. The web page parsing method includes: obtaining a web page parsing request, where the request carries the uniform resource locator (URL) of the web page to be parsed and the business scenario in which that page is parsed; searching the pre-configured templates for a template that matches both the business scenario and the URL, where the template content of each template includes parsing rules and different templates have different parsing rules; and parsing the web page to be parsed using the parsing rules in the template found, to obtain a parsing result. The invention allows parsing rules to be configured without restarting the online program, which improves work efficiency.

Description

Web page parsing method, apparatus, storage medium, processor and device
Technical field
The present invention relates to the field of computer technology, and more specifically to a web page parsing method, apparatus, storage medium, processor and device.
Background art
Web page parsing means analyzing the source code of a web page and extracting the information that is actually wanted. In search engine development, web page parsing is a very important part.
Web pages from different websites and with different layouts generally correspond to different parsing rules. To parse pages of different websites and layouts on the same platform, the approach currently taken is as follows: when parsing each web page, the parsing rule corresponding to that page must first be configured; only then can the page be parsed using that rule; and parsing of the next page begins only after parsing of the current page is complete. Each time a new parsing rule is configured, the new rule must first be written, and the online program must then be restarted before the newly written rule takes effect.
However, because the online program has to be restarted every time to complete the configuration of a new parsing rule, restarting it repeatedly inevitably reduces work efficiency when many new parsing rules need to be configured.
Summary of the invention
In view of the above problem, the present invention is proposed in order to provide a web page parsing method, apparatus, storage medium, processor and device that overcome the above problem or at least partially solve it. The scheme is as follows:
A web page parsing method, comprising:
Obtaining a web page parsing request, wherein the request carries the uniform resource locator (URL) of the web page to be parsed and the business scenario in which that page is parsed;
Searching the pre-configured templates for a template that matches both the business scenario and the URL, wherein the template content of each template includes parsing rules, and different templates have different parsing rules;
Parsing the web page to be parsed using the parsing rules in the template found, to obtain a parsing result.
Optionally, before searching the pre-configured templates for the template that matches both the business scenario and the URL, the web page parsing method further includes: configuring the templates in a database in advance in a uniform preset storage format, wherein the storage format of the templates in the database is a column storage format that supports nested structures, the storage columns are domain name, business scenario and template object, and the template object specifically includes the URL regular-expression matching rule of the template and the template content.
Wherein, searching the pre-configured templates for the template that matches both the business scenario and the URL comprises:
Using the domain name in the URL as a keyword, retrieving the templates in the database and selecting the templates corresponding to the domain name in the URL;
Using the business scenario as a keyword, performing a second retrieval on the selected templates corresponding to the domain name in the URL, and selecting from them the templates corresponding to the business scenario;
Matching the URL against the URL regular-expression matching rules of the selected templates corresponding to the business scenario, and finding the template that matches successfully.
Optionally, after the templates are configured in the database in advance in the uniform preset storage format, the web page parsing method further includes:
Creating a cache pool locally in advance, and starting a background thread at the back end; the background thread is used to periodically update the templates in the database into the cache pool.
Optionally, the template content further includes a call instruction;
Correspondingly, after the web page to be parsed is parsed using the parsing rules in the template found and the parsing result is obtained, the web page parsing method further includes:
Calling a pre-configured common parsing component according to the call instruction in the template found, and using the common parsing component to process the fields in the parsing result that need secondary parsing, to obtain a secondary parsing result;
Wherein, the common parsing component is a parser, and different common parsing components have different parsing capabilities.
A web page parsing apparatus, comprising:
A preprocessing unit, configured to pre-configure the templates, wherein the template content of each template includes parsing rules, and different templates have different parsing rules;
An acquiring unit, configured to obtain a web page parsing request, wherein the request carries the URL of the web page to be parsed and the business scenario in which that page is parsed;
A searching unit, configured to search the pre-configured templates for a template that matches both the business scenario and the URL;
A first parsing unit, configured to parse the web page to be parsed using the parsing rules in the template found by the searching unit, to obtain a parsing result.
Optionally, the template content further includes a call instruction;
Correspondingly, the web page parsing apparatus further comprises: a second parsing unit, configured to call a pre-configured common parsing component according to the call instruction in the template found, and to use the common parsing component to process the fields in the parsing result that need secondary parsing, to obtain a secondary parsing result;
Wherein, the common parsing component is a parser, and different common parsing components have different parsing capabilities.
A storage medium on which a program is stored, wherein the program, when executed by a processor, implements any of the web page parsing methods disclosed above.
A processor for running a program, wherein the program, when running, executes any of the web page parsing methods disclosed above.
A device comprising a processor, a memory, and a program stored on the memory and runnable on the processor, wherein the processor implements any of the web page parsing methods disclosed above when executing the program.
With the above technical solution, the web page parsing method, apparatus, storage medium, processor and device provided by the invention can pre-configure the parsing rules. When web pages of different websites and layouts are then parsed on the same platform, the parsing rule that matches each page can be retrieved directly from the pre-configured parsing rules and used to parse that page. Because the invention can complete the configuration of the matching parsing rules without restarting the online program, work efficiency is improved.
The above is only an overview of the technical solution of the invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features and advantages of the invention are more readily apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 shows a flowchart of a web page parsing method provided by an embodiment of the present invention;
Fig. 2 shows a flowchart of another web page parsing method provided by an embodiment of the present invention;
Fig. 3 shows a flowchart of yet another web page parsing method provided by an embodiment of the present invention;
Fig. 4 shows a schematic structural diagram of a web page parsing apparatus provided by an embodiment of the present invention;
Fig. 5 shows a schematic structural diagram of another web page parsing apparatus provided by an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope fully conveyed to those skilled in the art.
As shown in Fig. 1, a web page parsing method provided by an embodiment of the present invention includes:
Step S01: obtain a web page parsing request, where the request carries the URL (Uniform Resource Locator) of the web page to be parsed and the business scenario in which that page is parsed.
Specifically, a URL is a concise representation of the location of, and the access method for, a resource available on the internet. It is the address of a standard resource on the internet, commonly called a "web address". A URL contains information such as the domain name.
For the same web page, the content that needs to be parsed differs between business scenarios. For example, when parsing a Baidu search results page, business scenario A requires the advertisement content recommended by Baidu to be parsed, whereas business scenario B requires the ordinary search results to be parsed. In the web page parsing request, the business scenario can be represented by an identifier that uniquely corresponds to it.
Step S02: search the pre-configured templates for a template that matches both the business scenario and the URL, where the template content of each template includes parsing rules, and different templates have different parsing rules.
Specifically, the URL of the web page to be parsed together with the business scenario in which it is parsed uniquely determines the parsing rules for that page. On this basis, this embodiment pre-configures the parsing rules corresponding to web pages of different websites and layouts (for example, by configuring them uniformly in a database in advance), with each set of parsing rules stored in its own template. When the current web page needs to be parsed, the template corresponding to it can be found directly among the pre-configured templates according to its URL and the business scenario in which it is parsed, and the parsing rules in that template are obtained.
During preprocessing, in order to configure the templates uniformly in the database, the parsing rules of a template and the meaning of each node name need to be defined. For example, the parsing rules of a template may be defined as a string in JSON (JavaScript Object Notation) form made up of key-value pairs, where the key denotes the field type and the value is the XPath (XML Path Language) expression used when parsing the page. Each node key in the JSON takes one of two types: the node attribute value type and the business field type. The node attribute value type mainly expresses things such as the level of the current node and a flag indicating that the parsed content needs special processing; the business field type is closely tied to the specific fields of the web page to be parsed. The two types complement each other, and neither can be omitted.
An application example of defining the parsing rules of a template and the meaning of each node name in this way is given below.
Taking a video web page as the page to be parsed, the parsing rules in the corresponding template may be as in Example 1-1 below.
Example 1-1
In Example 1-1, the node names have the following meanings: the xpath field is of the node attribute value type; Video, Title and ViewCount are specific business field types; and the result value matched for the Videos attribute is the content, in array form, that satisfies the XPath.
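The content of Example 1-1 is not reproduced in this text, so the following is only a minimal sketch of what a key-value template of the kind described above might look like and how it could be applied. The nesting of `Videos`, `xpath`, `Video`, `Title` and `ViewCount`, the XPath expressions, the sample page and the helper `apply_template` are all assumptions made for illustration, not the patent's actual template. Python's standard `xml.etree.ElementTree` supports only a subset of XPath and needs well-formed markup; a real page would likely need a full HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical template (NOT the patent's Example 1-1): "xpath" plays the role
# of a node-attribute key, the other keys are business fields.
TEMPLATE = {
    "Videos": {
        "xpath": ".//div[@class='video']",          # selects every list item
        "Video": {
            "Title": "./span[@class='title']",      # relative to each item
            "ViewCount": "./span[@class='views']",
        },
    }
}

# Hypothetical, well-formed sample page.
PAGE = """<html><body>
<div class="video"><span class="title"> NBA All-Star Game </span><span class="views">23 times</span></div>
<div class="video"><span class="title"> Badminton World Championships </span><span class="views">45 times</span></div>
</body></html>"""


def apply_template(root, template):
    """Apply each list rule in the template and collect one dict per matched node."""
    result = {}
    for list_field, rule in template.items():
        item_rules = rule["Video"]
        result[list_field] = [
            {name: node.findtext(path, default="") for name, path in item_rules.items()}
            for node in root.findall(rule["xpath"])
        ]
    return result


parsed = apply_template(ET.fromstring(PAGE), TEMPLATE)
```

In this sketch `parsed["Videos"]` is a list with one dictionary per matched node, and the raw values still carry surrounding spaces and the "times" suffix.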
In addition, when the templates are configured in the database, the storage format of the templates in the database also needs to be determined. To make it easy to manage the templates in the database and to find a template quickly, the storage format of the templates in the database may be a column storage format that supports nested structures, with three storage columns: domain name, business scenario and template object, where the template object specifically includes the URL regular-expression matching rule of the template and the template content. That is, the storage format of the templates in the database is Dictionary<string, Dictionary<string, List<Template>>>, which reads as Dictionary<domain name, Dictionary<scenario, List<template object>>>, where the template object specifically includes the URL regular-expression matching rule of the template and the template content.
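As a rough illustration of this nested layout, the sketch below mirrors Dictionary<domain name, Dictionary<scenario, List<template object>>> with plain Python dictionaries. The class name `Template`, its field names, the helper `add_template` and the sample domain and scenario identifiers are hypothetical; the text does not specify how the database itself is implemented.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Template:
    """Template object: a URL regex matching rule plus the template content."""
    url_pattern: str
    content: dict = field(default_factory=dict)


# domain name -> business scenario -> list of template objects
TemplateStore = Dict[str, Dict[str, List[Template]]]


def add_template(store: TemplateStore, domain: str, scenario: str, template: Template) -> None:
    """Register a template under its domain name and business scenario."""
    store.setdefault(domain, {}).setdefault(scenario, []).append(template)


store: TemplateStore = {}
add_template(
    store,
    "video.example.com",            # hypothetical domain
    "video_list",                   # hypothetical scenario identifier
    Template(r"^https?://video\.example\.com/list", {"Videos": {"xpath": ".//div"}}),
)
```

Keying first by domain and then by scenario means a lookup only ever runs regular expressions against the small list at the innermost level.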
Based on this storage format, when the database is searched for the template that matches both the business scenario and the URL, and in order to find the template quickly, step S02 specifically includes, as shown in Fig. 2:
Step S021: using the domain name in the URL carried in the web page parsing request as a keyword, retrieve the templates in the database and select the templates corresponding to the domain name in the URL;
Step S022: using the business scenario carried in the web page parsing request as a keyword, perform a second retrieval on the selected templates corresponding to the domain name in the URL, and select from them the templates corresponding to the business scenario;
Step S023: match the URL against the URL regular-expression matching rules of the selected templates corresponding to the business scenario, and find the template that matches successfully.
In brief, the lookup shown in Fig. 2 uses, in turn, the domain name of the web page to be parsed, the business scenario in which it is parsed, and its URL as keywords. It first selects from the database the set of templates under the domain name, then selects from that set the templates under the business scenario, and finally uses the URL to pick the single matching template from the set obtained by the second selection. This lookup reduces the amount of URL regular-expression matching and so achieves fast template lookup.
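Steps S021 to S023 can be sketched as two dictionary lookups followed by a regular-expression scan over the remaining candidates. This is an illustration under assumptions rather than the patented implementation: `Template`, `find_template`, the in-memory store and the sample URLs are hypothetical, and returning the first matching template is a simplification of "the template that matches successfully".

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional
from urllib.parse import urlparse


@dataclass
class Template:
    url_pattern: str   # URL regex matching rule
    content: dict      # template content (parsing rules, etc.)


TemplateStore = Dict[str, Dict[str, List[Template]]]


def find_template(store: TemplateStore, url: str, scenario: str) -> Optional[Template]:
    domain = urlparse(url).hostname or ""
    by_scenario = store.get(domain, {})        # S021: filter by domain name
    candidates = by_scenario.get(scenario, []) # S022: filter by business scenario
    for template in candidates:                # S023: regex match on the URL
        if re.search(template.url_pattern, url):
            return template
    return None


store: TemplateStore = {
    "video.example.com": {
        "video_list": [
            Template(r"/detail/\d+", {"name": "detail"}),
            Template(r"/list", {"name": "list"}),
        ],
        "ads": [Template(r"/list", {"name": "ads-list"})],
    }
}

match = find_template(store, "https://video.example.com/list?page=2", "video_list")
```

Only the templates that survive both dictionary lookups are ever tested against the URL, which is where the saving in regular-expression matching comes from.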
To further increase the speed of template lookup on the basis of the lookup shown in Fig. 2, a cache pool may also be created locally in advance and a background thread started at the back end, the background thread being used to periodically (for example, every minute) update the templates in the database into the cache pool. The benefit of the cache pool is that it is stored locally, so lookups are fast. The purpose of the background thread is to keep the templates in the cache pool up to date. Specifically, the templates are stored in the database, while external callers use the templates in the cache pool; if a template in the database is modified, the modification must also be propagated dynamically to the cache pool. The corresponding operation is to start a background thread at the back end that periodically (for example, every minute) updates the templates in the database into the cache pool. The storage format of the templates in the cache pool is likewise Dictionary<string, Dictionary<string, List<Template>>>.
Step S03: parse the web page to be parsed using the parsing rules in the template found, to obtain a parsing result.
Taking Example 1-1 as an example, parsing an existing sports-program video web page with Example 1-1 outputs the parsing result shown in Example 1-2.
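A minimal sketch of such a cache pool with a periodic background refresh follows. The class `TemplateCache`, the loader callback standing in for a database query, and the default 60-second interval are assumptions for illustration; the text does not say how the thread or the database access is implemented.

```python
import threading
from typing import Callable, Dict, List

TemplateStore = Dict[str, Dict[str, List[dict]]]


class TemplateCache:
    """Local cache pool refreshed from the database by a background thread."""

    def __init__(self, load_from_db: Callable[[], TemplateStore], interval_seconds: float = 60.0):
        self._load = load_from_db            # hypothetical stand-in for a database query
        self._interval = interval_seconds
        self._lock = threading.Lock()
        self._templates: TemplateStore = {}
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def refresh_once(self) -> None:
        fresh = self._load()
        with self._lock:                     # swap the whole snapshot at once
            self._templates = fresh

    def _run(self) -> None:
        # Event.wait returns False on timeout and True once stop() has been called.
        while not self._stop.wait(self._interval):
            self.refresh_once()

    def start(self) -> None:
        self.refresh_once()                  # fill the cache before serving lookups
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        if self._thread.is_alive():
            self._thread.join()

    def get(self, domain: str, scenario: str) -> List[dict]:
        with self._lock:
            return list(self._templates.get(domain, {}).get(scenario, []))
```

Callers read only from the cache, so a template edited in the database becomes visible after at most one refresh interval, without restarting the process.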
Example 1-2
After the web page to be parsed has been parsed using the parsing rules in the template found, the parsing result is fed back directly to the caller. The web page parsing method disclosed in this embodiment is a stateless service; that is, it does not change as the caller changes.
From the above description of this embodiment, it can be seen that the web page parsing method provided here can pre-configure the parsing rules. When web pages of different websites and layouts are then parsed on the same platform, the parsing rule that matches each page can be retrieved directly from the pre-configured parsing rules and used to parse that page, with no need to restart the online program to complete the configuration of the matching parsing rules, which improves work efficiency.
It should also be noted that both the web page parsing method used in the prior art and the method disclosed above in the embodiment of the invention encounter the same problem in use: after a page has been parsed using the parsing rules, some of the parsed fields may not be the content ultimately wanted, and special processing is needed to obtain the final desired result. To achieve this, the prior art hard-codes, after each page is parsed, a program to handle the fields in the parsing result that need special processing. In other words, every such field requires its own custom-written parser before the desired result is obtained, so the programming workload is too great. To obtain the final desired parsing result while avoiding an excessive programming workload, the embodiment of the invention proposes another web page parsing method on the basis of the method disclosed above. As shown in Fig. 3, it specifically includes:
Step S01: obtain a web page parsing request, where the request carries the uniform resource locator (URL) of the web page to be parsed and the business scenario in which that page is parsed.
Step S02: search the pre-configured templates for a template that matches both the business scenario and the URL, where the template content of each template includes parsing rules and a call instruction, and different templates have different parsing rules.
Step S03: parse the web page to be parsed using the parsing rules in the template found, to obtain a parsing result.
Step S04: call a pre-configured common parsing component according to the call instruction in the template found, and use the common parsing component to process the fields in the parsing result that need secondary parsing, to obtain a secondary parsing result. The common parsing component is a parser, and different common parsing components have different parsing capabilities. The common parsing components may be configured in the same database as the templates.
The web page parsing method shown in Fig. 3 builds on the method disclosed above. The improvement is that the template content further includes a call instruction: each time a web page has been parsed using the parsing rules, the corresponding common parsing components are also called to process each field in the parsing result that needs special processing, so as to obtain the final desired parsing result. These common parsing components are pre-configured in this embodiment, and any of them can be called by multiple templates rather than being dedicated to one field in the parsing result of one web page, which greatly reduces the programming workload.
For example, in the parsing result shown in Example 1-2 above, "NBA All-Star Game" and "Badminton World Championships" contain spaces, whereas the desired parsing result is a field without spaces. For this, the embodiment predefines a common parsing component whose capability is TrimeTransfomation, that is, it removes the spaces from the parsed field. As another example, for "23 times" and "45 times" in the parsing result shown in Example 1-2, a common parsing component is predefined whose capability is IntegerExtractTransformation, which performs regular-expression matching for numeric values so that "23" and "45" are selected as the final output. The resulting new template content and final output are shown in Example 1-3 below.
The template content newly obtained:
Final output:
Example 1-3
Evidently, the output in Example 1-3 is exactly the parsing result ultimately desired.
Corresponding to the above method embodiments, the present invention also provides a web page parsing apparatus.
As shown in Fig. 4, a web page parsing apparatus provided by an embodiment of the present invention includes:
A preprocessing unit 100, configured to pre-configure the templates, where the template content of each template includes parsing rules, and different templates have different parsing rules;
An acquiring unit 200, configured to obtain a web page parsing request, where the request carries the URL of the web page to be parsed and the business scenario in which that page is parsed;
A searching unit 300, configured to search the pre-configured templates for a template that matches both the business scenario and the URL;
A first parsing unit 400, configured to parse the web page to be parsed using the parsing rules in the template found by the searching unit 300, to obtain a parsing result.
Optionally, the preprocessing unit 100 is specifically configured to configure the templates in a database in advance in a uniform preset storage format, where the storage format of the templates in the database is a column storage format that supports nested structures, the storage columns are domain name, business scenario and template object, and the template object specifically includes the URL regular-expression matching rule of the template and the template content.
Optionally, the searching unit 300 is specifically configured to: using the domain name in the URL as a keyword, retrieve the templates in the database and select the templates corresponding to the domain name in the URL; then, using the business scenario as a keyword, perform a second retrieval on the selected templates corresponding to the domain name in the URL and select from them the templates corresponding to the business scenario; and then match the URL against the URL regular-expression matching rules of the selected templates corresponding to the business scenario, and find the template that matches successfully.
Optionally, the preprocessing unit 100 is further configured to create a cache pool locally in advance and to start a background thread at the back end; the background thread is used to periodically update the templates in the database into the cache pool.
Optionally, the template content further includes a call instruction. Correspondingly, as shown in Fig. 5, the web page parsing apparatus further includes a second parsing unit 500, configured to call a pre-configured common parsing component according to the call instruction in the template found, and to use the common parsing component to process the fields in the parsing result that need secondary parsing, to obtain a secondary parsing result. The common parsing component is a parser, and different common parsing components have different parsing capabilities.
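The secondary parsing step can be sketched as a registry of reusable components that a template refers to by name. The registry keys below are modeled on the two capabilities named in the text, but the function names, the shape of the call instruction (a mapping from field to component names) and the sample record are assumptions; Example 1-3 itself is not reproduced here. `str.strip()` removes only leading and trailing whitespace, which is one possible reading of the space-removal operation.

```python
import re
from typing import Callable, Dict, List


def trim_transformation(value: str) -> str:
    """Remove leading and trailing whitespace from a parsed field."""
    return value.strip()


def integer_extract_transformation(value: str) -> str:
    """Keep the first run of digits; leave the value unchanged if there is none."""
    found = re.search(r"\d+", value)
    return found.group(0) if found else value


# Common parsing components, shared by all templates and looked up by name.
COMPONENTS: Dict[str, Callable[[str], str]] = {
    "TrimTransformation": trim_transformation,
    "IntegerExtractTransformation": integer_extract_transformation,
}


def apply_call_instructions(record: Dict[str, str], instructions: Dict[str, List[str]]) -> Dict[str, str]:
    """Run the named components, in order, on each field listed in the instructions."""
    out = dict(record)
    for field_name, component_names in instructions.items():
        for name in component_names:
            out[field_name] = COMPONENTS[name](out[field_name])
    return out


# Hypothetical call instruction and first-pass parsing result.
instructions = {"Title": ["TrimTransformation"], "ViewCount": ["IntegerExtractTransformation"]}
raw = {"Title": " NBA All-Star Game ", "ViewCount": "23 times"}
final = apply_call_instructions(raw, instructions)
```

Because a template only names the components it needs, adding a new cleanup rule to a template requires no new field-specific parser.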
The web page parsing apparatus includes a processor and a memory. The preprocessing unit 100, acquiring unit 200, searching unit 300, first parsing unit 400, second parsing unit 500 and so on are stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.
The processor contains a kernel, and the kernel retrieves the corresponding program unit from the memory. One or more kernels may be provided, and web page parsing is implemented by adjusting kernel parameters.
The memory may include non-permanent memory in computer-readable media, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.
Because the apparatus embodiment essentially corresponds to the method embodiment, it is described relatively briefly; for relevant details, refer to the description of the method embodiment.
An embodiment of the invention provides a storage medium on which a program is stored, where the program, when executed by a processor, implements any of the web page parsing methods disclosed above.
An embodiment of the invention provides a processor for running a program, where the program, when running, executes any of the web page parsing methods disclosed above.
An embodiment of the invention provides a device including a processor, a memory, and a program stored on the memory and runnable on the processor. When executing the program, the processor implements the following steps:
Web analysis request is obtained, wherein carrying the unified resource positioning of webpage to be resolved in web analysis request Device URL and parse the webpage to be resolved when where business scenario;
The template to match simultaneously with the business scenario and the URL is searched from pre-configured each template, Described in the template content of template include resolution rules, and different templates have different resolution rules;
The webpage to be resolved is parsed using the resolution rules in the template found, obtains parsing result.
Optionally, it is searching from pre-configured each template while matching with the business scenario and the URL Template before, the web analysis method further include: in advance by each template with the unified configuration of default storage format in database In, wherein the storage format of each template is using the column storage format for supporting nested structure, storage column point in the database For domain name, business scenario and template object, the template object is specifically included in the URL canonical matching rule and template of template Hold.
Optionally, it is described from pre-configured each template search simultaneously with the business scenario and the URL phase The template matched, specifically includes:
Using the domain name in the URL as keyword, the template in the database is retrieved, is screened out from it institute State template corresponding to the domain name in URL;
Using the business scenario as keyword, two are carried out to template corresponding to the domain name in the URL filtered out Secondary retrieval is screened out from it template corresponding to the business scenario;
The URL is matched with the URL canonical matching rule of template corresponding to the business scenario filtered out, Find out the template of successful match.
Optionally, described in advance to unify each template after configuration in the database with default storage format, further includes: pre- A cache pool first is being locallyd create, while opening a background thread in rear end;The background thread will be for periodically will Template in the database is updated into the cache pool.
Optionally, the template content further includes an invocation instruction;
Correspondingly, after the webpage to be parsed is parsed using the parsing rules in the template found and the parsing result is obtained, the method further includes:
Invoking a pre-configured common parsing component according to the invocation instruction in the template found, and using the common parsing component to process the fields in the parsing result that require secondary parsing, to obtain a secondary parsing result;
Wherein the common parsing component is a parser, and different common parsing components have different parsing capabilities.
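Secondary parsing can be sketched as a registry of shared parsers selected by the template. The text does not define the format of the invocation instruction; the sketch assumes it is a mapping from field name to the name of a common parsing component. The parsers `parse_date` and `parse_int` and the registry `COMMON_PARSERS` are hypothetical.

```python
from datetime import datetime


def parse_date(value):
    """Common component: normalize a YYYY-MM-DD string to ISO format."""
    return datetime.strptime(value.strip(), "%Y-%m-%d").date().isoformat()


def parse_int(value):
    """Common component: turn a string such as '1,234' into an integer."""
    return int(value.replace(",", "").strip())


# Registry of pre-configured common parsing components (assumed names).
COMMON_PARSERS = {"date": parse_date, "int": parse_int}


def secondary_parse(result, invocations):
    """Re-process the fields named in the invocation instructions.

    Assumption: `invocations` maps a field name to the name of the
    common parsing component that should handle that field.
    """
    refined = dict(result)  # leave the first-pass result untouched
    for field_name, parser_name in invocations.items():
        if field_name in refined:
            refined[field_name] = COMMON_PARSERS[parser_name](refined[field_name])
    return refined


first_pass = {"title": "Hello", "published": " 2017-08-29 ", "views": "1,234"}
second_pass = secondary_parse(first_pass, {"published": "date", "views": "int"})
```

Keeping these parsers outside the templates lets many templates reuse the same component instead of each repeating its own cleanup logic.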
The equipment herein may be a server, a PC, a tablet (PAD), a mobile phone, or the like.
An embodiment of the present invention further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:
Obtaining a webpage parsing request, wherein the webpage parsing request carries the uniform resource locator (URL) of the webpage to be parsed and the business scenario in which the webpage to be parsed is parsed;
Searching the pre-configured templates for a template that matches both the business scenario and the URL, wherein the template content of each template includes parsing rules, and different templates have different parsing rules;
Parsing the webpage to be parsed using the parsing rules in the template found, to obtain a parsing result.
Optionally, before searching the pre-configured templates for a template that matches both the business scenario and the URL, the webpage parsing method further includes: configuring the templates in a database in advance in a unified preset storage format, wherein each template is stored in the database in a columnar storage format that supports nested structures, the stored columns are domain name, business scenario, and template object, and the template object comprises the template's URL regular-expression matching rule and the template content.
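The stored layout described above can be pictured as one row per template with three columns, the third of which is nested. The sketch below models such a row in memory; it is not the schema of any particular columnar database, and the class names, field names, and sample values are assumptions.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class TemplateObject:
    """Nested column: the URL regex rule plus the template content."""
    url_pattern: str
    content: dict = field(default_factory=dict)  # parsing rules and, optionally, invocation instructions


@dataclass
class TemplateRow:
    """One stored row with the three columns named in the text."""
    domain: str
    business_scenario: str
    template: TemplateObject


# Hypothetical row, for illustration only.
row = TemplateRow(
    domain="news.example.com",
    business_scenario="public_opinion",
    template=TemplateObject(
        url_pattern=r"https?://news\.example\.com/article/\d+\.html",
        content={"parsing_rules": {"title": r"<h1>(.*?)</h1>"}},
    ),
)

# Nested dictionary form, as it might be serialized before storage.
record = asdict(row)
```

Keeping domain name and business scenario as top-level columns is what allows them to serve as retrieval keywords, while the nested object carries everything needed once a template has been selected.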
Optionally, searching the pre-configured templates for a template that matches both the business scenario and the URL specifically includes:
Using the domain name in the URL as a keyword, retrieving the templates in the database and selecting from them the templates corresponding to the domain name in the URL;
Using the business scenario as a keyword, performing a second retrieval on the templates selected for the domain name, and selecting from them the templates corresponding to the business scenario;
Matching the URL against the URL regular-expression matching rules of the templates selected for the business scenario, and finding the template that matches successfully.
Optionally, after the templates are configured in the database in advance in the unified preset storage format, the method further includes: creating a cache pool locally in advance and starting a background thread at the back end, wherein the background thread periodically updates the templates in the database into the cache pool.
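The cache pool and its background thread can be sketched as below. The text does not specify the refresh interval, the locking strategy, or how the database is read; the class `TemplateCache`, the `load_templates` callable, and the interval parameter are assumptions for illustration.

```python
import threading


class TemplateCache:
    """Local cache pool refreshed periodically by a background thread."""

    def __init__(self, load_templates, interval_seconds):
        self._load = load_templates          # callable that reads the database
        self._interval = interval_seconds
        self._lock = threading.Lock()
        self._pool = list(load_templates())  # initial fill of the cache pool
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._refresh_loop, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _refresh_loop(self):
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval):
            fresh = list(self._load())
            with self._lock:
                self._pool = fresh

    def snapshot(self):
        """Return a copy of the templates currently in the cache pool."""
        with self._lock:
            return list(self._pool)
```

With such a cache, lookups read from local memory instead of querying the database for each parsing request, at the cost of templates being stale for at most one refresh interval.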
Optionally, the template content further includes an invocation instruction;
Correspondingly, after the webpage to be parsed is parsed using the parsing rules in the template found and the parsing result is obtained, the method further includes:
Invoking a pre-configured common parsing component according to the invocation instruction in the template found, and using the common parsing component to process the fields in the parsing result that require secondary parsing, to obtain a secondary parsing result;
Wherein the common parsing component is a parser, and different common parsing components have different parsing capabilities.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is executed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, commodity, or device that includes the element.
The above are merely embodiments of the present application and are not intended to limit the present application. For those skilled in the art, various modifications and changes may be made to the present application. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.

Claims (10)

1. A webpage parsing method, characterized by comprising:
Obtaining a webpage parsing request, wherein the webpage parsing request carries the uniform resource locator (URL) of the webpage to be parsed and the business scenario in which the webpage to be parsed is parsed;
Searching the pre-configured templates for a template that matches both the business scenario and the URL, wherein the template content of each template includes parsing rules, and different templates have different parsing rules;
Parsing the webpage to be parsed using the parsing rules in the template found, to obtain a parsing result.
2. The webpage parsing method according to claim 1, characterized in that, before searching the pre-configured templates for a template that matches both the business scenario and the URL, the webpage parsing method further comprises: configuring the templates in a database in advance in a unified preset storage format, wherein each template is stored in the database in a columnar storage format that supports nested structures, the stored columns are domain name, business scenario, and template object, and the template object comprises the template's URL regular-expression matching rule and the template content.
3. The webpage parsing method according to claim 2, characterized in that searching the pre-configured templates for a template that matches both the business scenario and the URL comprises:
Using the domain name in the URL as a keyword, retrieving the templates in the database and selecting from them the templates corresponding to the domain name in the URL;
Using the business scenario as a keyword, performing a second retrieval on the templates selected for the domain name, and selecting from them the templates corresponding to the business scenario;
Matching the URL against the URL regular-expression matching rules of the templates selected for the business scenario, and finding the template that matches successfully.
4. The webpage parsing method according to claim 2, characterized in that, after the templates are configured in the database in advance in the unified preset storage format, the webpage parsing method further comprises:
Creating a cache pool locally in advance and starting a background thread at the back end, wherein the background thread periodically updates the templates in the database into the cache pool.
5. The webpage parsing method according to any one of claims 1 to 4, characterized in that the template content further includes an invocation instruction;
Correspondingly, after the webpage to be parsed is parsed using the parsing rules in the template found and the parsing result is obtained, the webpage parsing method further comprises:
Invoking a pre-configured common parsing component according to the invocation instruction in the template found, and using the common parsing component to process the fields in the parsing result that require secondary parsing, to obtain a secondary parsing result;
Wherein the common parsing component is a parser, and different common parsing components have different parsing capabilities.
6. A webpage parsing device, characterized by comprising:
A preprocessing unit, configured to pre-configure the templates, wherein the template content of each template includes parsing rules, and different templates have different parsing rules;
An obtaining unit, configured to obtain a webpage parsing request, wherein the webpage parsing request carries the URL of the webpage to be parsed and the business scenario in which the webpage to be parsed is parsed;
A searching unit, configured to search the pre-configured templates for a template that matches both the business scenario and the URL;
A first parsing unit, configured to parse the webpage to be parsed using the parsing rules in the template found by the searching unit, to obtain a parsing result.
7. The webpage parsing device according to claim 6, characterized in that the template content further includes an invocation instruction;
Correspondingly, the webpage parsing device further comprises: a second parsing unit, configured to invoke a pre-configured common parsing component according to the invocation instruction in the template found, and to use the common parsing component to process the fields in the parsing result that require secondary parsing, to obtain a secondary parsing result;
Wherein the common parsing component is a parser, and different common parsing components have different parsing capabilities.
8. A storage medium having a program stored thereon, characterized in that the program, when executed by a processor, implements the webpage parsing method according to any one of claims 1 to 5.
9. A processor for running a program, characterized in that the program, when run, performs the webpage parsing method according to any one of claims 1 to 5.
10. Equipment comprising a processor, a memory, and a program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the webpage parsing method according to any one of claims 1 to 5.
CN201710758003.5A 2017-08-29 2017-08-29 Webpage parsing method, device, storage medium, processor and equipment Active CN110020236B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710758003.5A CN110020236B (en) 2017-08-29 2017-08-29 Webpage parsing method, device, storage medium, processor and equipment

Publications (2)

Publication Number Publication Date
CN110020236A (en) 2019-07-16
CN110020236B (en) 2021-11-30

Family

ID=67186156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710758003.5A Active CN110020236B (en) 2017-08-29 2017-08-29 Webpage parsing method, device, storage medium, processor and equipment

Country Status (1)

Country Link
CN (1) CN110020236B (en)

Also Published As

Publication number Publication date
CN110020236B (en) 2021-11-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 100080 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing
Applicant after: Beijing Guoshuang Technology Co.,Ltd.
Address before: 100086 Beijing city Haidian District Shuangyushu Area No. 76 Zhichun Road cuigongfandian 8 layer A
Applicant before: Beijing Guoshuang Technology Co.,Ltd.
GR01 Patent grant