CN110020236A - Web analysis method, apparatus, storage medium, processor and equipment - Google Patents
- Publication number
- CN110020236A
- Authority
- CN
- China
- Prior art keywords
- template
- url
- web analysis
- webpage
- business scenario
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a web analysis method, apparatus, storage medium, processor and equipment. The web analysis method includes: obtaining a web analysis request, wherein the request carries the uniform resource locator (URL) of the webpage to be resolved and the business scenario in which the webpage to be resolved is parsed; searching the pre-configured templates for the template that matches both the business scenario and the URL, wherein the template content of each template includes resolution rules and different templates have different resolution rules; and parsing the webpage to be resolved using the resolution rules in the template found, to obtain a parsing result. Because the invention can complete the configuration of resolution rules without restarting the online program, it improves work efficiency.
Description
Technical field
The present invention relates to the field of computer technology, and more specifically to a web analysis method, apparatus, storage medium, processor and equipment.
Background art
Web analysis refers to analyzing web page source code and extracting the information that is actually wanted. Web analysis technology is a very important link in search engine development.
Webpages from different websites and with different layouts generally correspond to different resolution rules. To parse webpages from different websites and with different layouts on the same platform, the web analysis method currently taken is as follows: when parsing each webpage, the resolution rules corresponding to that webpage must first be configured; only then can the webpage be parsed using those resolution rules, and parsing of the next webpage begins only after the current webpage has been parsed. Each time a new resolution rule is configured, the new rule must first be written, and the online program must then be restarted before the newly written resolution rule takes effect.
However, because the online program must be restarted every time in order to complete the configuration of a new resolution rule, when many resolution rules need to be newly configured, restarting the online program for each of them inevitably reduces work efficiency.
Summary of the invention
In view of the above problems, the present invention is proposed in order to provide a web analysis method, apparatus, storage medium, processor and equipment that overcome the above problems or at least partially solve them. The scheme is as follows:
A web analysis method, comprising:
obtaining a web analysis request, wherein the web analysis request carries the uniform resource locator (URL) of the webpage to be resolved and the business scenario in which the webpage to be resolved is parsed;
searching the pre-configured templates for the template that matches both the business scenario and the URL, wherein the template content of the template includes resolution rules, and different templates have different resolution rules;
parsing the webpage to be resolved using the resolution rules in the template found, to obtain a parsing result.
Optionally, before searching the pre-configured templates for the template that matches both the business scenario and the URL, the web analysis method further includes: configuring each template in a database in advance in a unified preset storage format, wherein the storage format of each template in the database is a column storage format that supports nested structures, the storage columns are divided into domain name, business scenario and template object, and the template object specifically includes the URL regular-expression matching rule of the template and the template content.
Wherein searching the pre-configured templates for the template that matches both the business scenario and the URL comprises:
using the domain name in the URL as a keyword, retrieving the templates in the database and screening out the templates corresponding to the domain name in the URL;
using the business scenario as a keyword, performing a second retrieval on the templates screened out for the domain name in the URL, and screening out the templates corresponding to the business scenario;
matching the URL against the URL regular-expression matching rules of the templates screened out for the business scenario, and finding the template that matches successfully.
Optionally, after each template is configured in the database in advance in the unified preset storage format, the web analysis method further includes:
creating a cache pool locally in advance and starting a background thread on the back end, wherein the background thread is used to periodically update the templates in the database into the cache pool.
Optionally, the template content further includes a call instruction;
correspondingly, after the webpage to be resolved is parsed using the resolution rules in the template found and the parsing result is obtained, the web analysis method further includes:
calling a pre-configured public resolution component according to the call instruction in the template found, and using the public resolution component to process the fields in the parsing result that need secondary parsing, to obtain a secondary parsing result;
wherein a public resolution component is a resolver, and different public resolution components have different parsing capabilities.
A web analysis device, comprising:
a pretreatment unit, configured to pre-configure each template, wherein the template content of the template includes resolution rules, and different templates have different resolution rules;
an acquiring unit, configured to obtain a web analysis request, wherein the web analysis request carries the URL of the webpage to be resolved and the business scenario in which the webpage to be resolved is parsed;
a searching unit, configured to search the pre-configured templates for the template that matches both the business scenario and the URL;
a first resolution unit, configured to parse the webpage to be resolved using the resolution rules in the template found by the searching unit, to obtain a parsing result.
Optionally, the template content further includes a call instruction;
correspondingly, the web analysis device further includes a second resolution unit, configured to call a pre-configured public resolution component according to the call instruction in the template found, and to use the public resolution component to process the fields in the parsing result that need secondary parsing, to obtain a secondary parsing result;
wherein a public resolution component is a resolver, and different public resolution components have different parsing capabilities.
A storage medium on which a program is stored, wherein the program, when executed by a processor, implements any of the web analysis methods disclosed above.
A processor for running a program, wherein the program, when run, executes any of the web analysis methods disclosed above.
An equipment comprising a processor, a memory, and a program stored on the memory and runnable on the processor, wherein the processor implements any of the web analysis methods disclosed above when executing the program.
With the above technical scheme, the web analysis method, apparatus, storage medium, processor and equipment provided by the present invention can pre-configure each resolution rule. Then, when webpages from different websites and with different layouts are parsed on the same platform, the resolution rules matching each webpage can be retrieved directly from the pre-configured resolution rules and used to parse that webpage. Because the present invention can complete the configuration of the matching resolution rules without restarting the online program, it improves work efficiency.
The above is only an overview of the technical scheme of the present invention. So that the technical means of the present invention can be understood more clearly and implemented in accordance with the contents of the specification, and so that the above and other objects, features and advantages of the present invention are more readily apparent, specific embodiments of the present invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered a limitation of the present invention. Throughout the drawings, the same reference numbers refer to the same parts. In the drawings:
Fig. 1 shows a flow chart of a web analysis method provided by an embodiment of the present invention;
Fig. 2 shows a flow chart of another web analysis method provided by an embodiment of the present invention;
Fig. 3 shows a flow chart of yet another web analysis method provided by an embodiment of the present invention;
Fig. 4 shows a structural schematic diagram of a web analysis device provided by an embodiment of the present invention;
Fig. 5 shows a structural schematic diagram of another web analysis device provided by an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the disclosure are shown in the drawings, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
As shown in Fig. 1, a web analysis method provided by an embodiment of the present invention comprises:
Step S01: obtaining a web analysis request, wherein the web analysis request carries the URL (Uniform Resource Locator) of the webpage to be resolved and the business scenario in which the webpage to be resolved is parsed.
Specifically, a URL is a concise representation of the location of a resource available on the internet and of the method for accessing it. It is the address of a standard resource on the internet, commonly called a "web address". A URL contains information such as the domain name.
For the same webpage, the content that needs to be parsed differs under different business scenarios. For example, when parsing a Baidu search page, business scenario A requires parsing the advertisement content recommended by Baidu, whereas business scenario B requires parsing the ordinary search results. In the web analysis request, the business scenario carried can be represented by an identifier that uniquely corresponds to that business scenario.
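The request described in step S01 can be pictured with a small sketch. The class name `ParseRequest`, its fields and the scenario identifiers "A" and "B" are assumptions made for illustration only; the description states only that the request carries a URL and a business-scenario identifier.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class ParseRequest:
    """Hypothetical request shape: a URL plus a business-scenario identifier."""
    url: str
    scenario: str  # assumed identifiers, e.g. "A" for ad content, "B" for ordinary results

    @property
    def domain(self) -> str:
        # urlsplit().hostname lowercases the host and drops any port
        return urlsplit(self.url).hostname or ""


# The same page requested under two scenarios yields two distinct requests.
req_ads = ParseRequest("https://www.baidu.com/s?wd=nba", scenario="A")
req_organic = ParseRequest("https://www.baidu.com/s?wd=nba", scenario="B")
```

Keeping the scenario next to the URL is what later allows one page to map to different resolution rules.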
Step S02: searching the pre-configured templates for the template that matches both the business scenario and the URL, wherein the template content of the template includes resolution rules, and different templates have different resolution rules.
Specifically, the URL of the webpage to be resolved and the business scenario in which it is parsed together uniquely determine the resolution rules corresponding to that webpage. On this basis, the present embodiment pre-configures the resolution rules corresponding to webpages from different websites and with different layouts (for example, by configuring these resolution rules in a database in advance in a unified manner), with each set of resolution rules stored in a different template. When the current webpage needs to be parsed, the template corresponding to it can be found directly among the pre-configured templates according to the URL of the current webpage and the business scenario in which it is parsed, and the resolution rules in that template are obtained.
During preprocessing, in order to configure the templates in the database in a unified manner, the resolution rules of a template need to be defined and the meaning of each node name determined. For example, the resolution rules of a template can be defined as string rules in Key-Value form in JSON (JavaScript Object Notation) format, where the Key indicates the field type and the Value is the XPath (XML Path Language) value used when parsing the webpage. Each node Key in the JSON mainly takes one of two types: the node attribute value type and the service field type. The node attribute value type mainly indicates the level of the current node and marks parsed content that needs special processing, whereas the service field type is closely related to each specific field of the webpage to be resolved. The two types complement each other, and both are indispensable.
With the resolution rules of a template and the meaning of each node name defined as in the example above, an application example is given below.
Taking the case where the webpage to be resolved is a video webpage, the resolution rules in the corresponding template may be as in Example 1-1 below.
Example 1-1
In Example 1-1, the node names have the following meanings: the xpath field is of the node attribute value type; Videos, Title and ViewCount are specific service field types; and the matched result value of Videos is the content, in array form, that satisfies the XPath.
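To make the Key-Value convention concrete, the following is a hypothetical sketch and not a reproduction of Example 1-1. The template JSON, the page markup and the function `apply_template` are all assumptions; only the field names xpath, Videos, Title and ViewCount come from the description. The standard-library `xml.etree.ElementTree` is used as a stand-in because it supports a limited XPath subset on well-formed markup; a production system would need an HTML-tolerant XPath engine.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical template: "xpath" is a node attribute key, the others are service fields.
TEMPLATE_JSON = """
{
  "Videos": {
    "xpath": ".//div[@class='video']",
    "Title": ".//h3",
    "ViewCount": ".//span[@class='views']"
  }
}
"""

# Hypothetical, well-formed page fragment standing in for a video webpage.
PAGE = """
<html><body>
  <div class="video"><h3> NBA All-Star Game </h3><span class="views">23 times</span></div>
  <div class="video"><h3> Badminton World Championships </h3><span class="views">45 times</span></div>
</body></html>
"""


def apply_template(template: dict, root: ET.Element) -> dict:
    """Resolve each service field; a nested rule with "xpath" yields an array."""
    result = {}
    for field_name, rule in template.items():
        if isinstance(rule, dict):
            # Select the repeated nodes, then resolve nested fields relative to each one.
            nodes = root.findall(rule["xpath"])
            children = {k: v for k, v in rule.items() if k != "xpath"}
            result[field_name] = [apply_template(children, node) for node in nodes]
        else:
            node = root.find(rule)
            result[field_name] = node.text if node is not None else None
    return result


parsed = apply_template(json.loads(TEMPLATE_JSON), ET.fromstring(PAGE))
```

In this sketch the raw values keep their surrounding spaces and the "times" suffix, which is the kind of output that a later secondary-parsing step would clean up.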
In addition, when the templates are configured in the database, the storage format of each template in the database also needs to be determined. To make it convenient to manage the templates in the database quickly and to find a template quickly, the storage format of each template in the database can be a column storage format that supports nested structures, with the storage divided into three columns: domain name, business scenario and template object, where the template object specifically includes the URL regular-expression matching rule of the template and the template content. That is, the storage format of each template in the database is Dictionary<string, Dictionary<string, List<Template>>>, which reads as Dictionary<domain name, Dictionary<business scenario, List<template object>>>, where the template object specifically includes the URL regular-expression matching rule of the template and the template content.
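The nested storage format can be mirrored in Python as a dictionary of dictionaries of lists. This is a minimal sketch under stated assumptions: the `Template` dataclass, the `add_template` helper, the domain `video.example.com` and the scenario names are all hypothetical and serve only to show the three levels (domain name, business scenario, template object).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Template:
    url_pattern: str                              # URL regular-expression matching rule
    content: dict = field(default_factory=dict)   # resolution rules and any call instruction


# domain name -> business scenario -> list of template objects
TemplateStore = Dict[str, Dict[str, List[Template]]]


def add_template(store: TemplateStore, domain: str, scenario: str, template: Template) -> None:
    """Insert a template, creating the nested levels on first use."""
    store.setdefault(domain, {}).setdefault(scenario, []).append(template)


store: TemplateStore = {}
add_template(store, "video.example.com", "sports",
             Template(r"^https?://video\.example\.com/list/\d+$", {"Videos": {"xpath": ".//div"}}))
add_template(store, "video.example.com", "sports",
             Template(r"^https?://video\.example\.com/play/\w+$"))
add_template(store, "video.example.com", "ads",
             Template(r"^https?://video\.example\.com/"))
```

Keying first by domain name and then by business scenario means that regular-expression matching only ever runs over a short list.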
Based on this storage format, when searching the database for the template that matches both the business scenario and the URL, and in order to find the template quickly, step S02 specifically includes, as shown in Fig. 2:
Step S021: using the domain name in the URL carried in the web analysis request as a keyword, retrieving the templates in the database and screening out the templates corresponding to the domain name in the URL;
Step S022: using the business scenario carried in the web analysis request as a keyword, performing a second retrieval on the templates screened out for the domain name in the URL, and screening out the templates corresponding to the business scenario;
Step S023: matching the URL against the URL regular-expression matching rules of the templates screened out for the business scenario, and finding the template that matches successfully.
In brief, the lookup shown in Fig. 2 uses, in turn, the domain name of the webpage to be resolved, the business scenario in which it is parsed, and its URL. The template set under the domain name is first screened out from the database; the template set under the business scenario is then screened out from that set; and finally the uniquely matching template is screened out from the second set according to the URL. This lookup reduces the amount of URL regular-expression matching that has to be computed and so achieves the goal of finding the template quickly.
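Steps S021 to S023 can be sketched as two dictionary lookups followed by a short regular-expression scan. The names `Template`, `find_template` and `STORE`, as well as the example domain and patterns, are hypothetical; the use of `re.search` assumes that each stored pattern carries its own anchors.

```python
import re
from typing import Dict, List, NamedTuple, Optional
from urllib.parse import urlsplit


class Template(NamedTuple):
    url_pattern: str   # URL regular-expression matching rule
    content: dict      # template content


# Hypothetical store: domain name -> business scenario -> template objects
STORE: Dict[str, Dict[str, List[Template]]] = {
    "video.example.com": {
        "sports": [
            Template(r"^https?://video\.example\.com/list/\d+$", {"name": "list page"}),
            Template(r"^https?://video\.example\.com/play/\w+$", {"name": "play page"}),
        ],
        "ads": [Template(r"^https?://video\.example\.com/", {"name": "ad slots"})],
    }
}


def find_template(store, url: str, scenario: str) -> Optional[Template]:
    domain = urlsplit(url).hostname or ""
    by_scenario = store.get(domain, {})          # S021: screen by domain name
    candidates = by_scenario.get(scenario, [])   # S022: screen by business scenario
    for template in candidates:                  # S023: regex match over the short list
        if re.search(template.url_pattern, url):
            return template
    return None
```

For instance, `find_template(STORE, "https://video.example.com/play/abc123", "sports")` returns the play-page template, while the same URL under the "ads" scenario returns the ad-slot template.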
To further increase the speed of template lookup on the basis of the lookup shown in Fig. 2, a cache pool can also be created locally in advance and a background thread started on the back end, the background thread being used to periodically (for example, every minute) update the templates in the database into the cache pool. The benefit of the cache pool is that it is stored locally, so lookup is fast. The background thread is started in order to keep the templates in the cache pool up to date. Specifically, the templates are stored in the database, while the templates used by external calls are those in the cache pool; if a template in the database is modified, the change must also be propagated dynamically to the cache pool. The corresponding operation is to start a background thread on the back end and periodically (for example, every minute) update the templates in the database into the cache pool. The storage format of each template in the cache pool is likewise Dictionary<string, Dictionary<string, List<Template>>>.
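The cache pool and its background refresh thread might be sketched as follows. The class `TemplateCache`, the `load_all` callback standing in for a database read, and the 60-second default are assumptions; the description specifies only a local cache that a background thread refreshes periodically, for example every minute.

```python
import threading
from typing import Callable, Dict


class TemplateCache:
    """Local cache pool refreshed from the database by a background thread."""

    def __init__(self, load_all: Callable[[], Dict], interval_seconds: float = 60.0):
        self._load_all = load_all            # hypothetical callback that reads all templates
        self._interval = interval_seconds
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._templates = load_all()         # initial fill so lookups work immediately
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def refresh(self) -> None:
        fresh = self._load_all()             # read outside the lock to keep lookups fast
        with self._lock:
            self._templates = fresh          # swap in the new snapshot

    def snapshot(self) -> Dict:
        with self._lock:
            return self._templates

    def _run(self) -> None:
        # Event.wait returns False on timeout and True once stop() has been called.
        while not self._stop.wait(self._interval):
            self.refresh()
```

Swapping in a whole new snapshot, rather than mutating the cached dictionary in place, keeps readers from seeing a half-updated template set; this is a design choice of the sketch, not something the description prescribes.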
Step S03: parsing the webpage to be resolved using the resolution rules in the template found, to obtain a parsing result.
Taking Example 1-1 as an example, an existing sports-program video webpage is parsed using Example 1-1, and the output parsing result is as shown in Example 1-2.
Example 1-2
After the webpage to be resolved is parsed using the resolution rules in the template found, the parsing result is fed back directly to the caller. The web analysis method disclosed in the present embodiment is a stateless service; that is, it does not change when the caller changes.
From the above description of the present embodiment, it can be seen that the web analysis method provided by the present embodiment can pre-configure each resolution rule. Then, when webpages from different websites and with different layouts are parsed on the same platform, the resolution rules matching each webpage can be retrieved directly from the pre-configured resolution rules and used to parse that webpage, without restarting the online program to complete the configuration of the matching resolution rules, which improves work efficiency.
It should further be noted that both the web analysis method used in the prior art and the web analysis method disclosed above in the embodiment of the present invention encounter the same problem in use: after a webpage is parsed using resolution rules, some of the parsed fields may not be the content ultimately wanted and must undergo special processing before the final desired parsing result is obtained. To achieve this, the prior art hard-codes, each time a webpage has been parsed using resolution rules, the processing of the fields in the parsing result that need special treatment. In other words, each field that needs special processing requires its own custom-written resolver before the final desired parsing result can be obtained, so the programming workload is too large. To obtain the final desired parsing result while avoiding an excessive programming workload, the embodiment of the present invention proposes another web analysis method on the basis of the web analysis method disclosed above. As shown in Fig. 3, it specifically includes:
Step S01: obtaining a web analysis request, wherein the web analysis request carries the uniform resource locator (URL) of the webpage to be resolved and the business scenario in which the webpage to be resolved is parsed.
Step S02: searching the pre-configured templates for the template that matches both the business scenario and the URL, wherein the template content of the template includes resolution rules and a call instruction, and different templates have different resolution rules.
Step S03: parsing the webpage to be resolved using the resolution rules in the template found, to obtain a parsing result.
Step S04: calling a pre-configured public resolution component according to the call instruction in the template found, and using the public resolution component to process the fields in the parsing result that need secondary parsing, to obtain a secondary parsing result. A public resolution component is a resolver, and different public resolution components have different parsing capabilities. The public resolution components can be configured in the same database as the templates.
The web analysis method shown in Fig. 3 builds on the web analysis method disclosed above. The improvement is that the template content further includes a call instruction: each time a webpage has been parsed using resolution rules, the corresponding public resolution component is also called to process each field in the parsing result that needs special processing, so as to obtain the final desired parsing result. These public resolution components are pre-configured by the present embodiment, and any public resolution component can be called by multiple templates rather than being dedicated to a particular field in the parsing result of a particular webpage, which greatly reduces the programming workload.
For example, in the parsing result shown in Example 1-2 above, "NBA All-Star Game" and "Badminton World Championships" contain spaces, whereas the desired parsing result is a field without spaces. For this, the present embodiment pre-defines a public resolution component whose public capability is TrimeTransfomation, that is, removing spaces from the parsed field. As another example, for "23 times" and "45 times" in the parsing result shown in Example 1-2, a public resolution component is pre-defined whose public capability is IntegerExtractTransformation, which performs regular-expression matching for numeric values so as to select "23" and "45" as the final output. The newly obtained template content and the final output are then as shown in Example 1-3 below.
The template content newly obtained:
Final output:
Example 1-3
Clearly, the output in Example 1-3 is exactly the parsing result that was ultimately wanted.
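The secondary-parsing step can be sketched as a registry of reusable transformations that a template's call instruction refers to by name. The component names TrimeTransfomation and IntegerExtractTransformation are taken from the description (spelling kept as given); everything else is an assumption, including the registry `COMPONENTS`, the shape of the call instruction (a mapping from field name to component names), and the choice to strip only leading and trailing whitespace.

```python
import re
from typing import Callable, Dict, List


def trim_transformation(value: str) -> str:
    # Assumption: "removing spaces" means stripping leading and trailing whitespace.
    return value.strip()


def integer_extract_transformation(value: str) -> str:
    # Pick the first run of digits; leave the value unchanged if there is none.
    match = re.search(r"\d+", value)
    return match.group(0) if match else value


# Public resolution components, shared by all templates and looked up by name.
COMPONENTS: Dict[str, Callable[[str], str]] = {
    "TrimeTransfomation": trim_transformation,
    "IntegerExtractTransformation": integer_extract_transformation,
}


def apply_call_instructions(record: dict, instructions: Dict[str, List[str]]) -> dict:
    """Run the named components, in order, on each field listed in the instructions."""
    out = dict(record)
    for field_name, component_names in instructions.items():
        if field_name not in out:
            continue
        for name in component_names:
            out[field_name] = COMPONENTS[name](out[field_name])
    return out


raw = {"Title": " NBA All-Star Game ", "ViewCount": "23 times"}
calls = {"Title": ["TrimeTransfomation"], "ViewCount": ["IntegerExtractTransformation"]}
cleaned = apply_call_instructions(raw, calls)
```

Because the components are looked up by name, a new template can reuse them through configuration alone, which is the workload saving the description attributes to public resolution components.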
Corresponding to the above method embodiments, the present invention also provides a web analysis device.
As shown in Fig. 4, a web analysis device provided by an embodiment of the present invention comprises:
a pretreatment unit 100, configured to pre-configure each template, wherein the template content of the template includes resolution rules, and different templates have different resolution rules;
an acquiring unit 200, configured to obtain a web analysis request, wherein the web analysis request carries the URL of the webpage to be resolved and the business scenario in which the webpage to be resolved is parsed;
a searching unit 300, configured to search the pre-configured templates for the template that matches both the business scenario and the URL;
a first resolution unit 400, configured to parse the webpage to be resolved using the resolution rules in the template found by the searching unit 300, to obtain a parsing result.
Optionally, the pretreatment unit 100 is specifically configured to configure each template in a database in advance in a unified preset storage format, wherein the storage format of each template in the database is a column storage format that supports nested structures, the storage columns are divided into domain name, business scenario and template object, and the template object specifically includes the URL regular-expression matching rule of the template and the template content.
Optionally, the searching unit 300 is specifically configured to: use the domain name in the URL as a keyword to retrieve the templates in the database and screen out the templates corresponding to the domain name in the URL; then use the business scenario as a keyword to perform a second retrieval on the templates screened out for the domain name in the URL and screen out the templates corresponding to the business scenario; and then match the URL against the URL regular-expression matching rules of the templates screened out for the business scenario and find the template that matches successfully.
Optionally, the pretreatment unit 100 is further configured to create a cache pool locally in advance and to start a background thread on the back end, the background thread being used to periodically update the templates in the database into the cache pool.
Optionally, the template content further includes a call instruction. Correspondingly, as shown in Fig. 5, the web analysis device further includes a second resolution unit 500, configured to call a pre-configured public resolution component according to the call instruction in the template found, and to use the public resolution component to process the fields in the parsing result that need secondary parsing, to obtain a secondary parsing result. A public resolution component is a resolver, and different public resolution components have different parsing capabilities.
The web analysis device includes a processor and a memory. The pretreatment unit 100, acquiring unit 200, searching unit 300, first resolution unit 400, second resolution unit 500 and so on are stored in the memory as program units, and the processor executes these program units stored in the memory to realize the corresponding functions.
The processor contains a kernel, which retrieves the corresponding program unit from the memory. One or more kernels may be provided, and web analysis is realized by adjusting kernel parameters.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip.
Because the above device embodiment essentially corresponds to the method embodiments, it is described relatively briefly; for related details, refer to the description of the method embodiments.
An embodiment of the present invention provides a storage medium on which a program is stored, wherein the program, when executed by a processor, implements any of the web analysis methods disclosed above.
An embodiment of the present invention provides a processor for running a program, wherein the program, when run, executes any of the web analysis methods disclosed above.
An embodiment of the present invention provides an equipment comprising a processor, a memory, and a program stored on the memory and runnable on the processor, wherein the processor performs the following steps when executing the program:
obtaining a web analysis request, wherein the web analysis request carries the uniform resource locator (URL) of the webpage to be resolved and the business scenario in which the webpage to be resolved is parsed;
searching the pre-configured templates for the template that matches both the business scenario and the URL, wherein the template content of the template includes resolution rules, and different templates have different resolution rules;
parsing the webpage to be resolved using the resolution rules in the template found, to obtain a parsing result.
Optionally, before searching the pre-configured templates for a template that matches both the business scenario and the URL, the web page parsing method further includes: configuring the templates in a database in advance in a unified preset storage format, wherein the templates are stored in the database in a columnar storage format that supports nested structures, the stored columns are domain name, business scenario, and template object, and the template object comprises the template's URL regular-expression matching rule and the template content.
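The shape of one stored template can be illustrated with nested dataclasses. This sketch only shows the three columns and the nested template object; it serializes to JSON as a stand-in and does not use an actual columnar format. The class and field names are hypothetical.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class TemplateObject:
    url_regex: str   # URL regular-expression matching rule of the template
    content: dict    # template content, e.g. the parsing rules


@dataclass
class TemplateRow:
    domain: str                # column 1: domain name
    scenario: str              # column 2: business scenario
    template: TemplateObject   # column 3: nested template object


row = TemplateRow(
    domain="shop.example.com",
    scenario="price_monitor",
    template=TemplateObject(
        url_regex=r"https://shop\.example\.com/item/\d+",
        content={"rules": {"title": "<h1>(.*?)</h1>"}},
    ),
)

# asdict() recurses into nested dataclasses, producing a nested record
record = asdict(row)
serialized = json.dumps(record)
```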
Optionally, searching the pre-configured templates for a template that matches both the business scenario and the URL specifically includes:
Using the domain name in the URL as the key, searching the templates in the database and selecting the templates corresponding to that domain name;
Using the business scenario as the key, performing a second search over the templates selected for the domain name and selecting the templates corresponding to the business scenario;
Matching the URL against the URL regular-expression matching rules of the templates selected for the business scenario, and finding the template that matches successfully.
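The three-stage lookup (domain name, then business scenario, then URL regular expression) can be sketched as below. The in-memory dictionary `TEMPLATE_DB` and the function `lookup_template` are hypothetical stand-ins for the database described in the text, and whole-URL matching with `re.fullmatch` is an assumption.

```python
import re
from urllib.parse import urlsplit

# Hypothetical index: domain -> scenario -> list of (url_regex, template_content)
TEMPLATE_DB = {
    "news.example.com": {
        "sentiment": [
            (r"https?://news\.example\.com/article/\d+", {"name": "article"}),
            (r"https?://news\.example\.com/live/.*", {"name": "live"}),
        ],
        "archive": [
            (r"https?://news\.example\.com/.*", {"name": "generic"}),
        ],
    },
}


def lookup_template(db, url, scenario):
    # Stage 1: narrow by the domain name taken from the URL
    domain = urlsplit(url).hostname
    by_domain = db.get(domain, {})
    # Stage 2: narrow by business scenario within that domain
    candidates = by_domain.get(scenario, [])
    # Stage 3: regex-match the URL against the remaining candidates
    for url_regex, content in candidates:
        if re.fullmatch(url_regex, url):
            return content
    return None
```

The two keyed stages shrink the candidate set cheaply, so the comparatively expensive regular-expression matching runs only against templates for one domain and one scenario.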
Optionally, after configuring the templates in the database in advance in the unified preset storage format, the method further includes: creating a cache pool locally in advance and starting a background thread on the back end, the background thread being used to periodically update the templates in the database into the cache pool.
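A local cache pool refreshed by a background thread might look like the following sketch. The class `TemplateCache` and the `load_templates` callable (which stands in for reading all templates from the database) are hypothetical; the refresh interval and the dictionary-shaped pool are assumptions.

```python
import threading


class TemplateCache:
    def __init__(self, load_templates, interval_seconds):
        self._load = load_templates        # hypothetical callable returning a dict of templates
        self._interval = interval_seconds
        self._lock = threading.Lock()
        self._pool = {}
        self._stop = threading.Event()
        self.refresh()                     # fill the pool once before serving lookups
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def refresh(self):
        fresh = self._load()               # read outside the lock so lookups are not blocked
        with self._lock:
            self._pool = fresh

    def _run(self):
        # Event.wait returns False on timeout and True once stop() has been called
        while not self._stop.wait(self._interval):
            self.refresh()

    def get(self, key):
        with self._lock:
            return self._pool.get(key)

    def stop(self):
        self._stop.set()
        self._thread.join()
```

Lookups are then served from local memory, and a template edited in the database becomes visible after at most one refresh interval without restarting the service.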
Optionally, the template content further includes a call instruction;
Correspondingly, after parsing the web page to be parsed using the parsing rules in the template found and obtaining the parsing result, the method further includes:
Calling a pre-configured public parsing component according to the call instruction in the template found, and using the public parsing component to process the fields in the parsing result that require secondary parsing, to obtain a secondary parsing result;
Wherein the public parsing component is a parser, and different public parsing components have different parsing capabilities.
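Secondary parsing through shared components can be sketched as a registry of small parsers selected by the template's call instructions. The instruction format (a mapping from field name to component name), the registry `PUBLIC_PARSERS`, and the two example components are all assumptions made for illustration.

```python
from datetime import datetime


def parse_date(text):
    """Normalize a 'YYYY/MM/DD' string to ISO format."""
    return datetime.strptime(text.strip(), "%Y/%m/%d").date().isoformat()


def parse_number(text):
    """Convert a number with thousands separators to a float."""
    return float(text.replace(",", ""))


# Hypothetical registry of public parsing components, each with a different capability
PUBLIC_PARSERS = {"date": parse_date, "number": parse_number}


def apply_call_instructions(result, instructions):
    """Run the named component on each field that needs secondary parsing."""
    secondary = dict(result)   # leave the first-pass result untouched
    for field_name, component in instructions.items():
        if secondary.get(field_name) is not None:
            secondary[field_name] = PUBLIC_PARSERS[component](secondary[field_name])
    return secondary


first_pass = {"title": "Widget", "published": "2017/08/29", "views": "1,234"}
second_pass = apply_call_instructions(first_pass, {"published": "date", "views": "number"})
```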
The device herein may be a server, a PC, a PAD, a mobile phone, or the like.
An embodiment of the present invention further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:
Obtaining a web page parsing request, wherein the request carries the uniform resource locator (URL) of the web page to be parsed and the business scenario in which the web page is to be parsed;
Searching the pre-configured templates for a template that matches both the business scenario and the URL, wherein the template content of each template includes parsing rules, and different templates have different parsing rules;
Parsing the web page to be parsed using the parsing rules in the template found, to obtain a parsing result.
Optionally, before searching the pre-configured templates for a template that matches both the business scenario and the URL, the web page parsing method further includes: configuring the templates in a database in advance in a unified preset storage format, wherein the templates are stored in the database in a columnar storage format that supports nested structures, the stored columns are domain name, business scenario, and template object, and the template object comprises the template's URL regular-expression matching rule and the template content.
Optionally, searching the pre-configured templates for a template that matches both the business scenario and the URL specifically includes:
Using the domain name in the URL as the key, searching the templates in the database and selecting the templates corresponding to that domain name;
Using the business scenario as the key, performing a second search over the templates selected for the domain name and selecting the templates corresponding to the business scenario;
Matching the URL against the URL regular-expression matching rules of the templates selected for the business scenario, and finding the template that matches successfully.
Optionally, after configuring the templates in the database in advance in the unified preset storage format, the method further includes: creating a cache pool locally in advance and starting a background thread on the back end, the background thread being used to periodically update the templates in the database into the cache pool.
Optionally, the template content further includes a call instruction;
Correspondingly, after parsing the web page to be parsed using the parsing rules in the template found and obtaining the parsing result, the method further includes:
Calling a pre-configured public parsing component according to the call instruction in the template found, and using the public parsing component to process the fields in the parsing result that require secondary parsing, to obtain a secondary parsing result;
Wherein the public parsing component is a parser, and different public parsing components have different parsing capabilities.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, article, or device that includes the element.
The above are merely embodiments of the present application and are not intended to limit the present application. For those skilled in the art, various modifications and changes may be made to the present application. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.
Claims (10)
1. A web page parsing method, characterized by comprising:
Obtaining a web page parsing request, wherein the request carries the uniform resource locator (URL) of the web page to be parsed and the business scenario in which the web page is to be parsed;
Searching the pre-configured templates for a template that matches both the business scenario and the URL, wherein the template content of each template includes parsing rules, and different templates have different parsing rules;
Parsing the web page to be parsed using the parsing rules in the template found, to obtain a parsing result.
2. The web page parsing method according to claim 1, characterized in that, before searching the pre-configured templates for a template that matches both the business scenario and the URL, the web page parsing method further comprises: configuring the templates in a database in advance in a unified preset storage format, wherein the templates are stored in the database in a columnar storage format that supports nested structures, the stored columns are domain name, business scenario, and template object, and the template object comprises the template's URL regular-expression matching rule and the template content.
3. The web page parsing method according to claim 2, characterized in that searching the pre-configured templates for a template that matches both the business scenario and the URL comprises:
Using the domain name in the URL as the key, searching the templates in the database and selecting the templates corresponding to that domain name;
Using the business scenario as the key, performing a second search over the templates selected for the domain name and selecting the templates corresponding to the business scenario;
Matching the URL against the URL regular-expression matching rules of the templates selected for the business scenario, and finding the template that matches successfully.
4. The web page parsing method according to claim 2, characterized in that, after configuring the templates in the database in advance in the unified preset storage format, the web page parsing method further comprises:
Creating a cache pool locally in advance and starting a background thread on the back end, the background thread being used to periodically update the templates in the database into the cache pool.
5. The web page parsing method according to any one of claims 1 to 4, characterized in that the template content further includes a call instruction;
Correspondingly, after parsing the web page to be parsed using the parsing rules in the template found and obtaining the parsing result, the web page parsing method further comprises:
Calling a pre-configured public parsing component according to the call instruction in the template found, and using the public parsing component to process the fields in the parsing result that require secondary parsing, to obtain a secondary parsing result;
Wherein the public parsing component is a parser, and different public parsing components have different parsing capabilities.
6. A web page parsing apparatus, characterized by comprising:
A preprocessing unit, configured to pre-configure the templates, wherein the template content of each template includes parsing rules, and different templates have different parsing rules;
An obtaining unit, configured to obtain a web page parsing request, wherein the request carries the URL of the web page to be parsed and the business scenario in which the web page is to be parsed;
A searching unit, configured to search the pre-configured templates for a template that matches both the business scenario and the URL;
A first parsing unit, configured to parse the web page to be parsed using the parsing rules in the template found by the searching unit, to obtain a parsing result.
7. The web page parsing apparatus according to claim 6, characterized in that the template content further includes a call instruction;
Correspondingly, the web page parsing apparatus further comprises: a second parsing unit, configured to call a pre-configured public parsing component according to the call instruction in the template found, and to use the public parsing component to process the fields in the parsing result that require secondary parsing, to obtain a secondary parsing result;
Wherein the public parsing component is a parser, and different public parsing components have different parsing capabilities.
8. A storage medium on which a program is stored, characterized in that the program, when executed by a processor, implements the web page parsing method according to any one of claims 1 to 5.
9. A processor for running a program, characterized in that the program, when run, performs the web page parsing method according to any one of claims 1 to 5.
10. A device comprising a processor, a memory, and a program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the web page parsing method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710758003.5A CN110020236B (en) | 2017-08-29 | 2017-08-29 | Webpage parsing method, device, storage medium, processor and equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110020236A (en) | 2019-07-16 |
CN110020236B (en) | 2021-11-30 |
Family
ID=67186156
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710758003.5A Active CN110020236B (en) | 2017-08-29 | 2017-08-29 | Webpage parsing method, device, storage medium, processor and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110020236B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111125565A (en) * | 2019-11-01 | 2020-05-08 | 上海掌门科技有限公司 | A method and device for inputting information in an application |
CN112597410A (en) * | 2020-12-10 | 2021-04-02 | 北京明朝万达科技股份有限公司 | Method and device for performing structured extraction on webpage content based on rule configuration library |
CN112947906A (en) * | 2019-11-26 | 2021-06-11 | 贝壳技术有限公司 | Condition analysis method and configuration platform |
CN113867881A (en) * | 2021-10-19 | 2021-12-31 | 创优数字科技(广东)有限公司 | Application home page dynamic display method, device, equipment and medium |
CN114020276A (en) * | 2021-11-05 | 2022-02-08 | 山东库睿科技有限公司 | Data processing method, device, electronic equipment and medium |
CN114218442A (en) * | 2021-12-10 | 2022-03-22 | 北京云迹科技股份有限公司 | A data processing method, system, electronic device and readable storage medium |
CN114692050A (en) * | 2022-03-30 | 2022-07-01 | 北京金堤科技有限公司 | Page parsing method and device, computer readable medium and electronic device |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101916285A (en) * | 2010-08-20 | 2010-12-15 | 北京新岸线网络技术有限公司 | Method and device for analyzing internet web page contents |
CN102254046A (en) * | 2011-08-18 | 2011-11-23 | 深圳市融创天下科技股份有限公司 | Webpage data acquiring method and system |
CN103761330A (en) * | 2014-02-10 | 2014-04-30 | 赛特斯信息科技股份有限公司 | System and method for achieving automatic Internet information extraction based on template configuration |
CN103793461A (en) * | 2013-12-02 | 2014-05-14 | 北京奇虎科技有限公司 | Webpage information analysis method and device |
CN104572874A (en) * | 2014-12-19 | 2015-04-29 | 北京锐安科技有限公司 | Webpage information extraction method and device |
US20160055132A1 (en) * | 2014-08-20 | 2016-02-25 | Vertafore, Inc. | Automated customized web portal template generation systems and methods |
CN105630839A (en) * | 2014-11-07 | 2016-06-01 | 阿里巴巴集团控股有限公司 | Webpage information acquisition method and device |
CN106055585A (en) * | 2016-05-20 | 2016-10-26 | 北京神州绿盟信息安全科技股份有限公司 | Log analysis method and apparatus |
US20170192941A1 (en) * | 2016-01-05 | 2017-07-06 | Quixey, Inc. | Computer-Automated Generation of Application Deep Links |
Non-Patent Citations (4)
Title |
---|
J. HU 等: "《2009 International Conference on Management and Service Science》", 《2009 INTERNATIONAL CONFERENCE ON MANAGEMENT AND SERVICE SCIENCE》 * |
乔峰: "基于模板化网络爬虫技术的Web网页信息抽取", 《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》 * |
李舒晨 等: "网络舆情分析中网页信息预处理方案的实现", 《电脑与电信》 * |
顾韵华 等: ""基于模板和领域本体的Deep Web信息抽取研究"", 《计算机工程与设计》 * |
Also Published As
Publication number | Publication date |
---|---|
CN110020236B (en) | 2021-11-30 |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB02 | Change of applicant information | Address after: 100080 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing. Applicant after: Beijing Guoshuang Technology Co.,Ltd. Address before: 100086 Beijing city Haidian District Shuangyushu Area No. 76 Zhichun Road cuigongfandian 8 layer A. Applicant before: Beijing Guoshuang Technology Co.,Ltd. |
GR01 | Patent grant | |