CN107729564A - A kind of distributed focused web crawler web page crawl method and system - Google Patents
A kind of distributed focused web crawler web page crawl method and system Download PDFInfo
- Publication number
- CN107729564A (application number CN201711113373.XA)
- Authority
- CN
- China
- Prior art keywords
- module
- downloader
- task
- data
- analysis
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Abstract
The invention discloses a distributed focused web crawler method and system for crawling web pages. The method is: 1) obtain the entry links for data capture according to the seed task and generate tasks to be downloaded; 2) obtain the corresponding web page data according to the tasks to be downloaded; 3) extract the set target structured data from the web page data. The parsing scheduling module takes tasks to be parsed out of the to-be-parsed task queue and sends them to the parsing modules through load balancing; the parsing module obtains the matching parsing template according to the information in the parsing task and parses the target structured data out of the obtained web page data. A corresponding parsing template is set for each configured website; the parsing template is an XML-format file composed of several regular expressions, each of which can match one attribute of the target structured data in the web page data. The invention greatly improves data-capture efficiency and has high cohesion and transparency.
Description
Technical Field
The invention belongs to the field of web crawlers, and particularly relates to a distributed focused web crawler method and system for crawling web pages, which can send requests in different modes to different websites and parse data of different structures.
Background
A web crawler uses existing resources to automatically capture large amounts of web page information on the internet, and is sometimes called a "web spider". A focused web crawler, also called a topic crawler, crawls data purposefully, fetching only the pages related to a particular topic. Compared with a general-purpose web crawler, it does not crawl the whole web indiscriminately but crawls selectively, which reduces the number of pages fetched and improves page-update efficiency; its demands on crawl speed and storage space are therefore modest, but it requires a good crawling strategy to decide whether a page or link should be crawled.
The crawling process of a web crawler involves five links: obtaining a request URL, sending a web request to download the page, parsing structured data out of the page, filtering duplicate data, and processing the seed task. Each link consumes resources differently, and a fault in any link affects the efficiency and stability of the whole crawler system. In addition, as internet technology evolves, there are more and more websites and more and more information; much website data is loaded through asynchronous requests, and some websites encrypt data on the server side and decrypt it by executing JS scripts at the front end. A crawler system compatible with diverse websites and diverse data is therefore lacking.
Disclosure of Invention
The invention provides a distributed focused web crawler web page crawling method and system, which support multiple web request modes and make structured-data parsing flexible to configure. The crawling process is modular, with a single function per module, which greatly improves data-capture efficiency and gives the system high cohesion and transparency.
The invention applies distributed technology to the focused web crawler system and modularizes the five processing links, each module having a single function, which improves the working efficiency of the whole system and makes horizontal scaling simpler. The invention also introduces the Kafka message queue to decouple the modules; Kafka's high throughput, high availability, and easy scalability further improve the efficiency and stability of the system.
The invention introduces the Selenium automated-testing tool to drive a browser or headless browser to load data; for encrypted data, a JS engine is called to execute the site's JS scripts and decrypt the data before it is crawled.
The technical scheme adopted by the invention is as follows:
a distributed web crawler web page crawling method comprises the following steps:
1) the task generation module acquires the entry link for data capture according to the seed task and generates a task to be downloaded;
2) the web page download module decides, according to the task to be downloaded, whether to request the page content with a conventional HTTP request, with Selenium driving a browser or headless browser, or by executing JS scripts with a JS engine;
3) the web page parsing module extracts the set target structured data from the raw downloaded page source with regular expressions. For example, when the online reviews of a website are fetched, parsing aims to obtain the set of review records from the review-page source; that set is the finally desired target structured data;
4) the data deduplication module filters out, from the data downloaded this time, records that duplicate earlier downloads, and decides from the repetition rate whether to continue crawling the seed subtask;
5) the seed-task iteration module takes a seed task to be downloaded out of the seed-task queue, generates a task to be downloaded, and repeats steps 1)-5).
Further, step 1) obtains the entry links of the data to be crawled from a database, splices the web page request URL according to the corresponding link-splicing rule, generates the task to be downloaded, and stores it in the to-be-downloaded task queue.
Further, the web page download of step 2) includes a downloader scheduling module and downloader modules.
Furthermore, the downloader scheduling module takes tasks to be downloaded out of the to-be-downloaded task queue, distributes them to the downloader modules, receives completed download tasks, and stores them in the to-be-parsed queue; the downloader module receives a task to be downloaded, sends a web request to download the page data, stores the page data in the MongoDB database, and sends the completed task back to the downloader scheduling module.
Further, the web page parsing of step 3) includes a parsing scheduling module and parsing modules.
Furthermore, the parsing scheduling module takes tasks to be parsed out of the to-be-parsed task queue, sends them to the parsing modules through Nginx load balancing, receives parsed tasks, and stores them in the to-be-deduplicated task queue; the parsing module receives a task to be parsed, obtains the parsing template from Redis according to the information in the task, obtains the downloaded page from MongoDB, parses out the target structured data, and sends it to the parsing scheduling module.
Furthermore, the parsing template is used to extract the target structured data from the downloaded page content; each regular expression matches one attribute of the target structured data in that content. Because the page structures of different websites differ, as does the data captured from them, each website has its own parsing template. In the data-crawling process, the content of the parsing template must therefore be determined from the page structure and from the data structure of the target structured data, after which the page is parsed with the template.
Further, the data deduplication of step 4) includes a deduplication scheduling module and deduplication modules.
Further, the deduplication scheduling module takes tasks to be deduplicated out of the to-be-deduplicated task queue, sends them to the deduplication modules through Nginx load balancing, and receives completed deduplication tasks; it stores each completed task in the deduplication-complete queue, adds the newly downloaded data to the corresponding data-storage queue, and adds the seed task to the seed-task queue. The deduplication module receives a task to be deduplicated, obtains the deduplication template from Redis according to the information in the task, filters out duplicate data, decides whether the seed task should continue downloading, and sends the result to the deduplication scheduling module.
Further, the deduplication template expresses, for each website, the rules for filtering duplicate data and for deciding whether the seed task should continue downloading.
Further, step 5) iterates the seed tasks: seed tasks in the seed-task queue generate tasks to be downloaded, which are stored in the to-be-downloaded task queue.
The invention has the following beneficial effects:
the distributed focused crawler system provided by the invention supports multiple web page request modes, and the structured-data parsing configuration is flexible. The crawling process is modular, with a single function per module, which greatly improves data-capture efficiency and gives the system high cohesion and transparency.
Drawings
FIG. 1 is an architecture diagram of a crawler system according to an embodiment of the present invention.
FIG. 2 is a flow diagram of a crawler system task generation module according to an embodiment of the present invention.
Fig. 3 is an architecture diagram of a web page download module of the crawler system according to an embodiment of the present invention.
Fig. 4 is an architecture diagram of a web page parsing module of the crawler system according to the embodiment of the present invention.
FIG. 5 is an architecture diagram of a crawler system data deduplication module of an embodiment of the present invention.
FIG. 6 is a flow diagram of a crawler system seed task iteration module according to an embodiment of the invention.
Detailed Description
The invention is further illustrated by the following specific examples and the accompanying drawings.
FIG. 1 is an architecture diagram of a crawler system according to an embodiment of the present invention. As shown in fig. 1, the crawler system is divided into five functional modules: a task generation module, a web page download module, a web page parsing module, a data deduplication module, and a seed-task iteration module, which cooperate to complete the crawling work; the modules are decoupled through Kafka message queues. The Kafka message queue is only one storage medium for tasks; those skilled in the art can achieve the same purpose and effect with other media, for example other MQ message queues or the Redis database, but Kafka has significant advantages in throughput, scalability, and availability. The crawler system is also compatible with many websites: different websites have different task-generation rules, page-download rules, page-parsing rules, and data-deduplication rules, and all rules are stored in ZooKeeper. ZooKeeper is a distributed application coordination service providing configuration maintenance, naming, distributed synchronization, group services, and so on. Because the crawler system stores its rules in ZooKeeper, once a website changes and a rule is updated, ZooKeeper immediately pushes the new rule to the crawler system, ensuring the stability of data capture.
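As one possible concretization (the patent does not specify the znode layout, so the paths below are invented for illustration), the rules could be organized in ZooKeeper along these lines:

```
/crawler/rules/task_generation/<site>   # start-URL -> AJAX-URL splicing rule
/crawler/rules/parsing/<site>           # XML parsing template
/crawler/rules/dedup/<site>             # hash-key fields and stop-crawl threshold
/crawler/downloaders/<id>               # ephemeral node per running downloader module
```

Each module would watch its own subtree, so a rule update is pushed to every consumer without restarting the service.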
FIG. 2 is a flow diagram of the task generation module of the crawler system according to an embodiment of the present invention. As shown in fig. 2, when website data needs to be crawled, the task generation module generates tasks to be downloaded from the website's start URL according to the task-generation rule and stores them in the to-be-downloaded task queue. During crawling it was found that some target data is requested asynchronously; to crawl such data, the crawler system must download the page through the asynchronous AJAX URL. The task-generation rule is therefore introduced to convert the start URL of the website to be crawled into the AJAX URL that actually serves the data.
For example, when capturing the reviews of hotel X on a website, the URL of hotel X is http://hotel.com/city/beijing_city/dt-X/, while the review data is loaded asynchronously from http://review.com/api/h/X/detail/v1/page/1; to crawl the review data, the task generation module must convert the hotel URL into the asynchronously loaded AJAX URL. The corresponding task-generation rule and its description (as in Table 1) are as follows:
table 1 is a rule description table
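Since Table 1 is not reproduced in this text, the following is a hedged sketch of what such a task-generation rule could look like: a regular expression over the site's start URL plus a template for the AJAX URL. The rule shape, field names, and function name are invented for illustration, not taken from the patent.

```python
import re

# Illustrative task-generation rule for the hotel-review example:
# a regex that captures the variable parts of the start URL, and a
# template that rebuilds them into the AJAX URL serving the data.
TASK_RULE = {
    "pattern": r"http://hotel\.(?P<domain>[^/]+)/city/\w+/dt-(?P<hotel>\w+)/",
    "ajax_template": "http://review.{domain}/api/h/{hotel}/detail/v1/page/{page}",
}

def make_download_task(start_url, page=1, rule=TASK_RULE):
    """Convert a website start URL into the AJAX URL used by the downloader."""
    m = re.match(rule["pattern"], start_url)
    if m is None:
        return None  # rule does not apply to this URL
    return rule["ajax_template"].format(page=page, **m.groupdict())
```

Applied to the example above, `make_download_task("http://hotel.com/city/beijing_city/dt-X/")` would yield the review AJAX URL for page 1.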
Fig. 3 is an architecture diagram of the web page download module of the crawler system according to an embodiment of the present invention. As shown in fig. 3, the web page download module includes a downloader scheduling module and downloader modules; only one downloader scheduling module exists, while downloader modules can be added freely according to the load of the web page download module. After starting, the downloader scheduling module first collects the available downloader modules from ZooKeeper, takes tasks to be downloaded from the to-be-downloaded task queue, distributes them in turn to the available downloader modules, receives the tasks completed by the downloader modules, and adds them to the to-be-parsed task queue. To prevent web requests from being blocked by a target website during crawling, each downloader module has a maximum concurrency limit per website: when the concurrency of downloader A against a website reaches the limit, the downloader scheduling module distributes further tasks for that website to another downloader module whose concurrency has not reached the limit. This per-website concurrency cap is also how the crawler system load-balances the web page download module.
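The per-website concurrency cap described above can be sketched as follows. This is a minimal illustration; the cap value, class name, and bookkeeping structure are assumptions, since the patent only states that such a limit exists.

```python
from collections import defaultdict

MAX_CONCURRENT_PER_SITE = 2  # illustrative cap per downloader per website

class DownloaderScheduler:
    def __init__(self, downloader_ids):
        self.downloaders = downloader_ids
        # active[(downloader, site)] = number of in-flight requests
        self.active = defaultdict(int)

    def dispatch(self, site):
        """Pick a downloader whose concurrency for this site is under the cap."""
        for d in self.downloaders:
            if self.active[(d, site)] < MAX_CONCURRENT_PER_SITE:
                self.active[(d, site)] += 1
                return d
        return None  # every downloader is saturated for this site

    def finished(self, downloader, site):
        """Release one slot when a download task completes."""
        self.active[(downloader, site)] -= 1
```

With two downloaders "A" and "B" and a cap of 2, the first two tasks for a site go to A, the next two to B, and a fifth task waits until a slot is released.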
After a downloader module starts, it registers its information in ZooKeeper to notify the downloader scheduling module; the scheduling module watches the registered information and distributes tasks to be downloaded to the downloader. The downloader module receives a task to be downloaded, sends the request in the specified mode, downloads the page, stores it in MongoDB, and finally reports the completion status to the downloader scheduler.
In the crawler system of this embodiment, the downloader scheduling module and the downloader modules communicate over Sockets, which greatly reduces network-transmission cost. A downloader module registers its own information in ZooKeeper at startup to notify the downloader scheduling module, and ZooKeeper's heartbeat-detection mechanism is used to monitor the downloader module's running state: once a downloader module stops running, the scheduling module learns of it immediately through ZooKeeper and stops distributing tasks to that module, which guarantees the stability of the system.
The downloader module of this embodiment can, on the one hand, download pages with ordinary HTTP requests, and on the other hand, for less conventional cases, download pages by driving Chrome, Firefox, PhantomJS, or a headless browser through the Selenium automated-testing technology. In addition, it can automatically execute the JS code in a page to generate a request Cookie or Token, encrypt request parameters, or decrypt response data.
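The three request modes just described suggest a simple dispatch inside the downloader. The sketch below shows only the dispatch logic; the mode names, task field, and stub handlers are illustrative stand-ins (a real handler would wrap `requests`, a Selenium-driven browser, or a JS engine respectively).

```python
def download(task, handlers):
    """Route a download task to the handler for its request mode."""
    mode = task.get("mode", "http")  # "http" | "selenium" | "js_engine"
    if mode not in handlers:
        raise ValueError("unsupported download mode: %s" % mode)
    return handlers[mode](task["url"])

# Stub handlers standing in for the real mechanisms named in the text.
handlers = {
    "http": lambda url: "plain GET " + url,           # e.g. requests.get(url).text
    "selenium": lambda url: "browser render " + url,  # e.g. Selenium-driven Chrome/PhantomJS
    "js_engine": lambda url: "decrypted " + url,      # e.g. run site JS, then decode payload
}
```

The task-generation rule would decide which mode each site needs, so the scheduler and queues stay agnostic of how a page is actually fetched.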
Fig. 4 is an architecture diagram of the web page parsing module of the crawler system according to an embodiment of the present invention. As shown in fig. 4, the web page parsing module includes a parsing scheduling module and parsing modules; task scheduling between them is implemented through Nginx load balancing. Only one parsing scheduling module and one Nginx instance need to be deployed, while multiple parsing modules can be deployed according to system pressure. The parsing scheduling module takes tasks to be parsed out of the to-be-parsed task queue, distributes them to the parsing modules through Nginx load balancing, receives the parsing results after the tasks are processed, and adds them to the to-be-deduplicated task queue. When a parsing module starts, it obtains and watches the data-parsing template of each website in ZooKeeper; after receiving a task to be parsed, it fetches the corresponding page content from MongoDB, converts the page content into structured target data with the corresponding parsing template, and returns the data to the parsing scheduling module.
The web page parsing module of the crawler system in this embodiment adopts Nginx for load balancing, which sustains low resource consumption with high performance and high concurrency, is simple to install, easy to configure, has few bugs, is convenient to start, and, most importantly, allows the service version to be upgraded without interrupting service.
In this embodiment, the web page parsing module stores the parsing templates in ZooKeeper, and the parsing modules watch the templates in ZooKeeper in real time, so that after a website changes, the parsing templates in the system can be updated immediately without interrupting service; this guarantees the stability of data capture and greatly reduces the operation and maintenance cost of the system.
Examples and descriptions of the parsing template (as in table 2) are as follows:
table 2 is a rule introduction table
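Since Table 2 is not reproduced in this text, the following is a hedged sketch of how an XML parsing template of regular expressions could look and be applied, consistent with the description that each expression matches one attribute of the target structured data. The element names, field names, and sample page content are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Illustrative parsing template for one site: each <field> holds a regular
# expression whose first capture group yields one attribute of a record.
TEMPLATE_XML = """
<template site="example-hotel">
  <field name="author">"userName":"(.*?)"</field>
  <field name="score">"rating":([0-9.]+)</field>
  <field name="content">"comment":"(.*?)"</field>
</template>
"""

def parse_page(template_xml, page_source):
    """Apply every regex in the template; return one dict of matched attributes."""
    root = ET.fromstring(template_xml)
    record = {}
    for field in root.findall("field"):
        m = re.search(field.text, page_source)
        record[field.get("name")] = m.group(1) if m else None
    return record

# Example input resembling an AJAX review payload:
# '{"userName":"alice","rating":4.5,"comment":"great stay"}'
```

Because the template is plain data rather than code, it can be stored in ZooKeeper and hot-swapped when the site's page structure changes, exactly as the surrounding text describes.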
FIG. 5 is an architecture diagram of the data deduplication module of the crawler system according to an embodiment of the present invention. As shown in fig. 5, the data deduplication module includes a deduplication scheduling module and deduplication modules; task scheduling between them is implemented through Nginx load balancing. Only one deduplication scheduling module and one Nginx instance need to be deployed, while multiple deduplication modules can be deployed according to system pressure. The deduplication scheduling module takes tasks to be deduplicated out of the to-be-deduplicated task queue, distributes them to the deduplication modules through Nginx load balancing, receives the deduplication results after the tasks are processed, splits each result into the completed deduplication task, the data to be stored, and the seed task to be crawled further, and adds them to the deduplication-complete queue, the data-storage queue, and the seed-task queue respectively. When a deduplication module starts, it obtains and watches the data-deduplication rule of each website in ZooKeeper; after receiving a task to be deduplicated, it generates a unique hash value with the corresponding deduplication rule and looks the hash up in the MongoDB deduplication database. If the hash exists, the record has already been crawled and must be filtered out; if not, the record has not been crawled and is returned to the deduplication scheduling module along with the completed deduplication task. In addition, the deduplication module computes the repetition rate between the data crawled this time and the data already in the storage queue to decide whether to continue crawling the seed subtask; if the repetition rate is low, the seed task is returned to the deduplication scheduling module together with the completed deduplication task.
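The hash-based filtering and repetition-rate check can be sketched as follows. This is a minimal in-memory illustration: the rule's field names, the 0.9 threshold, and the use of MD5 are assumptions (the patent names MongoDB as the store for the hashes and does not fix the hash function or threshold).

```python
import hashlib

# Illustrative deduplication rule: which fields identify a record, and
# above what repetition rate the seed task should stop being crawled.
DEDUP_RULE = {"key_fields": ["author", "content"], "stop_if_repeat_rate_above": 0.9}

def dedup(records, rule, seen):
    """Filter already-seen records; report whether crawling should continue."""
    fresh = []
    for rec in records:
        key = "|".join(str(rec[f]) for f in rule["key_fields"])
        h = hashlib.md5(key.encode("utf-8")).hexdigest()
        if h not in seen:       # a MongoDB lookup in the patent's design
            seen.add(h)
            fresh.append(rec)
    repeat_rate = 1 - len(fresh) / len(records) if records else 0.0
    keep_crawling = repeat_rate <= rule["stop_if_repeat_rate_above"]
    return fresh, keep_crawling
```

On a second pass over the same page the repetition rate reaches 1.0, so the module would stop iterating that seed task.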
The data deduplication module of the crawler system in this embodiment adopts Nginx for load balancing and has the same advantages as the parsing module.
The data-deduplication rules of the crawler system are stored in ZooKeeper, and the deduplication modules watch the rules in ZooKeeper in real time, so that when a website changes and the calculation method for its unique data identifier changes with it, the deduplication rules in the system can be updated immediately without interrupting service; this guarantees the stability of data capture and greatly reduces the operation and maintenance cost of the system.
Examples and descriptions of the deduplication rules (as in Table 3) are as follows:
table 3 is a rule introduction table
FIG. 6 is a flow diagram of the seed-task iteration module of the crawler system according to an embodiment of the invention. When seed tasks to be crawled exist in the seed-task queue, the seed-task iteration module creates tasks to be downloaded from the seed tasks according to the task-generation rule and stores them in the to-be-downloaded task queue. The task-generation rule of this process is the same as the one introduced in the task generation module. A seed task can be the URL of the original task's request page, or can be generated by splicing an AJAX URL, for example http://review…
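For the paged AJAX URLs of the hotel-review example, seed iteration amounts to advancing the page counter. The helper below is an illustrative sketch of that splicing step, assuming the `/page/N` suffix from the example URL in the description.

```python
import re

def next_seed(ajax_url):
    """Splice the next seed task's URL by incrementing the trailing page number."""
    m = re.search(r"/page/(\d+)$", ajax_url)
    if m is None:
        return None  # URL does not follow the paged pattern
    return ajax_url[: m.start()] + "/page/%d" % (int(m.group(1)) + 1)
```

The deduplication step's repetition-rate check decides whether this next seed is actually enqueued or the iteration stops.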
In the distributed focused crawler of the invention, the whole data-crawling process is divided into five modules; the processing flow is modular with a single function per module, Kafka message queues decouple the modules, and Nginx load balancing is used during task processing, which greatly improves the efficiency of the crawler system. When downloading data, the system supports ordinary HTTP requests, is compatible with requesting pages through Chrome, Firefox, PhantomJS, or a headless browser, and can automatically execute the JS code in a page to obtain HTTP request parameters. In addition, the parsing templates and deduplication rules are stored in ZooKeeper; the system watches rule changes in real time and can update the data in memory immediately without interrupting service, which guarantees the stability of data capture and greatly reduces the operation and maintenance cost of the system.
The above description covers only preferred embodiments of the present invention and is not intended to limit it; those skilled in the art may make various modifications and changes. Any modification, replacement, or improvement made within the principle of the present invention shall fall within its scope of protection.
Claims (10)
1. A distributed web crawler web page crawling method includes the following steps:
1) the task generation module acquires an entry link for data capture according to the seed task and generates a task to be downloaded;
2) the webpage downloading module acquires corresponding webpage data according to the task to be downloaded;
3) the web page parsing module extracts the set target structured data from the web page data; the web page parsing module comprises a parsing scheduling module and several parsing modules; the parsing scheduling module is used for taking tasks to be parsed out of the to-be-parsed task queue and sending them to the parsing modules through load balancing; the parsing module is used for obtaining the matching parsing template according to the information in the parsing task and parsing the target structured data out of the obtained web page data; a corresponding parsing template is set for each configured website, the parsing template being an XML (eXtensible Markup Language) format file composed of several regular expressions, each of which can match one attribute of the target structured data in the web page data.
2. The method of claim 1, wherein the web page download module comprises a downloader scheduling module and downloader modules; the downloader scheduling module acquires tasks to be downloaded from the task queue to be downloaded according to the number of currently available downloader modules, distributes the tasks to the available downloader modules, receives the tasks downloaded by the downloader modules and adds the tasks to the task queue to be analyzed; the method comprises the steps that a downloader scheduling module sets a maximum concurrency upper limit for each downloader module, and when the downloader module requests that the concurrency number of the same website reaches the maximum concurrency upper limit of the downloader module, the downloader scheduling module distributes other tasks accessing the website to another downloader module of which the concurrency number does not reach the upper limit.
3. The method of claim 2, wherein the downloader module registers its own information in a distributed application coordination service Zookeeper, and the downloader scheduling module acquires a currently available downloader module from the application coordination service Zookeeper; the application program coordination service Zookeeper is used for configuration maintenance, domain name service, distributed synchronization and group service.
4. The method of claim 2 or 3, wherein the downloader scheduling module and the downloader module communicate using a Socket communication mechanism; the method comprises the steps that a downloader scheduling module detects the running state of a downloader module by using a heartbeat detection mechanism in application program coordination service Zookeeper, and when the downloader module stops running, the downloader scheduling module does not distribute tasks to be downloaded to the downloader module any more.
5. The method of claim 1, wherein task scheduling is accomplished between the parsing scheduling module and the parsing module through Nginx load balancing.
6. The method of claim 1, wherein a deduplication module monitors whether the currently acquired target structured data duplicates previously acquired target structured data; if so, the duplicated target structured data is deleted, and whether to continue executing the current seed task is determined according to the repetition rate.
7. A distributed web crawler web page crawling system is characterized by comprising a task generating module, a web page downloading module and a web page analyzing module; wherein,
the task generating module is used for acquiring an entry link for data capture according to the seed task and generating a task to be downloaded;
the webpage downloading module is used for acquiring corresponding webpage data according to the task to be downloaded;
the web page parsing module is used for extracting the set target structured data from the web page data; the web page parsing module comprises a parsing scheduling module and several parsing modules; the parsing scheduling module is used for taking tasks to be parsed out of the to-be-parsed task queue and sending them to the parsing modules through load balancing; the parsing module is used for obtaining the matching parsing template according to the information in the parsing task and parsing the target structured data out of the obtained web page data; a corresponding parsing template is set for each configured website, the parsing template being an XML (eXtensible Markup Language) format file composed of several regular expressions, each of which can match one attribute of the target structured data in the web page data.
8. The system of claim 7, wherein the webpage downloading module comprises a downloader scheduling module and a plurality of downloader modules; the downloader scheduling module acquires tasks to be downloaded from the task queue to be downloaded according to the number of currently available downloader modules, distributes them to the available downloader modules, receives the tasks downloaded by the downloader modules and adds them to the task queue to be analyzed; the downloader scheduling module sets a maximum concurrency upper limit for each downloader module, and when a downloader module's number of concurrent requests to the same website reaches its maximum concurrency upper limit, the downloader scheduling module distributes further tasks accessing that website to another downloader module whose concurrency has not reached the upper limit.
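Claim 8's per-website concurrency cap with overflow to another downloader can be sketched as below; the two-request limit and the greedy first-fit assignment are assumptions, and a real scheduler would pull from the task queue and learn downloader availability via ZooKeeper rather than from a fixed list.

```python
from collections import defaultdict
from urllib.parse import urlparse

class DownloadScheduler:
    """Assigns download tasks to downloaders, capping each downloader's
    concurrent requests per website; overflow goes to a downloader with headroom."""
    def __init__(self, downloaders, max_concurrency=2):
        self.max_concurrency = max_concurrency
        # active[downloader][site] = number of in-flight requests to that site
        self.active = {d: defaultdict(int) for d in downloaders}

    def assign(self, url):
        site = urlparse(url).netloc
        for d, counts in self.active.items():
            if counts[site] < self.max_concurrency:
                counts[site] += 1
                return d
        return None   # every downloader is saturated for this site

    def finish(self, downloader, url):
        """Release a slot once the downloader reports the task complete."""
        self.active[downloader][urlparse(url).netloc] -= 1
```

Capping concurrency per site on each downloader both spreads the crawl load across machines and keeps any single downloader from hammering one website.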
9. The system of claim 8, wherein each downloader module registers its own information in the distributed application coordination service ZooKeeper, and the downloader scheduling module acquires the currently available downloader modules from ZooKeeper; ZooKeeper provides configuration maintenance, domain name service, distributed synchronization and group services; the analysis scheduling module and the analysis modules realize task scheduling through Nginx load balancing.
10. The system of claim 7, further comprising a deduplication module for monitoring whether the currently obtained target structured data duplicates the previously obtained target structured data, and if so, deleting the duplicated target structured data and determining whether to continue executing the current seed task according to the duplication ratio.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711113373.XA CN107729564A (en) | 2017-11-13 | 2017-11-13 | A kind of distributed focused web crawler web page crawl method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107729564A true CN107729564A (en) | 2018-02-23 |
Family
ID=61215144
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711113373.XA Pending CN107729564A (en) | 2017-11-13 | 2017-11-13 | A kind of distributed focused web crawler web page crawl method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107729564A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101114285A (en) * | 2006-07-25 | 2008-01-30 | 腾讯科技(深圳)有限公司 | Internet topics file searching method, reptile system and search engine |
CN101833560A (en) * | 2010-02-02 | 2010-09-15 | 哈尔滨工业大学 | Manufacturer public praise automatic sequencing system based on internet |
CN102314463A (en) * | 2010-07-07 | 2012-01-11 | 北京瑞信在线系统技术有限公司 | Distributed crawler system and webpage data extraction method for the same |
CN103023714A (en) * | 2012-11-21 | 2013-04-03 | 上海交通大学 | Activeness and cluster structure analyzing system and method based on network topics |
CN105045838A (en) * | 2015-07-01 | 2015-11-11 | 华东师范大学 | Network crawler system based on distributed storage system |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110309389A (en) * | 2018-03-14 | 2019-10-08 | 北京嘀嘀无限科技发展有限公司 | Cloud computing system |
CN110874427A (en) * | 2018-09-03 | 2020-03-10 | 菜鸟智能物流控股有限公司 | Webpage information crawling method, device and system and electronic equipment |
CN109284430A (en) * | 2018-09-07 | 2019-01-29 | 杭州艾塔科技有限公司 | Visualization subject web page content based on distributed structure/architecture crawls system and method |
CN109522466A (en) * | 2018-10-20 | 2019-03-26 | 河南工程学院 | A kind of distributed reptile system |
CN111125589B (en) * | 2018-10-31 | 2023-09-05 | 新方正控股发展有限责任公司 | Data acquisition method and device and computer readable storage medium |
CN111125589A (en) * | 2018-10-31 | 2020-05-08 | 北大方正集团有限公司 | Data acquisition method and device and computer readable storage medium |
CN109684051B (en) * | 2018-12-17 | 2020-08-11 | 杭州玳数科技有限公司 | Method and system for asynchronously submitting hybrid big data task |
CN109684051A (en) * | 2018-12-17 | 2019-04-26 | 杭州玳数科技有限公司 | A kind of method and system of the hybrid asynchronous submission of big data task |
CN109471979A (en) * | 2018-12-20 | 2019-03-15 | 北京奇安信科技有限公司 | A kind of method, system, equipment and medium grabbing dynamic page |
CN112417239A (en) * | 2019-08-21 | 2021-02-26 | 京东方科技集团股份有限公司 | Webpage data crawling method and device |
CN110929128A (en) * | 2019-12-11 | 2020-03-27 | 北京启迪区块链科技发展有限公司 | Data crawling method, device, equipment and medium |
CN111241373A (en) * | 2020-02-20 | 2020-06-05 | 山东爱城市网信息技术有限公司 | Webpage crawler system based on micro-service and implementation method |
KR20210040850A (en) * | 2020-04-03 | 2021-04-14 | 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디. | Method, apparatus, device, and storage medium for parsing document |
JP2021120862A (en) * | 2020-04-03 | 2021-08-19 | ベイジン バイドゥ ネットコム サイエンス アンド テクノロジー カンパニー リミテッド | Document analytical method, device, apparatus, and storage medium |
JP7206313B2 (en) | 2020-04-03 | 2023-01-17 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | Document analysis method, device, equipment and storage medium |
KR102674648B1 (en) * | 2020-04-03 | 2024-06-12 | 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디. | Method, apparatus, device, and storage medium for parsing document |
CN112667873A (en) * | 2020-12-16 | 2021-04-16 | 北京华如慧云数据科技有限公司 | Crawler system and method suitable for general data acquisition of most websites |
CN113254751A (en) * | 2021-06-24 | 2021-08-13 | 北森云计算有限公司 | Method, equipment and storage medium for accurately extracting complex webpage structured information |
CN116614551A (en) * | 2023-07-17 | 2023-08-18 | 北京海科融通支付服务有限公司 | Template-based multitasking asynchronous downloading method, system, equipment and medium |
CN116614551B (en) * | 2023-07-17 | 2023-09-22 | 北京海科融通支付服务有限公司 | Template-based multitasking asynchronous downloading method, system, equipment and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107729564A (en) | A kind of distributed focused web crawler web page crawl method and system | |
CN107895009B (en) | Distributed internet data acquisition method and system | |
CN106534244B (en) | Scheduling method and device of proxy resources | |
CN110020062B (en) | Customizable web crawler method and system | |
US11188443B2 (en) | Method, apparatus and system for processing log data | |
JP2010128877A (en) | Web system and method of collecting processing record | |
CN103455600B (en) | A kind of video URL grasping means, device and server apparatus | |
CN110750458A (en) | Big data platform testing method and device, readable storage medium and electronic equipment | |
US10452730B2 (en) | Methods for analyzing web sites using web services and devices thereof | |
CN111651656B (en) | Method and system for dynamic webpage crawler based on agent mode | |
US20170034020A1 (en) | System and method for monitoring bittorrent content and the computers that share bittorrent content | |
CN103927314A (en) | Data batch processing method and device | |
CN105471635B (en) | A kind of processing method of system log, device and system | |
CN112579289B (en) | Distributed analysis engine method and device capable of being intelligently scheduled | |
KR102009020B1 (en) | Method and apparatus for providing website authentication data for search engine | |
CN105721519B (en) | A kind of webpage data acquiring method, apparatus and system | |
CN106648722A (en) | Flume receiving side data processing method and device based on big data | |
CN103886033B (en) | Intelligent vertical searching device and method for safety industry chain | |
Baresi et al. | Microservice architecture practices and experience: a focused look on docker configuration files | |
Hurst et al. | Social streams blog crawler | |
CN114253798A (en) | Index data acquisition method and device, electronic equipment and storage medium | |
CN116974948B (en) | Service system testing method, system, equipment and medium | |
CN111078975A (en) | Multi-node incremental data acquisition system and acquisition method | |
CN105245394A (en) | Method and equipment for analyzing network access log based on layered approach | |
Eyzenakh et al. | High performance distributed web-scraper |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20180223 |