CN103310012A

CN103310012A - Distributed web crawler system

Info

Publication number: CN103310012A
Application number: CN2013102749513A
Authority: CN
Inventors: 王宝会; 于雷; 王丽华; 王新河; 尹科
Original assignee: Beihang University
Current assignee: Huike Education Technology Group Co ltd
Priority date: 2013-07-02
Filing date: 2013-07-02
Publication date: 2013-09-18
Anticipated expiration: 2033-07-02
Also published as: CN103310012B

Abstract

A distributed web crawler system is suitable for the field of network information collection and comprises a management portal, a central node server and a distributed sub-node server, wherein the management portal is a Web interface provided for an administrator by the crawler system and can be used for viewing the logs of the central node server and the distributed sub-node server, setting and adding themes, updating a URL (uniform resource locator) seed of a theme, configuring a theme capture frequency parameter, and controlling a crawler state; the central node server and the distributed sub-node server are the main bodies of the system and can be used for operating the themes, learning a data extractor, analyzing pages and storing target pages. According to the distributed web crawler system, the capture of different themes can be accommodated by a crawler, the webpage capture speed is increased, and the quality meets the user requirement.

Description

A distributed web crawler system

Technical field

The present invention relates to a distributed web crawler system and belongs to the field of network information collection.

Background art

The rapid development of the Internet has brought explosive growth in the amount of information on the World Wide Web. Traditional general-purpose search engines are becoming increasingly important as tools for retrieving Internet information, but they suffer from limitations such as low network coverage and a high loss rate, and therefore cannot provide users with accurate and comprehensive information. To overcome these shortcomings, theme (topical) search engines have emerged. Their goal is to give users the most accurate results in the field they care about while consuming limited bandwidth and hardware resources.

The theme crawler is the foundation of a theme search engine, and the speed and quality with which it crawls web pages are important indicators of search engine quality. It is a system that automatically downloads web pages within a restricted domain, filtering and fetching pages according to a priority order and their relevance to the theme. Unlike a general-purpose crawler, a theme crawler does not pursue high coverage but selectively fetches theme-relevant pages, which gives it low resource usage, convenient index database updates, and accurately cached pages.

However, no existing technique can both judge the relevance of a page to a theme and accommodate the crawling of different themes within a single crawler system, so the speed and quality of web page crawling cannot meet user requirements.

Summary of the invention

Technical problem solved by the present invention: to overcome the deficiencies of the prior art by providing a distributed web crawler system in which a single crawler accommodates the crawling of different themes, improving the speed and quality of web page crawling so that they meet user requirements.

Technical solution of the present invention: a distributed web crawler system comprising a management portal, a central node server, and distributed sub-node servers. The management portal is the Web interface that the crawler system provides to the administrator; it can be used to view the logs of the central node server and the distributed sub-node servers, set up and add themes, update the URL seeds of a theme, configure a theme's crawl frequency parameter, and control the state of the crawler. The central node server and the distributed sub-node servers are the main body of the system and carry out theme operations, learning of data extractors, page analysis, and storage of target pages;

(1) The central node server comprises a URL controller, an extractor module, and a theme control module;

The theme control module receives, through the management interface, the data sent by the management portal, including theme description data, theme addition and deletion data, and data controlling the theme crawl frequency. It carries out the theme operations, namely describing, adding, and deleting themes, controlling the theme crawl frequency, and editing the seed queue of each theme, and it sends the theme seed queues to the extractor module and the URL controller;

The extractor module, after receiving a theme seed queue, first uses a basic classifier to classify the web pages represented by the URLs in the seed queue into Deep Web pages and data-intensive pages. It then analyzes the two kinds of pages separately to find the data extractor that suits each type, records the correspondence between each URL and its data extractor, and sends the records to the URL controller;

The URL controller receives the seed queue sent by the theme control module and the URL-to-extractor records sent by the extractor module and merges the two. Each URL is associated with its corresponding data extractor, and a URL that has no corresponding extractor is associated with the general extractor. All URLs are arranged into a queue, and the task partitioning method dispatches the tasks to the distributed crawler sub-nodes;

(2) Each distributed sub-node server comprises a sub-node URL controller, a data extractor, a search controller, and a web page fetcher;

The sub-node URL controller receives the seed URLs and the corresponding data extractor information sent by the central node server. After receiving the URLs it first checks them for duplicates, then arranges the URLs that have not yet been collected into a queue, and sends the queued URLs and their data extractor information to the data extractor and the web page fetcher;

The data extractor analyzes the Deep Web pages in the sub-node URL queue and extracts the URLs in each page to form new URLs, which are equivalent to the result of submitting the query form, and passes them to the web page fetcher. After receiving a page from the search controller, it uses the extractor that corresponds to the page's URL to extract the content and the URLs in the page, and then sends those URLs to the database of URLs to be collected;

The web page fetcher receives the URLs sent by the sub-node URL controller and the data extractor, crawls the web pages, and passes the crawled pages to the search controller;

The search controller analyzes each collected page. A page that meets the requirements is saved to the page repository; otherwise the page is passed to the data extractor.

The task partitioning method uses weighted least-connection scheduling and is implemented as follows:

(1) Compute the ratio α_i of the difference between the PR of the seed and the minimum PR in the URL queue to the difference between the maximum and minimum PR:

α_{i} = \frac{{PR}_{i} - low (PR)}{top (PR) - low (PR)}

where PR is the page rank, i = 0, 1, ..., n-1, and low(PR) and top(PR) are the minimum and maximum PR values in the queue, respectively;

(2) Compute the weight of the search depth. The depth of a target page reached through the guided crawl is 3, and the search-depth factor β_i is the reciprocal of the page's own depth L_i:

β_{i} = \frac{1}{L_{i}}

(3) Compute the crawl-frequency factor using the Sigmoid function, which is smooth, uniform, and strictly monotonic, with values in the range (0.5 to 1). The normalized crawl frequency x_i is computed as follows:

x_{i} = \frac{F_{i} - low (F)}{top (F) - low (F)}

where F_i is the crawl frequency of the seed, and low(F) and top(F) are the minimum and maximum crawl frequencies in the queue, respectively.

The crawl-frequency factor γ_i is computed as:

γ_{i} = \frac{1}{1 + e^{-a x_{i}}}

where a is a weighting factor greater than 1 that is applied to the linearly normalized result in order to amplify the result of the first step.

(4) Judging from the shape of the Sigmoid curve, a is set to 2.5 in the system of the present invention. The priority weight of a seed is then the arithmetic mean of the three factors:

Q_{i} = \frac{α_{i} + β_{i} + γ_{i}}{3}

(5) The seeds are then sorted in descending order of Q_i. The URL queue in a sub-node inherits the URL weighting algorithm of the central node. Because an extractor-guided crawler in this system crawls only within the website defined by the seed, the crawl-frequency and website-importance factors in Q are constant and only the search-depth factor changes, so the weight is computed as follows:

Q = Q_{prev} - \frac{β_{prev} - β}{3}

where Q_prev is the weight passed down from the parent URL, β_prev is the search-depth factor of the parent URL, and β is the search-depth factor of the target URL. Because a sub-node queue holds many URLs, binary sorting is used, trading space for efficiency. Theoretical analysis and practical tests show that the URL weights are distributed evenly and smoothly between 0 and 1, which avoids the situation in which the sharp decay of one factor lets a single factor decide the ordering, while still taking into account the three main factors of target object, crawl strategy, and search depth, so that priority differences are well reflected. One special case of the algorithm is a request passed in from the search front end: its priority is the highest, its Q value is set to 1, and the value does not decay as it is passed down.

(6) Each sub-node has a weight that represents its processing capacity. The default weight is 1, and the system administrator can set the weights of the servers dynamically. When scheduling a new connection, weighted least-connection scheduling tries to keep the number of established connections on each server proportional to its weight. The algorithm is as follows: suppose there is a group of servers S = {S_0, S_1, ..., S_{n-1}}, where W(S_i) denotes the weight of server S_i and C(S_i) denotes its current number of connections. The total number of current connections over all servers is CSUM = Σ C(S_i) (i = 0, 1, ..., n-1).

The current new connection request, that is, the seed, is sent to server S_m if and only if S_m satisfies the following condition:

\begin{matrix} \frac{C(S_m) / CSUM}{W(S_m)} = \min_{i} \frac{C(S_i) / CSUM}{W(S_i)} \\ \Rightarrow \frac{C(S_m)}{W(S_m)} = \min_{i} \frac{C(S_i)}{W(S_i)}, \quad (i = 0, 1, \cdots, n - 1) \end{matrix}

where W(S_i) is not 0. Each sub-node feeds its log back to the central node at regular intervals, and the connection count C(S_i) of a sub-node server is obtained by reading that log. The method compares the ratio of each sub-node's connection count to its preset weight, finds the sub-node with the lightest load, and assigns the new crawl task to it.
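As an illustration of steps (1) to (4) and (6), the following worked example uses purely hypothetical input values (a seed with PR 6 in a queue whose PR values span 2 to 7, depth 2, crawl frequency 5 in a queue whose frequencies span 1 to 11, and three servers with weights 1, 2, 1 and connection counts 4, 6, 5). Only a = 2.5 comes from the text.

```latex
\begin{align*}
α_{i} &= \frac{6 - 2}{7 - 2} = 0.8, \qquad β_{i} = \frac{1}{2} = 0.5, \\
x_{i} &= \frac{5 - 1}{11 - 1} = 0.4, \qquad
γ_{i} = \frac{1}{1 + e^{-2.5 \times 0.4}} = \frac{1}{1 + e^{-1}} \approx 0.731, \\
Q_{i} &= \frac{0.8 + 0.5 + 0.731}{3} \approx 0.677, \\
\frac{C(S_0)}{W(S_0)} &= \frac{4}{1} = 4, \qquad
\frac{C(S_1)}{W(S_1)} = \frac{6}{2} = 3, \qquad
\frac{C(S_2)}{W(S_2)} = \frac{5}{1} = 5 .
\end{align*}
```

Under these assumed numbers the smallest ratio belongs to S_1, so the new seed would be sent to S_1 even though it currently holds the most connections, because its weight is twice that of the other servers.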

The advantages of the present invention over the prior art are:

(1) The invention designs a search engine for multiple themes within a field, comprising a series of theme search subsystems (such as air tickets and hotels) that share one crawler, so that a single crawler accommodates the crawling of different themes. The invention also provides a new task partitioning algorithm for task distribution in a distributed crawler. Existing literature on this architecture gives only summary descriptions and does not solve problems such as URL allocation and algorithm compatibility that can arise when multiple themes coexist; the present invention solves these problems.

(2) The architecture adopts a multi-theme strategy based on classification labels, which solves the problem of adaptive compatibility among multiple themes in one crawler system. The two-level weighted task partitioning algorithm solves the problems of target-oriented URL allocation and load balancing and improves the scalability of the system.

(3) The improved URL storage strategy proposed by the invention efficiently supports URL lookup, insertion, and duplicate detection. The theme search system developed on this basis offers users rich, theme-specific input interfaces and returns accurate structured content, and its crawler adopts a data-extractor-based architecture. Existing literature on this architecture gives only summary descriptions and does not solve problems such as URL allocation and algorithm compatibility that can arise when multiple themes coexist.

Description of drawings

Fig. 1 is a schematic diagram of the overall architecture of the distributed crawler system of the present invention;

Fig. 2 is an architecture diagram of the central node server of the present invention;

Fig. 3 is an architecture diagram of the distributed sub-node server of the present invention.

Embodiment

The system of the present invention adopts a distributed architecture based on data extractors. It consists of one central master node and distributed crawler servers that cooperate with one another as a whole; the overall architecture is shown in Fig. 1.

As shown in Fig. 1, the invention consists mainly of the following modules:

1. Management portal

The management portal is the Web interface that the crawler system provides to the administrator. It can be used to view the logs of the central node server and the sub-node servers, set up and add themes, update the URL seeds of a theme, configure parameters such as a theme's crawl frequency, and control the state of the crawler. The central node and the distributed crawlers are the main body of the system and carry out theme operations, learning of data extractors, page analysis, and storage of target pages.

2. Central node server

The central master node of the crawler is the control hub. It mainly comprises a URL controller, an extractor module, and a theme control module, as shown in Fig. 2. The functions of the three modules are described below:

(1) Theme control module

The theme control module carries out the theme operations: describing, adding, and deleting themes, controlling the theme crawl frequency, and editing the seed queue of each theme. The seed queue consists of the authoritative pages of the theme, that is, representative pages that can serve as starting points for a series of target information in that theme. For a hotel-search theme crawler, for example, the authoritative pages are the pages of a room-booking site that contain the query form, or the start pages of its hotel listings. A general-purpose search engine is first used to search for the theme description text, which yields an expanded page set for the theme; because this set is small, the seed queue of authoritative pages is then obtained by manual screening.

(2) Extractor module

This module uses a content-based web page analysis algorithm. Starting from the URL seeds, it trains a data extractor for the authoritative website that each seed represents. The seeds that meet the requirements of the previous module fall into two classes, Deep Web pages and data-intensive pages, which a basic classifier that memorizes page features can tell apart. For Deep Web pages, an instance-based query probing method, improved for the specific domain with a travel-domain dictionary, is used to match suitable and complete inputs for the query interface. For data-intensive pages, whose content is structured, a strategy of page-block and directory discovery is used to extract the URLs of the lower-level pages. Through this process the data extractor applicable to each URL seed (analysis algorithm path and search depth) is found, and during crawling on the sub-nodes this model guides the parsing of pages from the target site that the seed represents.

(3) URL controller

The URL controller is mainly responsible for ordering the URL queue in the central node and for partitioning tasks according to the load reported by each sub-node. Because a two-level URL allocation strategy is used, the central node server stores only seed URLs. The sorting algorithm determines priority from the theme crawl frequency and the weight of the website that the seed represents, and computes the concurrency needed per unit of time. Task partitioning uses weighted least-connection scheduling.

The central node server works as follows:

(1) The theme control module receives, through the management interface, the data sent by the management portal (theme description data, theme addition and deletion data, and data controlling the theme crawl frequency). It carries out the theme operations, namely describing, adding, and deleting themes, controlling the theme crawl frequency, and editing the seed queue of each theme, and it sends the theme seed queues to the extractor module and the URL controller. (Note: a seed queue is a URL queue, that is, a group of URLs, but the URLs in a seed queue are special in that they are generally the home page URLs of the websites to be collected, or the home page URLs of the columns or categories to be collected.)

(2) After receiving a theme seed queue, the extractor module first uses the basic classifier to classify the web pages represented by the URLs in the seed queue. (Classification first requires visiting the page that each URL points to and collecting its content before the subsequent classification can be performed.) The pages are divided into Deep Web pages and data-intensive pages, and the two kinds are then analyzed separately (see the description of the extractor module above for details) to find the data extractor that suits each type. The correspondence between each URL and its data extractor is recorded, and the records are sent to the URL controller.

(3) The URL controller receives the seed queue sent by the theme control module and the URL-to-extractor records sent by the extractor module and merges the two. Each URL is associated with its corresponding data extractor, and a URL that has no corresponding extractor is associated with the general extractor. All URLs are arranged into a queue, and the task partitioning method dispatches the tasks to the distributed crawler sub-nodes.
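The merging performed by the URL controller in step (3) can be sketched in a few lines of Python. This is a minimal illustration only: the function name `build_task_queue`, the constant `GENERAL_EXTRACTOR`, and the example URLs and extractor labels are assumptions made for the sketch and do not come from the invention.

```python
# Hypothetical label for the fallback ("general") extractor.
GENERAL_EXTRACTOR = "general"


def build_task_queue(seed_queue, extractor_records):
    """Pair every seed URL with its extractor, falling back to the general one.

    seed_queue: list of seed URLs from the theme control module.
    extractor_records: dict mapping URL -> extractor label from the extractor module.
    """
    return [(url, extractor_records.get(url, GENERAL_EXTRACTOR)) for url in seed_queue]


# Illustrative data; the URLs and labels are made up for this example.
seed_queue = [
    "http://hotel.example/search",
    "http://ticket.example/list",
    "http://news.example/",
]
extractor_records = {
    "http://hotel.example/search": "deep_web",
    "http://ticket.example/list": "data_intensive",
}
tasks = build_task_queue(seed_queue, extractor_records)
```

The resulting list keeps the order of the seed queue, so a later priority sort and the task partitioning step can operate on complete (URL, extractor) pairs.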

3. Distributed sub-node server

As shown in Fig. 3, the distributed sub-node server is the component that actually executes the crawl. It mainly comprises a sub-node URL controller, a data extractor, a search controller, and a web page fetcher.

The distributed crawler sub-node works as follows:

(1) The sub-node URL controller receives the seed URLs and the corresponding data extractor information sent by the central node server. After receiving the URLs it first checks them for duplicates (there are many URL deduplication methods for Internet crawlers; because they are not the focus here they are not described in detail, and any general URL deduplication method can be used), and then arranges the URLs that have not yet been collected into a queue. It sends the queued URLs and their data extractor information to the data extractor module and the web page fetcher module.

(2) The data extractor analyzes the Deep Web pages in the sub-node URL queue and extracts the URLs in each page to form new URLs, which are equivalent to the result of submitting the query form, and passes them to the web page fetcher.

(3) The web page fetcher module receives the URLs sent in the previous two steps and crawls the web pages.

(4) The web page fetcher passes the crawled pages to the search controller.

(5) The search controller receives and analyzes each collected page. A page that meets the requirements is saved to the page repository; otherwise the page is passed to the data extractor.

(6) After receiving a page from the search controller, the data extractor uses the extractor that corresponds to the page's URL to extract the content and the URLs in the page, and then sends those URLs to the database of URLs to be collected.

Each module is described in more detail below.

(1) Sub-node URL controller

The sub-node URL controller receives the seed URLs distributed by the central node and the URLs extracted from web pages and stores them in the URL database. Storage uses a Trie data structure, which allows duplicate detection and fast insertion for newly added URLs. The URL database uses an update strategy that works per website, which ensures that content updates are not blocked by duplicate detection. URLs that pass the check are ordered by the two-level URL weight-propagation sorting algorithm: the controller takes the weight passed down from the parent page, combines it with the depth in the search strategy to rank the URLs by priority, and passes them to the web page fetcher.
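A Trie supports duplicate detection and insertion in a single pass over the URL. The sketch below is one possible character-level Trie built on nested dictionaries; the class name `UrlTrie`, the end-of-URL marker, and the choice of one node per character are assumptions of this example, since the text does not specify how the Trie is keyed.

```python
_END = ""  # marker key; never collides with a character because it is empty


class UrlTrie:
    """Character-level trie that stores URLs and detects duplicates."""

    def __init__(self):
        self._root = {}

    def insert(self, url):
        """Insert url; return True if it was new, False if it was a duplicate."""
        node = self._root
        for ch in url:
            node = node.setdefault(ch, {})
        if _END in node:
            return False
        node[_END] = True
        return True

    def __contains__(self, url):
        node = self._root
        for ch in url:
            node = node.get(ch)
            if node is None:
                return False
        return _END in node


trie = UrlTrie()
first = trie.insert("http://hotel.example/list?page=1")   # new URL
second = trie.insert("http://hotel.example/list?page=1")  # duplicate
third = trie.insert("http://hotel.example/list?page=2")   # shares a long prefix
```

Because URLs from the same website share long prefixes, the common part is stored once, and both lookup and insertion cost time proportional to the URL length rather than to the number of stored URLs.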

(2) Data extractor

For Deep Web objects from the URL queue, the data extractor applies the query probing algorithm trained by the central node. Pattern matching fills in the concrete parameters to form a new URL, which is equivalent to the result of submitting the form, and this URL is passed to the web page fetcher. The module's other input is pages that have been judged by the search controller. The URL search strategy guarantees that such a page is a data-intensive page; using the trained page-block discovery algorithm, the module extracts the two kinds of URLs of interest, page-turning links and links to lower-level data pages, and sends them to the URL database.

(3) Web page fetcher

This is a multi-threaded parallel module responsible for collecting pages over the HTTP protocol. The basic steps are:

a. Extract the address and port number of the target site from the page URL and establish a network connection to that address and port;

b. Assemble the HTTP request header from the page URL and send it to the target site. If no response arrives within a certain time, abort the crawl of this page and discard it; otherwise continue with the next step;

c. Analyze the response. If the returned status code is 2xx, the page was returned correctly, so proceed to the next step. If the status code is 301 or 302, the page has been redirected, so extract the new target URL from the response header and return to the previous step. Any other status code indicates that the page connection failed, so abort the crawl of this page and discard it;

d. Extract page information such as date, length, and page type from the response header;

e. Read the page content. For long pages, the content is read in blocks and then spliced together to guarantee its integrity.
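The decision logic of steps b to e can be sketched as follows. To keep the example runnable without network access, the transport is injected: `send` is a hypothetical callable that returns a `(status, headers, body)` tuple or raises `TimeoutError`. The function name `fetch_page`, the redirect limit, and the stand-in `fake_send` are all assumptions of this sketch, and connection setup (step a) and block-wise reading are left to the transport.

```python
from urllib.parse import urljoin


def fetch_page(url, send, max_redirects=5):
    """Apply steps b to e; return page info as a dict, or None if the page is discarded."""
    for _ in range(max_redirects + 1):
        try:
            status, headers, body = send(url)        # step b: send the request
        except TimeoutError:
            return None                              # no response in time: discard
        if 200 <= status < 300:                      # step c: page returned correctly
            return {
                "url": url,
                "date": headers.get("Date"),         # step d: header information
                "length": headers.get("Content-Length"),
                "type": headers.get("Content-Type"),
                "body": body,                        # step e: content read by the transport
            }
        if status in (301, 302) and "Location" in headers:
            url = urljoin(url, headers["Location"])  # redirected: retry with the new URL
            continue
        return None                                  # any other status: discard
    return None                                      # redirect limit reached (assumption)


def fake_send(url):
    """Stand-in transport for the example: one redirect, then a page."""
    if url == "http://site.example/old":
        return 302, {"Location": "/new"}, b""
    if url == "http://site.example/new":
        return 200, {"Content-Type": "text/html", "Content-Length": "5"}, b"hello"
    return 404, {}, b""


page = fetch_page("http://site.example/old", fake_send)
missing = fetch_page("http://site.example/missing", fake_send)
```

In a real fetcher the `send` callable would wrap an HTTP client with a timeout, and several such fetches would run in parallel threads as the module description indicates.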

(4) Search controller

The search strategy of the invention is a best-first search strategy improved for the concrete application. Analysis of Deep Web pages and directory-block pages shows that most target objects are body-text pages and that the crawl depth does not exceed 3 levels. This module judges the crawled page content according to the search strategy: body-text pages that satisfy the search depth are stored in the page repository, where they wait for structuring by the indexing module; other pages are passed to the corresponding data extractor for page analysis and URL extraction.
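The routing decision of the search controller can be illustrated with a small sketch. How a page is recognized as a body-text page is not specified in the text, so the `is_text_page` flag below is simply assumed to have been set by an earlier analysis step; the names `route_page`, `MAX_DEPTH`, and the destination labels are likewise illustrative.

```python
MAX_DEPTH = 3  # crawl depth limit stated in the text


def route_page(page):
    """Return the destination of a crawled page.

    page: dict with keys "is_text_page" (bool, assumed precomputed)
          and "depth" (int, search depth of the page).
    """
    if page["is_text_page"] and page["depth"] <= MAX_DEPTH:
        return "page_repository"   # stored, waits for the indexing module
    return "data_extractor"        # needs page analysis and URL extraction


target = route_page({"is_text_page": True, "depth": 3})
listing = route_page({"is_text_page": False, "depth": 2})
```

Pages routed to the data extractor produce further URLs, which is how the crawl moves from seeds at depth 1 toward target pages at depth 3.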

The task partitioning algorithm is as follows:

In a distributed crawler system, balanced distribution of crawl tasks is one of the key issues affecting system performance and resource allocation. Current distributed crawler systems use either a centralized task partitioning strategy or one based on two-level hash mapping. Both strategies address only even distribution and consider neither the influence of URL priority nor the load of the sub-nodes. A task partitioning strategy for a theme crawler should take into account both the ordering of the URL queue and balanced scheduling based on sub-node load. For the architecture of this system, the task partitioning algorithm comprises a two-level URL weight-propagation sorting algorithm and a hash-based weighted least-connection URL scheduling method.

A two-level weight-propagation sorting algorithm is designed for the URL queues in the central node and the sub-nodes. At the central node level, the URL queue consists mainly of the URL seeds of the different themes, and the seed attributes that affect crawl quality are website importance, crawl frequency, and search depth. A seed is the authoritative page through which a theme appears on the corresponding website, so its PageRank can be mapped to the influence of the website in that subject area. Importance is evaluated with the PageRank algorithm, which is based on network topology. The page PR values provided by Google are integers with a theoretical range of 0 to 10, but statistics show that the PR of most pages is below 7. For uniform normalization, the website-importance factor is therefore computed with a linear function, namely the ratio α_i of the difference between the PR of the seed and the minimum PR in the URL queue to the difference between the maximum and minimum PR:

α_{i} = \frac{{PR}_{i} - low (PR)}{top (PR) - low (PR)}

Search depth is the level assigned to a page in the best-first strategy. There are three levels: a seed that carries a Hidden Web form has depth 1, a data-intensive page with a directory-block structure has depth 2, and a target page has depth 3. The search-depth factor β_i is the reciprocal of the page's own depth L_i:

β_{i} = \frac{1}{L_{i}}

The crawl-frequency factor corresponds to the update interval that the administrator sets according to the requirements of the search front end and the update strategy: the shorter the update interval, the higher the crawl frequency and the higher the seed priority. Crawl frequency is divided into seed frequency and theme frequency. The theme frequency, set according to the nature of the theme, is mandatory; if the seed frequency is not set, it inherits the theme frequency. The sampled crawl intervals vary widely and jump sharply. For example, the ticket-transfer module has a high immediacy requirement and its theme interval is set to 15 min, whereas hotel prices change little and the hotel-booking interval is measured in days. If the frequency were normalized by a linear function alone, the frequency factor would have one very large value followed by rapid decay and would become the deciding factor in the ordering, which is clearly unreasonable. After study and comparison, the value is first normalized linearly, then weighted, and finally passed through the Sigmoid function to even it out. The Sigmoid function is smooth, uniform, and strictly monotonic, with values in the range (0.5 to 1). The calculation is as follows:

x_{i} = \frac{F_{i} - low (F)}{top (F) - low (F)}

where F_i is the crawl frequency of the seed; low(F) and top(F) are the minimum and maximum crawl frequencies in the queue, respectively; x_i is the normalized crawl frequency; and γ_i is the crawl-frequency factor:

γ_{i} = \frac{1}{1 + e^{-a x_{i}}}

where a is a weighting factor greater than 1 that is applied to the linearly normalized result in order to amplify the result of the first step. Judging from the shape of the Sigmoid curve, a is set to 2.5 in the system. The priority weight of a seed is then the arithmetic mean of the three factors:

Q_{i} = \frac{α_{i} + β_{i} + γ_{i}}{3}
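The three factors and their mean can be computed as in the following sketch. The `Seed` class, the function names, the example seeds, and the handling of a queue in which all values are equal (the normalized value is taken as 0) are assumptions of this example; only the formulas and a = 2.5 come from the text.

```python
import math
from dataclasses import dataclass

A = 2.5  # weighting factor a given in the text


@dataclass
class Seed:
    url: str
    pr: float     # PageRank value PR_i of the seed page
    depth: int    # search depth L_i (1, 2 or 3)
    freq: float   # crawl frequency F_i


def normalize(value, low, top):
    """Linear min-max normalization; 0.0 if all values are equal (assumption)."""
    return 0.0 if top == low else (value - low) / (top - low)


def priority_weights(seeds):
    """Return {url: Q_i} with Q_i = (alpha_i + beta_i + gamma_i) / 3."""
    prs = [s.pr for s in seeds]
    freqs = [s.freq for s in seeds]
    weights = {}
    for s in seeds:
        alpha = normalize(s.pr, min(prs), max(prs))        # website importance
        beta = 1.0 / s.depth                               # search depth
        x = normalize(s.freq, min(freqs), max(freqs))      # normalized frequency
        gamma = 1.0 / (1.0 + math.exp(-A * x))             # Sigmoid of a * x_i
        weights[s.url] = (alpha + beta + gamma) / 3.0
    return weights


# Hypothetical seeds; frequencies are in crawls per day.
seeds = [
    Seed("http://list.example/", pr=2, depth=2, freq=24.0),
    Seed("http://ticket.example/", pr=4, depth=1, freq=96.0),
    Seed("http://hotel.example/", pr=6, depth=1, freq=1.0),
]
weights = priority_weights(seeds)
ranked = sorted(seeds, key=lambda s: weights[s.url], reverse=True)
```

With these made-up inputs the hotel seed ranks first despite having the lowest crawl frequency, because the Sigmoid keeps γ between 0.5 and about 0.92 and so prevents the frequency factor from dominating the mean.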

The seeds are then sorted in descending order of Q_i. Because the number of seeds in the central node queue is limited, insertion sort is used, which saves memory and takes about the same time as other sorting algorithms. The URL queue in a sub-node inherits the URL weighting algorithm of the central node. Analysis shows that an extractor-guided crawler in this system crawls only within the website defined by the seed, so the crawl-frequency and website-importance factors in Q are constant and only the search-depth factor changes. The weight is therefore computed as follows:

Q = Q_{prev} - \frac{β_{prev} - β}{3}
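The propagation formula only needs the parent's weight and the two depths, as the following sketch shows. The function names and the example starting weight of 0.8 are assumptions for illustration.

```python
def depth_factor(depth):
    """Search-depth factor beta = 1 / L."""
    return 1.0 / depth


def child_weight(q_prev, parent_depth, child_depth):
    """Q = Q_prev - (beta_prev - beta) / 3 for a URL found on a parent page."""
    return q_prev - (depth_factor(parent_depth) - depth_factor(child_depth)) / 3.0


# Hypothetical seed with Q = 0.8 at depth 1, followed down to depth 3.
q_seed = 0.8
q_level2 = child_weight(q_seed, parent_depth=1, child_depth=2)
q_level3 = child_weight(q_level2, parent_depth=2, child_depth=3)
```

Because α and γ stay fixed within one website, applying the formula step by step gives the same value as recomputing the mean with the new depth factor, so a sub-node never needs the PR or frequency data held by the central node.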

The central node queue is partitioned with the weighted least-connection scheduling algorithm. Each sub-node has a weight that represents its processing capacity. The default weight is 1, and the system administrator can set the weights of the sub-node servers dynamically. When scheduling a new connection, weighted least-connection scheduling tries to keep the number of established connections on each server proportional to its weight. Tasks are allocated as follows: suppose there is a group of sub-node servers S = {S_0, S_1, ..., S_{n-1}}, where W(S_i) denotes the weight of sub-node server S_i and C(S_i) denotes its current number of connections. The total number of current connections over all sub-node servers is CSUM = Σ C(S_i) (i = 0, 1, ..., n-1).

The current new connection request, that is, the seed, is sent to sub-node server S_m if and only if S_m satisfies the following condition:

\begin{matrix} \frac{C(S_m) / CSUM}{W(S_m)} = \min_{i} \frac{C(S_i) / CSUM}{W(S_i)} \\ \Rightarrow \frac{C(S_m)}{W(S_m)} = \min_{i} \frac{C(S_i)}{W(S_i)}, \quad (i = 0, 1, \cdots, n - 1) \end{matrix}

where W(S_i) is not 0. Each sub-node feeds its log back to the central node at regular intervals, and the connection count C(S_i) of a sub-node server is obtained by reading that log. The invention compares the ratio of each sub-node's connection count to its preset weight, finds the sub-node with the lightest load, and assigns the new crawl task to it.
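The selection rule reduces to picking the sub-node with the smallest C(S_i)/W(S_i). The sketch below shows this with illustrative names (`SubNode`, `pick_sub_node`, `dispatch`). One simplification is an assumption of the example: the connection count is incremented locally after each dispatch, whereas the text obtains C(S_i) from the logs that the sub-nodes report periodically.

```python
from dataclasses import dataclass


@dataclass
class SubNode:
    name: str
    weight: float = 1.0    # W(S_i); default weight is 1
    connections: int = 0   # C(S_i)


def pick_sub_node(nodes):
    """Return the node with the smallest C(S_i) / W(S_i), skipping weight 0."""
    candidates = [n for n in nodes if n.weight > 0]
    if not candidates:
        raise ValueError("no sub-node with a non-zero weight")
    return min(candidates, key=lambda n: n.connections / n.weight)


def dispatch(seeds, nodes):
    """Assign each seed to the least-loaded node; return {seed: node name}."""
    assignment = {}
    for seed in seeds:
        node = pick_sub_node(nodes)
        node.connections += 1      # local bookkeeping (assumption, see lead-in)
        assignment[seed] = node.name
    return assignment


nodes = [SubNode("A", weight=1.0), SubNode("B", weight=2.0)]
assignment = dispatch(["s1", "s2", "s3"], nodes)
```

Dividing by CSUM is unnecessary in code because it is the same positive constant for every node, which is exactly the simplification expressed by the second line of the condition above. When several nodes tie, `min` returns the first one in list order.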

Parts of the invention that are not described in detail belong to techniques well known in the art.

The above is a further detailed description of the invention in connection with specific preferred embodiments, and the specific embodiments of the invention should not be regarded as limited to this description. A person of ordinary skill in the art to which the invention belongs may make simple deductions or substitutions without departing from the concept of the invention, and all of these should be regarded as falling within the scope of patent protection determined by the submitted claims.

Claims

1. a distributed network crawler system is characterized in that comprising: managing portal, Centroid server, distributed child node server; Managing portal is the Web interface that crawler system provides the keeper, can check the daily record of Centroid server and distributed child node server, the interpolation theme is set, upgrade the URL seed of certain theme, the crawl frequency parameter of configuration theme, the state of control reptile; Centroid server and distributed child node server are the main bodys of system as reptile, finish the storage of study, page analysis and the target pages of theme operation, data pick-up device;

an extractor module which, after receiving the theme seed queue, first uses a basic analyzer to classify the web pages addressed by the URLs in the seed queue into Deep Web pages and data-intensive pages, then performs extraction on the two kinds of pages separately to find the data extractor corresponding to each type, and then records the correspondence between each URL address and its data extractor and sends that record to the URL controller;

a URL controller, which receives the seed queue sent by the theme control module and the URL addresses and corresponding data extractor records sent by the extractor module, integrates these two sets of data, maps each URL address to the corresponding data extractor in the distributed sub-node servers, maps any URL address without a corresponding data extractor to a general extractor, places all URL addresses in a queue, and sends tasks to each distributed crawler sub-node according to the task division method;

a sub-node URL controller, which receives the seed URLs and corresponding data extractor information sent by the central node server; after receiving the URLs it first checks the URL addresses for duplicates, then places the URL addresses that have not already been collected in a queue, and sends the URL addresses in the queue and the corresponding data extractor information to the data extractor and the web page grabber;

a data extractor, which performs page analysis on the Deep Web pages from the sub-node URL queue and extracts the URLs in each page to form new URLs, each equivalent to the target of a form submission, and passes them to the web page grabber; after receiving a page sent by the search controller, it uses the extractor corresponding to the page's URL address to extract the content and the URL addresses in the page, and then sends those URL addresses to the URL address base to await collection;

a web page grabber, which receives the URL addresses sent by the sub-node URL controller and the data extractor, fetches the web pages, and provides the fetched pages to the search controller.

2. The distributed web crawler system according to claim 1, characterized in that the task division method is implemented as follows:

α_{i} = \frac{{PR}_{i} - low (PR)}{top (PR) - low (PR)}

β_{i} = \frac{1}{L_{i}}

(3) The crawl frequency factor is computed with a Sigmoid function; the normalized crawl frequency x_i is first calculated as follows:

x_{i} = \frac{F_{i} - low (F)}{top (F) - low (F)}

where F_i is the crawl frequency of the seed, and low(F) and top(F) are respectively the minimum and maximum frequency in the queue;

The crawl frequency influence factor γ_i is calculated as:

γ_{i} = \frac{1}{1 + e^{- a x_{i}}}

where a is a weighting factor greater than 1 applied to the linearly normalized result, intended to amplify the result of the first step of the calculation;

(4) Based on the Sigmoid function curve, the priority weight Q_i of a seed is the arithmetic mean of the three influence factors:

Q_{i} = \frac{α_{i} + β_{i} + γ_{i}}{3}

(5) The seeds are then sorted in descending order of Q_i; for Q, the two factors of crawl frequency and website importance remain constant and the value changes only with the search depth factor, so it is calculated as follows:

Q = Q_{prev} - \frac{β_{prev} - β}{3}

where Q_prev is the weight passed down from the parent URL; β_prev is the search depth factor of the parent URL; β is the search depth factor of the target URL;

(6) New tasks are distributed as follows: suppose there is a group of sub-node servers S = {S0, S1, ..., Sn-1}, where W(Si) denotes the weight of sub-node server Si, C(Si) denotes the current number of connections of sub-node server Si, and the sum of the current connections of all sub-node servers is CSUM = Σ C(Si), i = 0, 1, ..., n-1; a new task is sent to the sub-node server Sm that satisfies:

\begin{matrix} \frac{C(S_m) / CSUM}{W(S_m)} = \min_{i} \frac{C(S_i) / CSUM}{W(S_i)} \\ \Rightarrow \frac{C(S_m)}{W(S_m)} = \min_{i} \frac{C(S_i)}{W(S_i)} \end{matrix}

where W(Si) is not 0; the logs of the sub-nodes are fed back to the central node server at regular intervals, and the connection count C(Si) of each sub-node server is obtained by reading the logs; the ratio of each distributed sub-node server's connection count to its a priori weight is compared to find the sub-node with the minimum load, to which the new crawl task is assigned.
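As an illustration of the priority formulas recited in claim 2, the calculation of α_i, β_i, γ_i, Q_i and the inherited weight Q can be sketched in Python as follows. This is a sketch under stated assumptions, not the claimed implementation: it assumes PR_i is a website importance score and L_i is a search depth of at least 1, picks a = 2.0 arbitrarily as a value greater than 1, returns 0 when all values in the queue are equal (a case the text does not address), and uses hypothetical function names and sample URLs.

```python
import math


def normalize(value, low, top):
    """Linear min-max normalization, as in the formulas for alpha_i and x_i."""
    if top == low:
        return 0.0  # assumption: the text does not cover this degenerate case
    return (value - low) / (top - low)


def seed_priorities(seeds, a=2.0):
    """seeds: list of (url, PR_i, L_i, F_i); returns [(url, Q_i)], highest first."""
    prs = [s[1] for s in seeds]
    freqs = [s[3] for s in seeds]
    scored = []
    for url, pr, depth, freq in seeds:
        alpha = normalize(pr, min(prs), max(prs))        # importance factor
        beta = 1.0 / depth                               # search depth factor
        x = normalize(freq, min(freqs), max(freqs))      # normalized frequency
        gamma = 1.0 / (1.0 + math.exp(-a * x))           # Sigmoid of a * x_i
        scored.append((url, (alpha + beta + gamma) / 3.0))
    return sorted(scored, key=lambda item: item[1], reverse=True)


def child_priority(q_prev, beta_prev, beta):
    """Q = Q_prev - (beta_prev - beta) / 3 for a URL found under a parent URL."""
    return q_prev - (beta_prev - beta) / 3.0


# Illustrative values only: (url, PR_i, L_i, F_i).
ranked = seed_priorities([("http://a.example/", 0.9, 1, 10),
                          ("http://b.example/", 0.1, 2, 2)])
```

In this example the first seed has the larger importance, the shallower depth and the higher crawl frequency, so it is ranked first.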