CN103310012A - Distributed web crawler system - Google Patents

Info

Publication number
CN103310012A
Authority
CN
China
Prior art keywords
url
child node
theme
page
seed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013102749513A
Other languages
Chinese (zh)
Other versions
CN103310012B (en)
Inventor
王宝会
于雷
王丽华
王新河
尹科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huike Education Technology Group Co ltd
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN201310274951.3A priority Critical patent/CN103310012B/en
Publication of CN103310012A publication Critical patent/CN103310012A/en
Application granted granted Critical
Publication of CN103310012B publication Critical patent/CN103310012B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A distributed web crawler system is suitable for the field of network information collection and comprises a management portal, a central node server and a distributed sub-node server, wherein the management portal is a Web interface provided for an administrator by the crawler system and can be used for viewing the logs of the central node server and the distributed sub-node server, setting and adding themes, updating a URL (uniform resource locator) seed of a theme, configuring a theme capture frequency parameter, and controlling a crawler state; the central node server and the distributed sub-node server are the main bodies of the system and can be used for operating the themes, learning a data extractor, analyzing pages and storing target pages. According to the distributed web crawler system, the capture of different themes can be accommodated by a crawler, the webpage capture speed is increased, and the quality meets the user requirement.

Description

A distributed web crawler system
Technical field
The present invention relates to a distributed web crawler system and belongs to the field of network information collection.
Background art
The rapid development of the network has brought explosive growth in the amount of information on the World Wide Web, and traditional general-purpose search engines have become increasingly important as tools for retrieving Internet information. However, because of inherent limitations such as low network coverage and a high miss rate, they cannot give users accurate and comprehensive information. To overcome these shortcomings of general-purpose search engines, theme (topical) search engines emerged. Their goal is to give users the most accurate results in the field they care about while consuming limited bandwidth and hardware resources.
The theme crawler is the basis of a theme search engine, and the speed and quality with which it crawls web pages are important indicators of search engine quality. It is a system that automatically downloads web pages within a restricted field, screening and fetching pages according to a priority order and their relevance to the theme. Unlike a general-purpose crawler, a theme crawler does not pursue high coverage but selectively fetches theme-relevant pages, which gives it the advantages of low resource usage, convenient index database updates, and accurate cached pages.
However, the prior art cannot both judge the relevance of a page to a theme and accommodate the crawling of different themes within one crawler system, so the speed and quality of web page crawling cannot meet user requirements.
Summary of the invention
Technical problem solved by the invention: to overcome the deficiencies of the prior art by providing a distributed web crawler system in which one crawler accommodates the crawling of different themes, improving the speed and quality of web page crawling so that they meet user requirements.
Technical solution of the invention: a distributed web crawler system comprising a management portal, a central node server, and distributed child node servers. The management portal is the Web interface that the crawler system provides to the administrator. Through it the administrator can view the logs of the central node server and the distributed child node servers, set up and add themes, update the URL seeds of a theme, configure the crawl frequency parameter of a theme, and control the state of the crawler. The central node server and the distributed child node servers are the main body of the system and carry out theme operations, the learning of data extractors, page analysis, and the storage of target pages.
(1) The central node server comprises a URL controller, an extractor module, and a theme control module.
The theme control module receives, through the management interface, the data sent by the management portal, including theme description data, theme addition and deletion data, and data controlling the theme crawl frequency. It carries out the operations on themes: describing, adding, and deleting themes, controlling the theme crawl frequency, and editing the seed queue of each theme. It sends the theme seed queues to the extractor module and the URL controller.
The extractor module, after receiving a theme seed queue, first uses a basic classifier to classify the web pages represented by the URL addresses in the seed queue into Deep Web pages and data-intensive pages. It then performs extraction analysis on each of the two page types, finds the data extractor that corresponds to each type, records the correspondence between each URL address and its data extractor, and sends this record to the URL controller.
The URL controller receives the seed queue sent by the theme control module and the records of URL addresses and corresponding extractors sent by the extractor module, and merges the two. Each URL address is associated with its data extractor, and a URL address that has no corresponding extractor is associated with the general extractor. All URL addresses are then arranged into a queue, and tasks are sent to the distributed child nodes of the crawler by the task partitioning method.
(2) The distributed child node server comprises a child node URL controller, a data extractor, a search controller, and a web page grabber.
The child node URL controller receives the seed URLs and the corresponding data extractor information sent by the central node server. After receiving the URLs it first deduplicates the URL addresses, then arranges the URL addresses that have not already been collected into a queue, and sends the queued URL addresses and their data extractor information to the data extractor and the web page grabber.
The data extractor analyzes the Deep Web pages from the child node URL queue and extracts URLs from each page to form new URLs, each equivalent to the object produced by submitting the form, and passes them to the web page grabber. After receiving a page sent by the search controller, it uses the extractor that corresponds to the page's URL address to extract the content and the URL addresses in the page, and then sends the URL addresses to be collected to the URL database.
The web page grabber receives the URL addresses sent by the URL controller and the data extractor, crawls the web pages, and provides the crawled pages to the search controller.
The search controller analyzes each collected page it receives. A page that meets the requirements is saved to the page pool; otherwise the page is passed to the data extractor.
The task partitioning method uses weighted least-connection scheduling and is implemented as follows:
(1) Compute $\alpha_i$, the ratio of the difference between the PR of the seed and the minimum PR in the URL queue to the difference between the maximum and minimum PR:
$$\alpha_i = \frac{PR_i - \mathrm{low}(PR)}{\mathrm{top}(PR) - \mathrm{low}(PR)}$$
where PR is the PageRank of the page, $i = 0, 1, \ldots, n-1$, and $\mathrm{low}(PR)$ and $\mathrm{top}(PR)$ are the minimum and maximum PR values respectively.
(2) Compute the weight of the search depth. The depth of a text-type target page is 3, and the search depth influence factor $\beta_i$ is the reciprocal of the page's own depth $L_i$:
$$\beta_i = \frac{1}{L_i}$$
(3) Compute the crawl frequency factor with the Sigmoid function, which is smooth and strictly monotonic and here takes values between 0.5 and 1. The normalized crawl frequency $x_i$ is computed as:
$$x_i = \frac{F_i - \mathrm{low}(F)}{\mathrm{top}(F) - \mathrm{low}(F)}$$
where $F_i$ is the crawl frequency of the seed, and $\mathrm{low}(F)$ and $\mathrm{top}(F)$ are the minimum and maximum frequencies in the queue respectively.
The crawl frequency influence factor $\gamma_i$ is computed as:
$$\gamma_i = \frac{1}{1 + e^{-a x_i}}$$
where $a$ is a weighting factor greater than 1 applied to the linearly normalized result, whose purpose is to amplify the result of the first step.
(4) Judging from the Sigmoid function curve, $a$ is set to 2.5 in the system of the invention. The priority weight of a seed is then the arithmetic mean of the three influence factors:
$$Q_i = \frac{\alpha_i + \beta_i + \gamma_i}{3}$$
(5) The seeds are then sorted in descending order of $Q_i$. The URL queue in a child node inherits the URL weighting algorithm of the central node. Because the extractor-guided crawler only crawls within the website defined by the seed, the crawl frequency and website importance factors in the Q value stay constant and only the search depth factor changes, so the weight is computed as:
$$Q = Q_{prev} - \frac{\beta_{prev} - \beta}{3}$$
where $Q_{prev}$ is the weight passed down from the parent URL, $\beta_{prev}$ is the search depth factor of the parent URL, and $\beta$ is the search depth factor of the target URL. Because the child node queue holds many URLs, binary insertion sorting is used, trading space for efficiency. Theoretical analysis and practical testing show that the URL weights are distributed evenly and smoothly between 0 and 1. This avoids a single factor becoming decisive because of its sharp decay, and it takes into account the three main factors of target object, crawl strategy, and search depth, so priority differences are well reflected. One special case handled by the algorithm is a request passed in from the search front end: its priority is the highest, its Q value is set to 1, and it is not attenuated as it is passed on.
(6) Each child node represents its processing capacity by a weight. The default weight is 1, and the system administrator can set the weights of the servers dynamically. When scheduling a new connection, weighted least-connection scheduling tries to keep the number of established connections of each server proportional to its weight. The algorithm is as follows: suppose there is a set of servers $S = \{S_0, S_1, \ldots, S_{n-1}\}$, where $W(S_i)$ is the weight of server $S_i$ and $C(S_i)$ is its current number of connections. The sum of the current connections of all servers is $CSUM = \sum C(S_i)$ $(i = 0, 1, \ldots, n-1)$.
The current new connection request is sent to server $S_m$, that is, the seed is dispatched to it, if and only if $S_m$ satisfies:
$$\frac{C(S_m)/CSUM}{W(S_m)} = \min_i\left\{\frac{C(S_i)/CSUM}{W(S_i)}\right\} \;\Rightarrow\; \frac{C(S_m)}{W(S_m)} = \min_i\left\{\frac{C(S_i)}{W(S_i)}\right\}, \quad i = 0, 1, \ldots, n-1$$
where $W(S_i)$ is not 0. Each child node feeds its log back to the central node at regular intervals, and the connection count $C(S_i)$ of a child server is obtained by reading the log. The method compares the ratios of child node connection count to prior weight, finds the child node with the lowest load, and assigns it the new crawl task.
Compared with the prior art, the advantages of the invention are:
(1) The invention designs a search engine covering multiple themes within a field. It contains a series of theme search subsystems (such as air tickets and hotels) that share one crawler, so that one crawler accommodates the crawling of different themes. The invention also provides a new task partitioning algorithm for task distribution in a distributed crawler. The existing literature on this architecture gives only summary descriptions and does not solve problems such as URL allocation and algorithm compatibility that may arise when multiple themes coexist; the invention solves these problems.
(2) The architecture of the invention adopts a multi-theme strategy based on classification labeling, which solves the problem of adaptive multi-theme compatibility within one crawler system. The two-level weighted task partitioning algorithm solves the problem of target-guided, load-balanced URL allocation and strengthens the scalability of the system.
(3) The improved URL storage strategy proposed by the invention efficiently supports URL lookup, insertion, and duplicate detection. The theme search system developed on this system offers users rich theme-specific input interfaces and returns accurate structured content, and its crawler adopts a data-extractor-based architecture. The existing literature on this architecture gives only summary descriptions and does not solve problems such as URL allocation and algorithm compatibility that may arise when multiple themes coexist.
Description of drawings
Fig. 1 is a schematic diagram of the overall architecture of the distributed crawler system of the invention;
Fig. 2 is the architecture diagram of the central node server of the invention;
Fig. 3 is the architecture diagram of the distributed child node server of the invention.
Embodiment
The system of the invention adopts a distributed architecture based on data extractors. It consists of one central master node and distributed crawler servers that cooperate as a whole; the overall architecture is shown in Fig. 1.
As shown in Fig. 1, the invention consists mainly of the following modules:
1. Management portal
The management portal is the Web interface that the crawler system provides to the administrator. Through it the administrator can view the logs of the central and child servers, set up and add themes, update the URL seeds of a theme, configure parameters such as the crawl frequency of a theme, and control the state of the crawler. The central node and the distributed crawlers are the main body of the system and carry out theme operations, the learning of data extractors, page analysis, and the storage of target pages.
2. Central node server
The central master node of the crawler is the control hub. It mainly comprises a URL controller, an extractor module, and a theme control module, as shown in Fig. 2. The functions of the three modules are as follows:
(1) Theme control module
The theme control module carries out the operations on themes: describing, adding, and deleting themes, controlling the theme crawl frequency, and editing the seed queue of each theme. The seed queue selects the authoritative pages of the corresponding theme, that is, representative pages in the theme that can serve as the starting position for a series of target information. For a hotel search theme crawler, for example, the authoritative pages are the pages of a room-booking site that contain a query form, or the start pages of its hotel information lists. A general-purpose search engine is first used to search the descriptive text of the theme, which yields an expanded page set for the theme. Because the number of pages is limited, the seed queue of authoritative pages is then obtained by manual screening.
(2) Extractor module
A content-based web page analysis algorithm is used: starting from the URL seeds, a data extractor is trained for the authoritative website that each seed represents. The seeds that meet the requirements of the previous module fall mainly into two classes, Deep Web pages and data-intensive pages, and a basic memory-based classifier can distinguish the two. For Deep Web pages, an improved domain-specific query probing method based on examples is used, which matches a suitable and complete interface input against a travel-domain dictionary. For the structured features of data-intensive pages, a strategy of page block and directory discovery is used to extract the URLs of the lower-level pages. Through this process the data extractor applicable to a URL seed (analysis algorithm path and search depth) is found, and during crawling on the child node this model guides the parsing of pages from the target site that the seed represents.
(3) URL controller
The URL controller is mainly responsible for ordering the URL queue in the central node and for partitioning tasks according to the load feedback of each child node. Because a two-level URL allocation strategy is used, the central node server stores only seed URLs. The sorting algorithm determines priority from the theme crawl frequency and the weight of the website that the seed represents, and computes the concurrency needed per unit time. Task partitioning uses the weighted least-connection scheduling method.
The central node server works as follows:
(1) The theme control module receives, through the management interface, the data sent by the management portal (theme description data, theme addition and deletion data, and data controlling the theme crawl frequency). This module carries out the operations on themes: describing, adding, and deleting themes, controlling the theme crawl frequency, and editing the seed queue of each theme. It sends the theme seed queue to the extractor module and the URL controller. (Note: a seed queue is a URL address queue, that is, a group of URL addresses. The URL addresses in a seed queue are special, however, in that they are essentially the home page URL of a target website to be collected, or the home page URL of a column or category to be collected.)
(2) After receiving the theme seed queue, the extractor module first uses the basic classifier to classify the web pages represented by the URL addresses in the seed queue into Deep Web pages and data-intensive pages. (Classification first requires visiting the page that each URL address points to and collecting its content, after which the classification itself is carried out.) It then performs extraction analysis on each of the two page types (for details see the description of the extractor module above) and finds the data extractor corresponding to each type. It records the correspondence between each URL address and its data extractor and sends this record to the URL controller.
(3) The URL controller receives the seed queue sent by the theme control module and the records of URL addresses and corresponding extractors sent by the extractor module, and merges the two. Each URL address is associated with its data extractor, and a URL address that has no corresponding extractor is associated with the general extractor. All URL addresses are arranged into a queue, and tasks are sent to the distributed child nodes of the crawler by the task partitioning method.
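The merge performed by the URL controller in step (3) can be sketched as below. This is a minimal illustration, not the patented implementation: the names `Task`, `build_task_queue`, and `GENERAL_EXTRACTOR`, the extractor labels, and the example URLs are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical identifier for the fallback ("general") extractor.
GENERAL_EXTRACTOR = "general"


@dataclass(frozen=True)
class Task:
    url: str
    extractor: str


def build_task_queue(seed_queue, extractor_records):
    """Pair every seed URL with its extractor, falling back to the general one."""
    tasks = []
    for url in seed_queue:
        extractor = extractor_records.get(url, GENERAL_EXTRACTOR)
        tasks.append(Task(url=url, extractor=extractor))
    return tasks


# Seed queue from the theme control module (illustrative URLs).
seeds = [
    "http://hotel.example/search",
    "http://tickets.example/list",
    "http://other.example/",
]
# URL-to-extractor records from the extractor module (illustrative labels).
records = {
    "http://hotel.example/search": "deep_web_form",
    "http://tickets.example/list": "data_intensive_block",
}
queue = build_task_queue(seeds, records)
for task in queue:
    print(task.url, "->", task.extractor)
```

The third seed has no record, so it is paired with the general extractor, which mirrors the fallback rule described in step (3).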
3. Distributed child node server
As shown in Fig. 3, the distributed child node server is the executor of the crawl. It mainly comprises a child node URL controller, a data extractor, a search controller, and a web page grabber.
The distributed child node of the crawler works as follows:
(1) The child node URL controller receives the seed URLs and the corresponding data extractor information sent by the central node server. After receiving the URLs it first deduplicates the URL addresses, then arranges the URL addresses that have not already been collected into a queue. (Internet crawlers have many URL deduplication methods. Because they are not the focus here they are not described in detail, and any general URL deduplication method can be used.) It sends the queued URL addresses and their data extractor information to the data extractor module and the web page grabber module.
(2) The data extractor analyzes the Deep Web pages from the child node URL queue and extracts URLs from each page to form new URLs, each equivalent to the object produced by submitting the form, and passes them to the web page grabber.
(3) The web page grabber module receives the URL addresses sent in the previous two steps and crawls the web pages.
(4) The web page grabber provides the crawled pages to the search controller.
(5) The search controller receives each collected page and analyzes it. A page that meets the requirements is saved to the page pool; otherwise the page is passed to the data extractor.
(6) After receiving a page sent by the search controller, the data extractor uses the extractor that corresponds to the page's URL address to extract the content and the URL addresses in the page, and then sends the URL addresses to be collected to the URL database.
Each module is described in more detail below.
(1) Child node URL controller
The child node URL controller receives the seed URLs distributed by the central node and the URLs extracted from web pages and stores them in the URL database. Storage uses a Trie-type data structure, which allows duplicate detection and fast insertion of newly added URLs. The URL database uses an update strategy that works per website, which ensures that content updates are not blocked by duplicate detection. URLs that pass the validity check are ordered by the two-level URL weight propagation sorting algorithm, which takes the weight passed down from the parent page, combines it with the depth in the search strategy to rank priority, and passes the URLs to the web page grabber.
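The source names a Trie-type structure for URL storage but gives no details, so the following is only a sketch of how a character-level trie can combine insertion and duplicate detection in one pass. The class name `UrlTrie`, its methods, and the example URLs are assumptions made for illustration.

```python
_END = ""  # end-of-URL marker key; cannot collide with one-character keys


class UrlTrie:
    """Character-level trie that stores URLs and reports duplicates on insert."""

    def __init__(self):
        self._root = {}
        self._count = 0

    def add(self, url):
        """Insert url; return True if it is new, False if it was already stored."""
        node = self._root
        for ch in url:
            node = node.setdefault(ch, {})
        if _END in node:
            return False
        node[_END] = True
        self._count += 1
        return True

    def __contains__(self, url):
        node = self._root
        for ch in url:
            node = node.get(ch)
            if node is None:
                return False
        return _END in node

    def __len__(self):
        return self._count


trie = UrlTrie()
first = trie.add("http://hotel.example/list?page=1")
second = trie.add("http://hotel.example/list?page=1")  # duplicate
third = trie.add("http://hotel.example/list?page=2")   # shares a long prefix
print(first, second, third, len(trie))
```

URLs from the same site share long prefixes, so a trie stores those prefixes once, and both lookup and insertion cost time proportional to the URL length rather than to the number of stored URLs.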
(2) Data extractor
For Deep Web objects from the URL queue, the query probing algorithm trained by the central node is applied: pattern-matched input of concrete parameters forms a new URL, equivalent to the object produced by submitting the form, which is passed to the web page grabber. The other input of this module is a page that has been judged by the search controller. The URL search strategy guarantees that such a page is a data-intensive page, and the trained page block discovery algorithm extracts the two classes of URL of interest, namely pagination links and links to lower-level data pages, which are sent to the URL database.
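Forming a new URL that is "equivalent to the object produced by submitting the form" can be illustrated as follows. The sketch assumes the query form uses the HTTP GET method, which the source does not state, and the function name `build_query_url`, the parameter names, and the example site are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_query_url(form_action, params):
    """Return the GET URL that submitting the form with params would request."""
    parts = urlsplit(form_action)
    # Sort parameters so the same query always yields the same URL,
    # which keeps later duplicate detection simple.
    query = urlencode(sorted(params.items()))
    if parts.query:
        query = parts.query + "&" + query
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


url = build_query_url(
    "http://hotel.example/search",
    {"city": "Beijing", "checkin": "2013-07-01"},
)
print(url)
```

In a real extractor the parameter values would come from the trained query probing step, for example by matching form fields against a domain dictionary.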
(3) Web page grabber
This is a multi-threaded parallel module responsible for collecting pages over the HTTP protocol. The basic steps are:
a. Extract the address and port number of the target site from the page URL and establish a network connection to that address and port.
b. Assemble the HTTP request header from the page URL and send it to the target site. If no response message is received within a set time, stop crawling this page and discard it; otherwise continue to the next step.
c. Analyze the response message. If the returned status code is 2xx, the correct page has been returned and the process goes to the next step. If the status code is 301 or 302, the page has been redirected: extract the new target URL from the response header and return to the previous step. Any other status code indicates that the page connection failed: stop crawling this page and discard it.
d. Extract page information such as the date, length, and page type from the response header.
e. Read the content of the page. For longer pages, read the content in blocks and splice the blocks together to guarantee the integrity of the page content.
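The status-code handling in step c can be sketched as a pure decision function, kept separate from the network code so that it is easy to check. The function name `next_action` and the action labels are assumptions, and a redirect without a `Location` header is treated as a failure, a case the source does not cover.

```python
from urllib.parse import urljoin


def next_action(url, status, headers):
    """Map a response to ("save", url), ("redirect", new_url) or ("discard", url)."""
    if 200 <= status < 300:
        return ("save", url)
    if status in (301, 302):
        location = headers.get("Location")
        if location:
            # Location may be relative, so resolve it against the request URL.
            return ("redirect", urljoin(url, location))
    # Other status codes, or a redirect with no target: give up on this page.
    return ("discard", url)


print(next_action("http://a.example/x", 200, {}))
print(next_action("http://a.example/x", 302, {"Location": "/y"}))
print(next_action("http://a.example/x", 404, {}))
```

A caller would loop on the "redirect" action, ideally with a cap on the number of hops to avoid redirect cycles; the source does not mention such a cap.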
(4) Search controller
The search strategy of the invention is a best-first search strategy improved for the concrete application. Analysis of Deep Web pages and directory-block pages shows that most target objects are text-type pages and that the crawl depth does not exceed 3 levels. This module judges the crawled page content according to the search strategy: a text-type page that satisfies the search depth is stored in the page pool to await structuring by the index module; otherwise the page is passed to the corresponding data extractor for page analysis and URL extraction.
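The routing decision of the search controller can be sketched in a few lines. How the system recognizes a text-type page is not described in the source, so the `page_type` label below is an assumed input, and the names `route_page` and `MAX_DEPTH` are hypothetical.

```python
MAX_DEPTH = 3  # the source states that crawl depth does not exceed 3 levels


def route_page(page_type, depth):
    """Return "page_pool" for text-type pages within the depth limit, else "extractor"."""
    if page_type == "text" and depth <= MAX_DEPTH:
        return "page_pool"
    return "extractor"


print(route_page("text", 3))            # target page within depth
print(route_page("data_intensive", 2))  # directory-block page
print(route_page("deep_web", 1))        # seed page with a query form
```

Only the first call returns "page_pool"; the other two pages go back to the data extractor for page analysis and URL extraction.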
The task partitioning algorithm is as follows:
In a distributed crawler system, balanced distribution of crawl tasks is one of the key issues affecting system performance and resource allocation. Current distributed crawler systems use either centralized task partitioning or partitioning based on two-level hash mapping. Both strategies only solve the problem of even distribution and consider neither URL priority nor child node load. The task partitioning strategy of a theme crawler should take into account both the ordering of the URL queue and balanced scheduling based on child node load. For the architecture of this system, the task partitioning algorithm comprises the two-level URL weight propagation sorting algorithm and a hash-based weighted least-connection URL scheduling method.
A two-level weight propagation sorting algorithm is designed for the URL queues in the central node and the child nodes. At the central node level, the URL queue consists mainly of URL seeds of different themes, and the seed attributes that affect crawl quality are website importance, crawl frequency, and search depth. A seed is the authoritative page that embodies the theme on its website, so its PageRank can be mapped to the influence of the website in the field of the theme. PageRank is evaluated using the PageRank algorithm based on network topology as the standard. The page PR values provided by Google are integers with a theoretical range of 0 to 10, but statistics show that the PR value of most pages is below 7. For uniform normalization, therefore, the website importance influence factor is computed with a linear function: $\alpha_i$ is the ratio of the difference between the PR of the seed and the minimum PR in the URL queue to the difference between the maximum and minimum PR:
$$\alpha_i = \frac{PR_i - \mathrm{low}(PR)}{\mathrm{top}(PR) - \mathrm{low}(PR)}$$
Search depth is the level that the best-first strategy assigns to a page, and there are 3 levels in total: a seed with a Hidden Web form has depth 1, a data-intensive page with a directory block structure has depth 2, and a text-type target page has depth 3. The search depth influence factor $\beta_i$ is the reciprocal of the page's own depth $L_i$:
$$\beta_i = \frac{1}{L_i}$$
The crawl frequency influence factor corresponds to the update interval that the administrator sets according to the needs of the search front end and the update strategy: the shorter the update interval, the higher the crawl frequency and the higher the seed priority. Crawl frequency is divided into seed frequency and theme frequency. The theme frequency, set according to the nature of the theme, is mandatory; if a seed frequency is not set, it inherits the theme frequency. Sample crawl intervals differ widely and jump sharply. For example, the ticket transfer module has a high timeliness requirement, so its theme interval is set to 15 min, whereas hotel booking prices change little, so its interval is set in days. If plain linear normalization were used, one value of the frequency influence factor would dominate and the rest would decay rapidly, making this factor the deciding factor of the ordering, which is clearly unreasonable. After study and comparison, the result is first obtained with the linear normalization function, then weighted, and finally smoothed by the Sigmoid function. The Sigmoid function is smooth and strictly monotonic and here takes values between 0.5 and 1. The calculation is as follows:
$$x_i = \frac{F_i - \mathrm{low}(F)}{\mathrm{top}(F) - \mathrm{low}(F)}$$
where $F_i$ is the crawl frequency of the seed, $\mathrm{low}(F)$ and $\mathrm{top}(F)$ are the minimum and maximum frequencies in the queue respectively, $x_i$ is the normalized crawl frequency, and $\gamma_i$ is the crawl frequency influence factor:
$$\gamma_i = \frac{1}{1 + e^{-a x_i}}$$
Here $a$ is a weighting factor greater than 1 applied to the linearly normalized result, whose purpose is to amplify the result of the first step. Judging from the Sigmoid function curve, $a$ is set to 2.5 in the system. The priority weight of a seed is then the arithmetic mean of the three influence factors:
$$Q_i = \frac{\alpha_i + \beta_i + \gamma_i}{3}$$
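The three factors and the resulting seed priority can be computed as in the sketch below. It follows the formulas above with $a = 2.5$; the `Seed` record, the function names, the choice of crawls per day as the frequency unit, and the rule of returning 0 when all values in the queue are equal (which the formulas leave undefined) are assumptions made for illustration.

```python
import math
from dataclasses import dataclass

A = 2.5  # Sigmoid weighting factor a, as chosen in the source


@dataclass
class Seed:
    url: str
    pagerank: float   # PR value of the seed page
    depth: int        # search depth L_i (1, 2 or 3)
    frequency: float  # crawl frequency F_i (assumed unit: crawls per day)


def _normalize(value, low, top):
    """Linear min-max normalization; 0.0 when all values are equal (assumption)."""
    return 0.0 if top == low else (value - low) / (top - low)


def seed_priorities(seeds):
    """Return {url: Q_i} with Q_i = (alpha_i + beta_i + gamma_i) / 3."""
    prs = [s.pagerank for s in seeds]
    freqs = [s.frequency for s in seeds]
    result = {}
    for s in seeds:
        alpha = _normalize(s.pagerank, min(prs), max(prs))
        beta = 1.0 / s.depth
        x = _normalize(s.frequency, min(freqs), max(freqs))
        gamma = 1.0 / (1.0 + math.exp(-A * x))
        result[s.url] = (alpha + beta + gamma) / 3.0
    return result


seeds = [
    # Ticket theme: 15 min interval, i.e. 96 crawls per day.
    Seed("http://tickets.example/search", pagerank=6, depth=1, frequency=96),
    # Hotel theme: crawled once per day.
    Seed("http://hotel.example/list", pagerank=4, depth=2, frequency=1),
]
q = seed_priorities(seeds)
for url in sorted(q, key=q.get, reverse=True):  # descending order of Q_i
    print(round(q[url], 4), url)
```

Because the Sigmoid keeps $\gamma_i$ between 0.5 and about 0.924 for $x_i \in [0, 1]$, the low-frequency hotel seed still receives $\gamma_i = 0.5$ rather than 0, which illustrates why the frequency factor alone no longer decides the ordering.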
The seeds are then sorted in descending order of $Q_i$. Because the number of seeds in the central node queue is limited, insertion sort is used, which saves memory while taking about the same time as other sorting algorithms. The URL queue in a child node inherits the URL weighting algorithm of the central node. Analysis shows that the extractor-guided crawler only crawls within the website defined by the seed, so the crawl frequency and website importance factors in the Q value stay constant and only the search depth factor changes. The weight is therefore computed as:
$$Q = Q_{prev} - \frac{\beta_{prev} - \beta}{3}$$
where $Q_{prev}$ is the weight passed down from the parent URL, $\beta_{prev}$ is the search depth factor of the parent URL, and $\beta$ is the search depth factor of the target URL. Because the child node queue holds many URLs, binary insertion sorting is used, trading space for efficiency. Theoretical analysis and practical testing show that the URL weights are distributed evenly and smoothly between 0 and 1. This avoids a single factor becoming decisive because of its sharp decay, and it takes into account the three main factors of target object, crawl strategy, and search depth, so priority differences are well reflected. One special case handled by the algorithm is a request passed in from the search front end: its priority is the highest, its Q value is set to 1, and it is not attenuated as it is passed on.
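Weight propagation in the child node queue can be sketched as below, using the formula $Q = Q_{prev} - (\beta_{prev} - \beta)/3$ with $\beta = 1/L$. The function names, the `user_request` flag for the search front end special case, and the use of Python's `bisect` module to stand in for binary insertion are assumptions; the starting weight 0.9 is an arbitrary example value.

```python
import bisect


def child_weight(q_prev, parent_depth, depth, user_request=False):
    """Weight of a URL found on a parent page with weight q_prev."""
    if user_request:
        return 1.0  # front-end requests get top priority and do not decay
    beta_prev = 1.0 / parent_depth
    beta = 1.0 / depth
    return q_prev - (beta_prev - beta) / 3.0


def enqueue(queue, url, weight):
    """Binary-insert (url, weight) so the queue stays in descending weight order."""
    bisect.insort(queue, (-weight, url))


seed_q = 0.9  # example weight handed down by the central node
list_q = child_weight(seed_q, parent_depth=1, depth=2)
detail_q = child_weight(list_q, parent_depth=2, depth=3)

queue = []
enqueue(queue, "http://hotel.example/detail/1", detail_q)
enqueue(queue, "http://hotel.example/list?page=2", list_q)
enqueue(queue, "http://hotel.example/urgent", child_weight(0.2, 2, 3, user_request=True))
print([(round(-w, 4), url) for w, url in queue])
```

Storing the negated weight lets the ascending order maintained by `bisect.insort` double as a descending-priority queue, so the head of the list is always the next URL to crawl.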
Weighted least-connection scheduling (Weighted Least-Connection Scheduling) algorithm has been adopted in cutting apart of Centroid formation.Each child node represents its handling property with corresponding weights.Default weights are made as 1, and the system manager can dynamically arrange the weights of child node server.Weighted least-connection scheduling makes the built linking number of server and its weights proportional when the new connection of scheduling as far as possible.The allocating task flow process is as follows: suppose to have one group of child node server S=S0, S1 ..., Sn-1}, the weights of W (Si) expression server S i, the current linking number of C (Si) expression child node server S i.The summation of the current linking number of all child node servers be CSUM=Σ C (Si) (i=0,1 ..., n-1).
The current new connection request is sent to child node server Sm, that is, a seed is sent to Sm, if and only if Sm satisfies the following condition:
$$\frac{C(S_m)/CSUM}{W(S_m)} = \min\left\{\frac{C(S_i)/CSUM}{W(S_i)}\right\} \;\Rightarrow\; \frac{C(S_m)}{W(S_m)} = \min\left\{\frac{C(S_i)}{W(S_i)}\right\}, \quad (i = 0, 1, \cdots, n-1)$$
where W(Si) is not 0. Each child node's log is fed back to the central node at regular intervals, and the connection count C(Si) of a child server is obtained by reading that log. The invention compares the ratios of child node connection count to prior weight, selects the child node with the lowest load, and assigns the new crawl task to it.
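The allocation rule above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: `ChildNode`, `pick_node`, and `dispatch` are hypothetical names, and the connection counts are held in memory here, whereas the patent obtains them from the child node logs. Because CSUM is the same for every node, dividing by it does not change which node attains the minimum, so the sketch compares C(Si)/W(Si) directly.

```python
from dataclasses import dataclass


@dataclass
class ChildNode:
    name: str
    weight: float = 1.0   # W(Si); the default weight is 1
    connections: int = 0  # C(Si); read from the node log in the patent


def pick_node(nodes):
    """Return the node with the smallest C(Si) / W(Si).

    Nodes whose weight is 0 are skipped, since W(Si) must not be 0.
    """
    candidates = [n for n in nodes if n.weight > 0]
    if not candidates:
        raise ValueError("no child node with a non-zero weight")
    return min(candidates, key=lambda n: n.connections / n.weight)


def dispatch(nodes, seed_url):
    """Assign seed_url to the least-loaded node and return its name."""
    node = pick_node(nodes)
    node.connections += 1  # a real system would send seed_url to the node
    return node.name


nodes = [
    ChildNode("s0", weight=1.0, connections=2),
    ChildNode("s1", weight=2.0, connections=3),
    ChildNode("s2", weight=0.0, connections=0),
]
first = dispatch(nodes, "http://example.com/seed")
```

In this example `s1` is chosen first, because its ratio 3/2 is lower than the ratio 2/1 of `s0`, and `s2` is never chosen because its weight is 0.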
The parts of the present invention not described in detail belong to techniques well known in the art.
The above is a further detailed description of the present invention in connection with specific preferred embodiments, and the specific embodiments of the invention shall not be regarded as limited to this description. A person of ordinary skill in the art to which the invention belongs may make simple deductions or substitutions without departing from the inventive concept, and all of these shall be regarded as falling within the scope of patent protection determined by the submitted claims.

Claims (2)

1. A distributed web crawler system, characterized by comprising: a management portal, a central node server, and distributed child node servers; the management portal is the Web interface that the crawler system provides to the administrator, through which the administrator can view the logs of the central node server and the distributed child node servers, set up and add themes, update the URL seeds of a theme, configure a theme's crawl frequency parameter, and control the state of the crawler; the central node server and the distributed child node servers, acting as the crawler, are the main body of the system and carry out theme operations, the learning of data extractors, page analysis, and the storage of target pages;
(1) the central node server comprises a URL controller, an extractor module, and a theme control module;
the theme control module receives the data sent by the management portal through the management interface, including theme description data, add and delete operation data, and data controlling the theme crawl frequency, and completes the operations on themes, including describing, adding, and deleting themes; controls the theme crawl frequency; and edits each theme seed queue and sends the theme seed queues to the extractor module and the URL controller;
the extractor module, after receiving a theme seed queue, first uses a basic analyzer to classify the web pages represented by the URL addresses in the seed queue into Deep Web pages and data-intensive pages, then performs extraction on the two kinds of pages separately to find the data extractor corresponding to each type, records the correspondence between each URL address and its data extractor, and sends those records to the URL controller;
the URL controller receives the seed queue sent by the theme control module and the records of URL addresses and corresponding data extractors sent by the extractor module, integrates these two sets of data, and maps each URL address to the corresponding data extractor in the distributed child node servers; a URL address with no corresponding data extractor is mapped to the general extractor; all URL addresses are arranged into a queue, and tasks are sent to each distributed crawler child node by the task division method;
(2) each distributed child node server comprises a child node URL controller, a data extractor, a search controller, and a web page grabber;
the child node URL controller receives the seed URLs and corresponding data extractor information sent by the central node server; after receiving the URLs it first performs URL duplicate checking, then arranges the URL addresses that have not already been collected into a queue, and sends the URL addresses in the queue and the corresponding data extractor information to the data extractor and the web page grabber;
the data extractor performs page analysis on Deep Web pages from the child node URL queue and extracts the URLs in a page to form new URLs, which are equivalent to the objects obtained after form submission, and passes them to the web page grabber; after receiving a page sent by the search controller, it uses the extractor corresponding to the page's URL address to extract the content and the URL addresses in the page, then sends the URL addresses to the library of URL addresses to be collected;
the web page grabber receives the URL addresses sent by the child node URL controller and the data extractor, fetches the web pages, and provides the fetched pages to the search controller;
the search controller analyzes each collected page it receives; a page that meets the requirements is saved to the page library, and otherwise the page is passed to the data extractor.
2. The distributed web crawler system according to claim 1, characterized in that the task division method is implemented as follows:
(1) calculate the ratio α_i of the difference between the seed's PR and the minimum PR in the URL queue to the difference between the maximum and minimum PR:
$$\alpha_i = \frac{PR_i - \mathrm{low}(PR)}{\mathrm{top}(PR) - \mathrm{low}(PR)}$$
where PR is the page rank, i = 0, 1, ..., n-1, and low(PR) and top(PR) are the minimum and maximum PR values, respectively;
(2) calculate the weight of the search depth: the depth of target pages under context guidance is 3, and the search depth influence factor β_i is the reciprocal of the URL's own depth L_i:
$$\beta_i = \frac{1}{L_i}$$
(3) calculate the crawl frequency using the Sigmoid function; the crawl frequency x_i is calculated as follows:
$$x_i = \frac{F_i - \mathrm{low}(F)}{\mathrm{top}(F) - \mathrm{low}(F)}$$
where F_i is the crawl frequency of the seed, and low(F) and top(F) are the minimum and maximum frequencies in the queue, respectively;
the crawl frequency influence factor γ_i is calculated as:
$$\gamma_i = \frac{1}{1 + e^{-a x_i}}$$
where a is a weighting factor greater than 1 applied to the linearly smoothed result, whose purpose is to amplify the result of the first calculation step;
(4) as judged from the Sigmoid function curve, the priority weight of a seed is the arithmetic mean Q_i of the three influence factors:
$$Q_i = \frac{\alpha_i + \beta_i + \gamma_i}{3}$$
(5) sort in descending order of Q_i; in Q, the two factors of crawl frequency and website importance are constant, so the value changes only with the search depth factor and is calculated as follows:
$$Q = Q_{prev} - \frac{\beta_{prev} - \beta}{3}$$
where Q_prev is the weight passed down from the parent URL, β_prev is the search depth factor of the parent URL, and β is the search depth factor of the target URL;
(6) new tasks are distributed as follows: suppose there is a set of child node servers S = {S0, S1, ..., Sn-1}, where W(Si) denotes the weight of child node server Si and C(Si) denotes the current number of connections of child node server Si; the total number of current connections across all child node servers is CSUM = ΣC(Si), i = 0, 1, ..., n-1;
the current new connection request is sent to child node server Sm, that is, a seed is sent to Sm, if and only if Sm satisfies the following condition:
$$\frac{C(S_m)/CSUM}{W(S_m)} = \min\left\{\frac{C(S_i)/CSUM}{W(S_i)}\right\} \;\Rightarrow\; \frac{C(S_m)}{W(S_m)} = \min\left\{\frac{C(S_i)}{W(S_i)}\right\}$$
where W(Si) is not 0; each child node's log is fed back to the central node server at regular intervals, and the child node server connection count C(Si) is obtained by reading that log; the ratios of connection count to prior weight of the distributed child node servers are compared to find the child node with the lowest load, to which the new crawl task is assigned.
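As an illustration of steps (1) to (5) of the task division method, the seed priority Q_i can be computed as sketched below. The names `min_max`, `seed_priority`, and `rank_seeds`, the seed tuple layout, the sample values, and the choice a = 2.0 are assumptions made for this sketch; the claim only requires a to be greater than 1. The built-in `sorted` is used here in place of the insertion sort mentioned in the description.

```python
import math


def min_max(value, low, top):
    """Linear scaling (value - low) / (top - low); 0 if top equals low."""
    if top == low:
        return 0.0
    return (value - low) / (top - low)


def seed_priority(pr, depth, freq, pr_range, freq_range, a=2.0):
    """Arithmetic mean Q_i of the three influence factors."""
    alpha = min_max(pr, *pr_range)           # website importance factor
    beta = 1.0 / depth                       # search depth factor 1 / L_i
    x = min_max(freq, *freq_range)           # scaled crawl frequency
    gamma = 1.0 / (1.0 + math.exp(-a * x))   # Sigmoid of a * x_i
    return (alpha + beta + gamma) / 3.0


def rank_seeds(seeds, a=2.0):
    """Rank seeds given as (url, pr, depth, freq) tuples.

    Returns a list of (url, Q_i) pairs in descending order of Q_i.
    """
    prs = [s[1] for s in seeds]
    freqs = [s[3] for s in seeds]
    pr_range = (min(prs), max(prs))
    freq_range = (min(freqs), max(freqs))
    scored = [
        (url, seed_priority(pr, depth, freq, pr_range, freq_range, a))
        for url, pr, depth, freq in seeds
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranked = rank_seeds([
    ("http://example.com/u1", 5, 1, 10),
    ("http://example.com/u2", 3, 2, 4),
    ("http://example.com/u3", 1, 3, 1),
])
```

In this sample the seed with the highest PR, the shallowest depth, and the highest crawl frequency ranks first, and the seed with the lowest values in all three ranks last.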
CN201310274951.3A 2013-07-02 2013-07-02 A kind of distributed network crawler system Active CN103310012B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310274951.3A CN103310012B (en) 2013-07-02 2013-07-02 A kind of distributed network crawler system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310274951.3A CN103310012B (en) 2013-07-02 2013-07-02 A kind of distributed network crawler system

Publications (2)

Publication Number Publication Date
CN103310012A true CN103310012A (en) 2013-09-18
CN103310012B CN103310012B (en) 2016-09-28

Family

ID=49135230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310274951.3A Active CN103310012B (en) 2013-07-02 2013-07-02 A kind of distributed network crawler system

Country Status (1)

Country Link
CN (1) CN103310012B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080243812A1 (en) * 2007-03-30 2008-10-02 Microsoft Corporation Ranking method using hyperlinks in blogs
CN101561814A (en) * 2009-05-08 2009-10-21 华中科技大学 Topic crawler system based on social labels
CN101630327A (en) * 2009-08-14 2010-01-20 昆明理工大学 Design method of theme network crawler system
CN102646129A (en) * 2012-03-09 2012-08-22 武汉大学 Topic-relative distributed web crawler system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
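Zhan Hengfei et al.: "Research and Optimization of the Nutch Distributed Web Crawler", Journal of Frontiers of Computer Science and Technology *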

Also Published As

Publication number Publication date
CN103310012B (en) 2016-09-28

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20220624

Address after: Room 2101, block D, Zhizhen building, No. 7, Zhichun Road, Haidian District, Beijing 100191

Patentee after: HUIKE EDUCATION TECHNOLOGY GROUP Co.,Ltd.

Address before: 100191 No. 37, Haidian District, Beijing, Xueyuan Road

Patentee before: BEIHANG University