CN110188258B - Method and device for acquiring external data by using crawler - Google Patents
- Publication number
- CN110188258B
- Authority
- CN
- China
- Prior art keywords
- crawler
- data
- page
- result data
- program
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiment of the invention provides a method and a device for acquiring external data by using a crawler. In one aspect, the method comprises: acquiring a data acquisition instruction according to a trigger condition; invoking a crawler program according to the data acquisition instruction; receiving a crawler page grabbed by the crawler program; and parsing the crawler page to obtain result data, and storing the result data into a mysql database. The invention solves the technical problem in the prior art that a crawler program cannot be invoked automatically to acquire data, improves the efficiency of capturing data with a crawler, and reduces manual operation.
Description
[ Field of technology ]
The present invention relates to the field of computers, and in particular, to a method and apparatus for obtaining external data using a crawler.
[ Background Art ]
In the prior art, a crawler is a program or script that automatically captures web information according to certain rules. It is currently the most common and most important means by which companies acquire external data, and it can usefully supplement business data.
Many crawler technologies exist in the prior art, but each serves a single narrow function, and support for crawler automation and for persistence of crawler data is relatively lacking. After a crawler acquires data, users must further screen and process it, so efficiency is low, and a large amount of manpower is consumed when the technology is applied to building large-scale databases and to periodic tasks.
In view of the above problems in the related art, no effective solution has been found yet.
[ Invention ]
In view of this, the embodiment of the invention provides a method and a device for acquiring external data by using a crawler.
In one aspect, an embodiment of the present invention provides a method for acquiring external data using a crawler, the method including: acquiring a data acquisition instruction according to the triggering condition; invoking a crawler program according to the data acquisition instruction; receiving a crawler page grabbed by the crawler program; and analyzing the crawler page to obtain result data, and storing the result data into a mysql database.
Optionally, invoking the crawler according to the data acquisition instruction includes: converting the data acquisition instruction into a crawler task; determining a difficulty coefficient of the crawler task; and determining the number of the crawler programs and the crawler request mode of the crawler programs according to the difficulty coefficient.
Optionally, determining the difficulty coefficient of the crawler task includes: determining the difficulty coefficient of the crawler task according to at least one of the following: the number of data sources, the size of the data distribution area, the complexity of the link address.
Optionally, determining the number of the crawler programs and the crawler request mode of the crawler programs according to the difficulty coefficient includes: when the difficulty coefficient is lower than a preset threshold, selecting one crawler program and a crawler request mode of a first type; when the difficulty coefficient is greater than or equal to the preset threshold, selecting a plurality of crawler programs and a corresponding plurality of crawler request modes of a second type; wherein the first type of crawler request mode includes one of the following: directly requesting the Uniform Resource Locator (URL), or requesting through a proxy; and the second type of crawler request mode includes one of the following: a simulated browser request, or a real browser kernel request.
Optionally, invoking the crawler according to the data acquisition instruction includes: converting the data acquisition instruction into a crawler task; invoking a plurality of crawler nodes in the distributed network, wherein crawler programs are distributed on each crawler node, and the crawler nodes are arranged in a server of the distributed network; acquiring the processing capacity of each crawler node in the distributed network; and distributing crawler subtasks to each crawler node according to the processing capacity of each crawler node, wherein the crawler tasks comprise a plurality of crawler subtasks.
Optionally, when the crawler page is analyzed in a layered manner, analyzing the crawler page to obtain the result data includes: receiving a call request of an upper layer to a current layer; determining a target entity inherited by a target operation object according to metadata carried in the call request, wherein the target operation object is an object to be analyzed in the current layer, and the target entity is data defined by the metadata; and executing analysis operation on the operation object according to the target entity.
Optionally, parsing the crawler page to obtain result data includes: parsing the crawler page to obtain original data corresponding to the crawler page; performing data cleaning and screening on the original data, deleting data packets that contain words from a blacklist word stock, to obtain first result data; and selecting the data packets that contain the keywords from the first result data to obtain second result data.
In another aspect, an embodiment of the present invention provides an apparatus for acquiring external data using a crawler, where the apparatus includes: the acquisition module is used for acquiring a data acquisition instruction according to the triggering condition; the calling module is used for calling a crawler program according to the data acquisition instruction; the receiving module is used for receiving the crawler pages grabbed by the crawler program; and the analysis module is used for analyzing the crawler page to obtain result data and storing the result data into a mysql database.
Optionally, the calling module includes: the conversion unit is used for converting the data acquisition instruction into a crawler task; the first determining unit is used for determining the difficulty coefficient of the crawler task; and the second determining unit is used for determining the number of the crawler programs and the crawler request mode of the crawler programs according to the difficulty coefficient.
Optionally, the first determining unit includes: a determining subunit, configured to determine a difficulty coefficient of the crawler task according to at least one of the following: the number of data sources, the size of the data distribution area, the complexity of the link address.
Optionally, the second determining unit includes: a selecting subunit, configured to select one crawler program and a crawler request mode of a first type when the difficulty coefficient is lower than a preset threshold, and to select a plurality of crawler programs and a corresponding plurality of crawler request modes of a second type when the difficulty coefficient is greater than or equal to the preset threshold; wherein the first type of crawler request mode includes one of the following: directly requesting the Uniform Resource Locator (URL), or requesting through a proxy; and the second type of crawler request mode includes one of the following: a simulated browser request, or a real browser kernel request.
Optionally, the calling module includes: the conversion unit is used for converting the data acquisition instruction into a crawler task; the calling unit is used for calling a plurality of crawler nodes in a distributed network, wherein crawler programs are distributed on each crawler node, and the crawler nodes are arranged in a server of the distributed network; the acquisition unit is used for acquiring the processing capacity of each crawler node in the distributed network; and the distribution unit is used for distributing the crawler subtasks to each crawler node according to the processing capacity of each crawler node, wherein the crawler tasks comprise a plurality of crawler subtasks.
Optionally, when the crawler page is parsed hierarchically, the parsing module includes: the receiving unit is used for receiving a call request of an upper layer to a current layer; the determining unit is used for determining a target entity inherited by a target operation object according to metadata carried in the calling request, wherein the target operation object is an object to be analyzed in the current layer, and the target entity is data defined by the metadata; and the analysis unit is used for executing analysis operation on the operation object according to the target entity.
Optionally, the parsing module includes: the analysis unit is used for parsing the crawler page to obtain original data corresponding to the crawler page; the screening unit is used for performing data cleaning and screening on the original data, deleting data packets that contain words from a blacklist word stock, to obtain first result data; and the selection unit is used for selecting the data packets that contain the keywords from the first result data to obtain second result data.
According to a further embodiment of the invention, there is also provided a storage medium having stored therein a computer program, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
According to a further embodiment of the invention, there is also provided an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
By the method and the device, the automatic scheduling of the crawler task and the automatic storage of the crawler result data can be realized. The technical problem that the crawler program cannot be automatically called to acquire data in the prior art is solved, the efficiency of capturing the data by using the crawler is improved, and manual operation is reduced.
[ Description of the drawings ]
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a block diagram of a hardware architecture of a server for obtaining external data using a crawler according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for obtaining external data using a crawler in accordance with an embodiment of the present invention;
FIG. 3 is an application framework diagram of an embodiment of the present invention;
FIG. 4 is an overall workflow diagram of an embodiment of the present invention including data cleansing;
Fig. 5 is a block diagram of an apparatus for acquiring external data using a crawler according to an embodiment of the present invention.
[ Detailed description ]
The application will be described in detail hereinafter with reference to the drawings in conjunction with embodiments. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
Example 1
The method according to the first embodiment of the present application may be implemented in a mobile terminal, a server, a computer terminal, or a similar computing device. Taking the example of running on a server, fig. 1 is a block diagram of a hardware structure of a server for acquiring external data using a crawler according to an embodiment of the present application. As shown in fig. 1, the server 10 may include one or more (only one is shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a microprocessor MCU or a processing device such as a programmable logic device FPGA) and a memory 104 for storing data, and optionally, a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those skilled in the art that the structure shown in fig. 1 is merely illustrative, and is not intended to limit the structure of the server described above. For example, the server 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to a method for acquiring external data using a crawler in an embodiment of the present invention, and the processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, implement the above-mentioned method. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located with respect to the processor 102, which may be connected to the server 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission means 106 is arranged to receive or transmit data via a network. The specific example of the network described above may include a wireless network provided by a communication provider of the server 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as a NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used to communicate with the internet wirelessly.
In this embodiment, a method for obtaining external data using a crawler is provided, and fig. 2 is a flowchart of a method for obtaining external data using a crawler according to an embodiment of the present invention, as shown in fig. 2, where the flowchart includes the following steps:
Step S202, acquiring a data acquisition instruction according to a trigger condition;
The trigger condition of this embodiment may be an acquisition instruction sent by the user in real time, or may be triggered automatically on a periodic schedule. For example, when the external data is stock market trading volume, the volume data of the broader market is acquired automatically after the market closes on a trading day (for example, at 15:30).
Step S204, invoking a crawler program according to the data acquisition instruction;
Step S206, receiving a crawler page grabbed by the crawler program;
Step S208, parsing the crawler page to obtain result data, and storing the result data into a mysql database (a relational database management system).
By the scheme of the embodiment, the automatic scheduling of the crawler task and the automatic storage of the crawler result data can be realized. The technical problem that the crawler program cannot be automatically called to acquire data in the prior art is solved, the efficiency of capturing the data by using the crawler is improved, and manual operation is reduced.
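Steps S202 to S208 can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `trigger_fired`, `fake_crawler`, `VolumeParser`, `run_once`, the example URL, and the table layout are all hypothetical names; the crawler returns a canned page instead of fetching one; the periodic trigger treats every weekday as a trading day and ignores holidays; and the standard-library `sqlite3` module stands in for the mysql database so that the sketch runs without a database server.

```python
import sqlite3
from datetime import datetime, time
from html.parser import HTMLParser


def trigger_fired(now: datetime, user_request: bool = False) -> bool:
    """S202: fire on a real-time user request, or on a weekday at or after 15:30."""
    periodic = now.weekday() < 5 and now.time() >= time(15, 30)
    return user_request or periodic


def fake_crawler(url: str) -> str:
    """S204/S206: hypothetical crawler that returns a canned page instead of fetching."""
    return "<html><body><td class='volume'>123456</td></body></html>"


class VolumeParser(HTMLParser):
    """S208 (parsing): collect the integer text of every <td class="volume"> cell."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        self.in_cell = tag == "td" and ("class", "volume") in attrs

    def handle_endtag(self, tag):
        self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.values.append(int(data))


def run_once(conn, now, url="https://example.com/volume"):
    """Run the whole flow once; return the number of rows stored."""
    if not trigger_fired(now):
        return 0
    page = fake_crawler(url)
    parser = VolumeParser()
    parser.feed(page)
    # S208 (storage): sqlite3 is a stand-in for the mysql database.
    conn.execute("CREATE TABLE IF NOT EXISTS result (fetched_at TEXT, volume INTEGER)")
    conn.executemany(
        "INSERT INTO result VALUES (?, ?)",
        [(now.isoformat(), v) for v in parser.values],
    )
    conn.commit()
    return len(parser.values)
```

Calling `run_once(sqlite3.connect(":memory:"), datetime.now())` stores a row only when the trigger condition holds, which mirrors the way the scheme ties acquisition to the trigger rather than to a manual call.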
FIG. 3 is a diagram of an application framework of an embodiment of the present invention. As shown in FIG. 3, the functions in the framework are modularized, and the framework includes: apscheduler (the crawler task manager), the spider (the crawler program), and a mysql database. The apscheduler module manages and schedules the crawler program and the mysql database, the crawler program crawls external data according to the task, and the mysql database stores the external data. Specifically, apscheduler manages, schedules, and controls the period of crawler tasks, including setting, suspending, removing, and scheduling tasks; it performs periodic scheduling control of crawler tasks and issues periodic task-trigger reminders according to the trigger conditions set by the user. The tasks are thread tasks, each of which is processed in a background thread. To add a new crawler task later, it is only necessary to inherit the corresponding Task class, implement the specific task method of that class, and add the task to the task manager. The spider is responsible for the concrete implementation of a crawler task and comprises a crawler request module, a crawler page parsing module, and a crawler result data cleaning and sorting module. The mysql database is responsible for persistence of the final crawler result data, that is, for storing the crawler result data.
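The "inherit the Task class, implement its task method, add it to the manager" pattern might look like the sketch below. It deliberately uses only the standard-library `threading` module rather than the apscheduler package, and `Task`, `TaskManager`, `VolumeTask`, and their methods are hypothetical names chosen for illustration; periodic timing is left out so that the example stays short.

```python
import threading


class Task:
    """Base class: a new crawler task subclasses this and implements run()."""

    name = "task"

    def run(self):
        raise NotImplementedError


class TaskManager:
    """Hypothetical stand-in for the task manager: add, pause, resume, remove, trigger."""

    def __init__(self):
        self._tasks = {}
        self._paused = set()

    def add(self, task):
        self._tasks[task.name] = task

    def remove(self, name):
        self._tasks.pop(name, None)
        self._paused.discard(name)

    def pause(self, name):
        self._paused.add(name)

    def resume(self, name):
        self._paused.discard(name)

    def trigger_all(self):
        """Run every active task once, each in its own background thread."""
        threads = [
            threading.Thread(target=task.run, daemon=True)
            for name, task in self._tasks.items()
            if name not in self._paused
        ]
        for thread in threads:
            thread.start()
        return threads


class VolumeTask(Task):
    """Example subclass; a real task would invoke the spider inside run()."""

    name = "volume"

    def __init__(self):
        self.runs = 0

    def run(self):
        self.runs += 1
```

A caller would create `TaskManager()`, register `VolumeTask()` with `add`, and join the threads returned by `trigger_all()`; a paused or removed task is simply skipped on the next trigger.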
In this embodiment, the data acquisition instruction is converted into a crawler task, a crawler request mode is selected according to the difficulty of the crawler task, and the crawler program is invoked to complete the crawler task. Each crawler task needs one or more crawler programs to complete it, and the number of crawler programs can be determined according to the difficulty of the crawler task: crawler difficulty is graded, one crawler program is allocated to a crawler task of the lowest difficulty level, and a plurality of crawler programs are allocated to a crawler task of a high difficulty level. Invoking the crawler program according to the data acquisition instruction in this embodiment includes:
S11, converting the data acquisition instruction into a crawler task;
S12, determining a difficulty coefficient of the crawler task;
Optionally, the difficulty of the crawler task is determined according to the number of data sources (such as the number of web pages and databases), the size of the data distribution area (for example, within a province or abroad), the complexity of the link address, and the like.
S13, determining the number of the crawler programs and the crawler request mode of the crawler programs according to the difficulty coefficient.
In an optional implementation of this embodiment, the difficulty level of the crawler task is determined first, and a corresponding number of crawler programs and corresponding crawler request modes are then allocated according to that level. Determining the number of the crawler programs and the crawler request mode of the crawler programs according to the difficulty coefficient includes: when the difficulty coefficient is lower than a preset threshold, selecting one crawler program and a crawler request mode of a first type; when the difficulty coefficient is greater than or equal to the preset threshold, selecting a plurality of crawler programs and a corresponding plurality of crawler request modes of a second type. The first type of crawler request mode includes one of the following: directly requesting the Uniform Resource Locator (URL), or requesting through a proxy. The second type of crawler request mode includes one of the following: a simulated browser request, or a real browser kernel request.
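The threshold rule of steps S12 and S13 can be illustrated as follows. The text names the factors that feed the difficulty coefficient but gives no formula, so the weighted sum, the weights, the threshold of 10.0, the scaling of the crawler count, and the request-mode labels in `FIRST_TYPE` and `SECOND_TYPE` are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class CrawlerTask:
    data_sources: int      # number of data sources (web pages, databases)
    regions: int           # size of the data distribution area, as a region count
    url_complexity: float  # 0.0 (simple links) to 1.0 (highly complex links)


def difficulty(task: CrawlerTask) -> float:
    """Hypothetical weighted sum; the weights are illustrative only."""
    return 1.0 * task.data_sources + 0.5 * task.regions + 5.0 * task.url_complexity


# Illustrative labels for the two types of request mode described in the text.
FIRST_TYPE = ("direct_url", "proxy")
SECOND_TYPE = ("simulated_browser", "real_browser_kernel")


def plan(task: CrawlerTask, threshold: float = 10.0, max_crawlers: int = 8):
    """Return (number of crawler programs, request mode) for the task."""
    d = difficulty(task)
    if d < threshold:
        # Below the threshold: one crawler program, first-type request mode.
        return 1, FIRST_TYPE[0]
    # Assumed scaling rule: one extra crawler per threshold-sized chunk of difficulty.
    count = min(max_crawlers, 1 + int(d // threshold))
    return count, SECOND_TYPE[0]
```

With these assumed weights, a task with two data sources in one region and simple links stays below the threshold and gets a single direct-URL crawler, while a task with twenty sources across ten regions is assigned several browser-based crawlers.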
When the system implements a specific crawler task, it only needs to invoke the request mode that matches the difficulty of that task. For example, when the difficulty of the crawler task is low, the request mode of directly requesting the URL can be selected to crawl the data. Because different request modes differ in their capacity to acquire data (and the greater the capacity, the greater the resources and cost that must be invoked), the crawler request mode should match the difficulty of the crawler task so that resources are allocated reasonably. Because the difficulty of the crawler task corresponds to both the number of crawler programs and the crawler request mode of the crawler programs, either can be allocated according to the difficulty. Other optional implementations of this embodiment therefore include: allocating a corresponding number of crawler programs and a fixed crawler request mode according to the difficulty level; or allocating a fixed number of crawler programs and a corresponding crawler request mode according to the difficulty level. Each crawler program uses the same crawler request mode.
In an application scenario of the present embodiment, when applied to a distributed network, invoking a crawler according to the data acquisition instruction includes: converting the data acquisition instruction into a crawler task; invoking a plurality of crawler nodes in the distributed network, wherein crawler programs are distributed on each crawler node, and the crawler nodes are arranged in a server of the distributed network; acquiring the processing capacity of each crawler node in the distributed network; and distributing crawler subtasks to each crawler node according to the processing capacity of each crawler node, wherein the crawler tasks comprise a plurality of crawler subtasks.
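One way to distribute crawler subtasks according to node processing capacity is proportional allocation, sketched below. The text does not say how capacity is measured or how the split is computed, so the numeric capacity scores, the function name `allocate`, and the use of the largest-remainder method are assumptions.

```python
def allocate(subtasks, capacities):
    """Split subtasks across nodes in proportion to each node's capacity.

    capacities maps node name -> positive capacity score (an assumed measure).
    The largest-remainder method ensures every subtask is assigned exactly once.
    """
    total = sum(capacities.values())
    n = len(subtasks)
    quotas = {node: n * cap / total for node, cap in capacities.items()}
    counts = {node: int(quota) for node, quota in quotas.items()}
    leftover = n - sum(counts.values())
    # Hand the remaining subtasks to the nodes with the largest fractional parts.
    by_fraction = sorted(quotas, key=lambda k: quotas[k] - counts[k], reverse=True)
    for node in by_fraction[:leftover]:
        counts[node] += 1
    plan, start = {}, 0
    for node, count in counts.items():
        plan[node] = subtasks[start:start + count]
        start += count
    return plan
```

For example, ten subtasks split over nodes with capacities 3, 1, and 1 give the first node six subtasks and the other two nodes two each.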
In one example, priorities of the crawler programs can be introduced for invocation. The crawler programs are distributed on crawler nodes, with a plurality of crawler programs of different priorities on each crawler node, and the crawler nodes are arranged in servers of the distributed network. The processing capacity of each crawler node in the distributed network is acquired, and the crawler tasks are distributed to the crawler nodes according to a preset priority order and the processing capacity of each crawler node, so that each crawler node processes the crawler tasks distributed to it. The maximum access amount of a single crawler program is determined according to the processing capacity of each crawler node. If the amount of crawler tasks allocated to a crawler node is greater than or equal to the maximum access amount, a plurality of crawler programs are used for processing; if it is smaller than the maximum access amount of a single crawler program, the crawler program with the highest priority is used. Of course, the crawler tasks may also be distributed equally among the crawler programs of all nodes. In the distributed network, the crawler nodes are fixed, and the crawler programs on each crawler node are deployed in advance in the servers of the distributed network. Crawler tasks are distributed to each crawler node according to priority and processing capacity, and the crawler programs to use are determined from the tasks distributed to each crawler node and the maximum access amount of each crawler program on that node.
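The per-node choice between one crawler program and several can be sketched as below. The comparison against the maximum access amount follows the text; everything else is an assumption: the function name `select_crawlers`, the convention that a larger number means higher priority, and the rule for how many crawler programs to use once the limit is reached. How the maximum access amount is derived from node capacity is not specified, so it is passed in as a parameter.

```python
def select_crawlers(task_amount, max_access, crawlers):
    """Pick the crawler programs on one node for the allocated task amount.

    crawlers is a list of (name, priority); a larger priority wins (assumption).
    max_access is the maximum access amount of a single crawler program.
    """
    ranked = sorted(crawlers, key=lambda c: c[1], reverse=True)
    if task_amount < max_access:
        # Below the limit: the highest-priority crawler program handles everything.
        return [ranked[0][0]]
    # Assumed rule: use enough crawler programs that each one's share stays
    # below max_access, capped by the number available on the node.
    needed = task_amount // max_access + 1
    return [name for name, _ in ranked[:needed]]
```

With three crawler programs of priorities 1, 3, and 2 and a maximum access amount of 10, a task amount of 5 goes to the priority-3 program alone, while a task amount of 25 uses all three in priority order.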
In this embodiment, because crawler pages are highly diverse, they cannot all be handled by a single parsing method. Therefore only the top-level parent class of the crawler parsing logic is preset, and the parsing logic for a specific page is implemented by each layer inheriting from and specializing the top-level parent class. The parsing process comprises: when a call request from an upper layer to the current layer is received, determining the target entity inherited by the target operation object according to the metadata carried in the request, and parsing the target operation object according to the target entity.
Parsing a crawler page requires multiple steps and proceeds layer by layer. Each layer is responsible for a different parsing operation and parses different metadata, until all the data of the crawler page has been acquired. The target operation object is the object to be parsed in the current layer, and the target entity is the data described by the metadata, where metadata is data used to define data (descriptive information about data and information resources). For example, given a student information record with the fields name, age, gender (sex), and class, then name, age, sex, and class are the metadata. Each layer is responsible for parsing at least one item of metadata, and the data that the metadata describes (for example, name corresponds to Zhang San or Li Si) is the target entity. After the layers have parsed all the metadata, the result data of the crawler page is obtained.
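One possible reading of the layered scheme is a chain of parser classes that share a preset parent, sketched below with the student-record example. The class names, the dictionary-shaped call request, the `key=value` page format, and the regular expressions are all hypothetical; the text does not specify how a layer locates its target entity.

```python
import re


class ParseLayer:
    """Top-level parent class: shared call protocol; subclasses add page-specific logic."""

    field = None  # the metadata key this layer is responsible for

    def __init__(self, next_layer=None):
        self.next_layer = next_layer

    def extract(self, page):
        raise NotImplementedError

    def handle(self, request):
        """Handle a call from the upper layer.

        request carries the page, the metadata to parse, and the results so far.
        """
        if self.field in request["metadata"]:
            request["result"][self.field] = self.extract(request["page"])
        if self.next_layer is not None:
            self.next_layer.handle(request)
        return request["result"]


class NameLayer(ParseLayer):
    field = "name"

    def extract(self, page):
        return re.search(r"name=(\w+)", page).group(1)


class AgeLayer(ParseLayer):
    field = "age"

    def extract(self, page):
        return int(re.search(r"age=(\d+)", page).group(1))


def parse_student(page, metadata):
    """Run the layers in order and return the combined result data."""
    chain = NameLayer(AgeLayer())
    return chain.handle({"page": page, "metadata": metadata, "result": {}})
```

Supporting a new field or page type then only requires another subclass with its own `field` and `extract`, which matches the intent of presetting only the top-level parent class.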
In one implementation of this embodiment, parsing the crawler page to obtain result data includes: parsing the crawler page to obtain original data corresponding to the crawler page; performing data cleaning and screening on the original data, deleting data packets that contain words from a blacklist word stock, to obtain first result data; and selecting the data packets that contain the keywords from the first result data to obtain second result data. The second result data is taken as the final result data, and the result data of the crawler task is automatically stored in a mysql database to complete persistence of the crawler result. In addition, program logs of the important steps of each crawler task can be stored in mysql, so that each crawler task can be monitored effectively by checking the program logs. FIG. 4 is an overall workflow diagram of an embodiment of the present invention including data cleansing, wherein the various functions are packaged in modules.
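The two-stage screening can be sketched in a few lines. Treating each data packet as a plain string and matching by substring are simplifying assumptions, and `clean` is a hypothetical name.

```python
def clean(raw_items, blacklist, keywords):
    """Return (first_result, second_result) for the two-stage screening."""
    # Stage 1: drop every item that contains any blacklisted word.
    first = [
        item for item in raw_items
        if not any(word in item for word in blacklist)
    ]
    # Stage 2: keep only the items that contain at least one keyword.
    second = [
        item for item in first
        if any(word in item for word in keywords)
    ]
    return first, second
```

For instance, `clean(["volume up", "ad: buy now", "weather"], ["ad:"], ["volume"])` drops the advertisement in the first stage and keeps only `"volume up"` in the second.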
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.
Example 2
In this embodiment, a device for obtaining external data by using a crawler is further provided. The device is used to implement the foregoing embodiments and preferred embodiments, and what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
FIG. 5 is a block diagram of an apparatus for acquiring external data using a crawler according to an embodiment of the present invention. As shown in FIG. 5, the apparatus includes:
an acquisition module 50, configured to acquire a data acquisition instruction according to a trigger condition;
a calling module 52, configured to call a crawler program according to the data acquisition instruction;
a receiving module 54, configured to receive a crawler page grabbed by the crawler program; and
a parsing module 56, configured to parse the crawler page to obtain result data and to store the result data in a MySQL database.
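The four modules listed above can be pictured as a simple pipeline. The following Python sketch is only an illustration under stated assumptions: the function names, the dictionary-shaped instruction, the example URL, and the `fetch` and `store` callables are hypothetical, and `fetch` and `store` stand in for a real crawler program and a real MySQL writer so that the example runs without network or database access.

```python
from typing import Callable, Dict, List


def acquire_instruction(trigger_fired: bool, target_url: str) -> Dict[str, str]:
    """Acquisition module 50: build an instruction once the trigger condition holds."""
    if not trigger_fired:
        raise RuntimeError("trigger condition not met")
    return {"url": target_url}


def call_crawler(instruction: Dict[str, str], fetch: Callable[[str], str]) -> str:
    """Calling module 52 and receiving module 54: run the crawler, return the page."""
    return fetch(instruction["url"])


def parse_page(page: str) -> List[str]:
    """Parsing module 56: turn the page into result records (here, non-empty lines)."""
    return [line.strip() for line in page.splitlines() if line.strip()]


def run_pipeline(trigger_fired: bool, target_url: str,
                 fetch: Callable[[str], str],
                 store: Callable[[List[str]], None]) -> List[str]:
    """Chain the modules: acquire, call, receive, parse, then persist."""
    instruction = acquire_instruction(trigger_fired, target_url)
    page = call_crawler(instruction, fetch)
    results = parse_page(page)
    store(results)  # stand-in for writing to the MySQL database
    return results


# Hypothetical stand-ins: an in-memory "web" and an in-memory "database".
fake_pages = {"https://example.com/news": "headline one\n\nheadline two\n"}
saved: List[str] = []
out = run_pipeline(True, "https://example.com/news", fake_pages.__getitem__, saved.extend)
print(out)
```

Passing the crawler and the storage step in as callables keeps the pipeline independent of any particular crawling library or database driver.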
Optionally, the calling module includes: a conversion unit, configured to convert the data acquisition instruction into a crawler task; a first determining unit, configured to determine a difficulty coefficient of the crawler task; and a second determining unit, configured to determine the number of crawler programs and the crawler request mode of the crawler programs according to the difficulty coefficient.
Optionally, the first determining unit includes: a determining subunit, configured to determine the difficulty coefficient of the crawler task according to at least one of the following: the number of data sources, the size of the data distribution area, and the complexity of the link address.
Optionally, the second determining unit includes: a selecting subunit, configured to select one crawler program and a first type of crawler request mode when the difficulty coefficient is lower than a preset threshold, and to select a plurality of crawler programs and a corresponding plurality of crawler request modes of a second type when the difficulty coefficient is greater than or equal to the preset threshold; wherein the first type of crawler request mode includes one of the following: directly acquiring the Uniform Resource Locator (URL), and using a proxy request; and the second type of crawler request mode includes one of the following: using a simulated browser request, and using a real browser kernel request.
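The threshold rule above can be illustrated with a short Python sketch. The text names the inputs to the difficulty coefficient but gives no formula, so the weights, the threshold value, the cap on the number of crawler programs, and all class, field, and mode names below are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CrawlerTask:
    num_sources: int       # number of data sources
    num_regions: int       # size of the data distribution area (assumed unit: distinct sites)
    url_complexity: float  # assumed scale: 0.0 (static links) to 1.0 (highly dynamic links)


@dataclass
class CrawlPlan:
    crawler_count: int
    request_modes: List[str]


def difficulty(task: CrawlerTask) -> float:
    """Assumed scoring: a weighted sum of the three factors named in the text."""
    return 0.1 * task.num_sources + 0.1 * task.num_regions + task.url_complexity


def plan_for(task: CrawlerTask, threshold: float = 1.0) -> CrawlPlan:
    """Below the threshold: one crawler, first-type mode. Otherwise: several, second-type."""
    if difficulty(task) < threshold:
        return CrawlPlan(1, ["direct_url"])
    count = min(8, 2 + task.num_sources // 5)  # assumed sizing rule, capped at 8
    return CrawlPlan(count, ["simulated_browser", "real_browser_kernel"])


easy = plan_for(CrawlerTask(num_sources=2, num_regions=1, url_complexity=0.2))
hard = plan_for(CrawlerTask(num_sources=10, num_regions=5, url_complexity=0.8))
print(easy)
print(hard)
```

Only the branch structure (one crawler with a lightweight request mode below the threshold, several crawlers with browser-based modes at or above it) comes from the text; everything numeric is a placeholder.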
Optionally, the calling module includes: a conversion unit, configured to convert the data acquisition instruction into a crawler task; a calling unit, configured to call a plurality of crawler nodes in a distributed network, wherein a crawler program is deployed on each crawler node and the crawler nodes are arranged in servers of the distributed network; an acquisition unit, configured to acquire the processing capacity of each crawler node in the distributed network; and a distribution unit, configured to distribute crawler subtasks to each crawler node according to the processing capacity of that node, wherein the crawler task comprises a plurality of crawler subtasks.
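One way to distribute subtasks according to processing capacity is a greedy weighted assignment, sketched below in Python. The text does not say how capacity is measured or which allocation rule is used, so the integer capacity values, the node names, and the "lowest load relative to capacity" rule are assumptions for illustration.

```python
from typing import Dict, List


def distribute(subtasks: List[str], capacity: Dict[str, int]) -> Dict[str, List[str]]:
    """Give each subtask to the node whose load would be lowest relative to its capacity."""
    if not capacity or any(c <= 0 for c in capacity.values()):
        raise ValueError("every node needs a positive capacity")
    assignment: Dict[str, List[str]] = {node: [] for node in capacity}
    for subtask in subtasks:
        # Relative load the node would have after taking this subtask.
        node = min(capacity, key=lambda n: (len(assignment[n]) + 1) / capacity[n])
        assignment[node].append(subtask)
    return assignment


# Hypothetical cluster: node-a is assumed to be three times as capable as node-b.
nodes = {"node-a": 3, "node-b": 1}
allocation = distribute([f"subtask-{i}" for i in range(8)], nodes)
print({node: len(tasks) for node, tasks in allocation.items()})
```

With these example capacities the eight subtasks end up split in the same 3:1 ratio as the capacities.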
Optionally, when the crawler page is parsed hierarchically, the parsing module includes: a receiving unit, configured to receive a call request from an upper layer to the current layer; a determining unit, configured to determine, according to metadata carried in the call request, a target entity inherited by a target operation object, wherein the target operation object is the object to be parsed in the current layer and the target entity is data defined by the metadata; and a parsing unit, configured to perform a parsing operation on the target operation object according to the target entity.
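The hierarchical parsing described above is stated abstractly, so the following Python sketch shows only one possible reading of it. It assumes that the metadata maps each layer to an entity (a list of fields to extract) and to an optional child layer, and that each layer calls the next one down while passing the metadata along. The `METADATA` layout, the layer names, and the dictionary-shaped page are all hypothetical.

```python
from typing import Any, Dict

# Hypothetical metadata: per layer, the entity fields and the child layer (if any).
METADATA: Dict[str, Dict[str, Any]] = {
    "page": {"fields": ["title"], "child": "article", "child_key": "articles"},
    "article": {"fields": ["headline", "author"], "child": None, "child_key": None},
}


def parse_layer(layer: str, obj: Dict[str, Any],
                metadata: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Parse `obj` at `layer` according to the entity the metadata defines for it."""
    entity = metadata[layer]
    # Keep only the fields that the entity for this layer defines.
    parsed: Dict[str, Any] = {field: obj.get(field) for field in entity["fields"]}
    if entity["child"]:
        # The current layer now acts as the upper layer and calls the layer below,
        # carrying the same metadata in the call.
        parsed[entity["child_key"]] = [
            parse_layer(entity["child"], child, metadata)
            for child in obj.get(entity["child_key"], [])
        ]
    return parsed


# Hypothetical pre-extracted page structure, including fields the metadata ignores.
raw_page = {
    "title": "News",
    "ads": "banner",
    "articles": [{"headline": "A", "author": "X", "tracking": 1}],
}
result = parse_layer("page", raw_page, METADATA)
print(result)
```

Driving each layer from metadata means a new page layout can be supported by changing the metadata rather than the parsing code, which is one plausible motivation for the layered design.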
Optionally, the parsing module includes: a parsing unit, configured to parse the crawler page to obtain the original data corresponding to the crawler page; a screening unit, configured to perform data cleaning and screening on the original data, deleting the data packets that contain words from the blacklist lexicon, to obtain first result data; and a selection unit, configured to select the data packets that contain the keywords from the first result data to obtain second result data.
It should be noted that each of the above modules may be implemented in software or hardware. For the latter, this may be achieved in, but is not limited to, the following ways: the modules are all located in the same processor; or the modules are located in different processors in any combination.
Example 3
In the several embodiments provided by the present invention, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.
The integrated unit, when implemented in the form of a software functional unit, may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
An embodiment of the invention also provides a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
Optionally, in this embodiment, the above storage medium may be configured to store a computer program for performing the following steps:
S1, acquiring a data acquisition instruction according to a trigger condition;
S2, invoking a crawler program according to the data acquisition instruction;
S3, receiving a crawler page grabbed by the crawler program;
S4, parsing the crawler page to obtain result data, and storing the result data in a MySQL database.
Optionally, in this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium capable of storing a computer program.
An embodiment of the invention also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the above processor may be configured to execute the following steps by means of a computer program:
S1, acquiring a data acquisition instruction according to a trigger condition;
S2, invoking a crawler program according to the data acquisition instruction;
S3, receiving a crawler page grabbed by the crawler program;
S4, parsing the crawler page to obtain result data, and storing the result data in a MySQL database.
The foregoing describes only preferred embodiments of the invention and is not intended to limit it. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (9)
1. A method of obtaining external data using a crawler, the method comprising:
acquiring a data acquisition instruction according to a trigger condition;
invoking a crawler program according to the data acquisition instruction;
receiving a crawler page grabbed by the crawler program;
parsing the crawler page to obtain result data, and storing the result data in a MySQL database;
wherein, when the crawler page is parsed hierarchically, parsing the crawler page to obtain result data comprises: receiving a call request from an upper layer to a current layer; determining, according to metadata carried in the call request, a target entity inherited by a target operation object, wherein the target operation object is the object to be parsed in the current layer and the target entity is data defined by the metadata; and performing a parsing operation on the target operation object according to the target entity.
2. The method of claim 1, wherein invoking a crawler program according to the data acquisition instruction comprises:
converting the data acquisition instruction into a crawler task;
determining a difficulty coefficient of the crawler task; and
determining the number of crawler programs and the crawler request mode of the crawler programs according to the difficulty coefficient.
3. The method of claim 2, wherein determining a difficulty coefficient of the crawler task comprises:
determining the difficulty coefficient of the crawler task according to at least one of the following: the number of data sources, the size of the data distribution area, and the complexity of the link address.
4. The method of claim 2, wherein determining the number of crawler programs and the crawler request mode of the crawler programs according to the difficulty coefficient comprises:
when the difficulty coefficient is lower than a preset threshold, selecting one crawler program and a first type of crawler request mode; and when the difficulty coefficient is greater than or equal to the preset threshold, selecting a plurality of crawler programs and a corresponding plurality of crawler request modes of a second type;
wherein the first type of crawler request mode includes one of the following: directly acquiring the Uniform Resource Locator (URL), and using a proxy request; and the second type of crawler request mode includes one of the following: using a simulated browser request, and using a real browser kernel request.
5. The method of claim 2, wherein invoking a crawler program according to the data acquisition instruction comprises:
invoking a plurality of crawler nodes in a distributed network, wherein a crawler program is deployed on each crawler node and the crawler nodes are arranged in servers of the distributed network;
acquiring the processing capacity of each crawler node in the distributed network; and
distributing crawler subtasks to each crawler node according to the processing capacity of that node, wherein the crawler task comprises a plurality of crawler subtasks.
6. The method of claim 1, wherein parsing the crawler page to obtain result data comprises:
parsing the crawler page to obtain the original data corresponding to the crawler page;
performing data cleaning and screening on the original data, deleting the data packets that contain words from the blacklist lexicon, to obtain first result data; and
selecting the data packets that contain the keywords from the first result data to obtain second result data.
7. An apparatus for obtaining external data using a crawler, the apparatus comprising:
an acquisition module, configured to acquire a data acquisition instruction according to a trigger condition;
a calling module, configured to call a crawler program according to the data acquisition instruction;
a receiving module, configured to receive a crawler page grabbed by the crawler program; and
a parsing module, configured to parse the crawler page to obtain result data and to store the result data in a MySQL database;
wherein, when the crawler page is parsed hierarchically, parsing the crawler page to obtain result data comprises: receiving a call request from an upper layer to a current layer; determining, according to metadata carried in the call request, a target entity inherited by a target operation object, wherein the target operation object is the object to be parsed in the current layer and the target entity is data defined by the metadata; and performing a parsing operation on the target operation object according to the target entity.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer storage medium having stored thereon a computer program, which when executed by a processor realizes the steps of the method according to any of claims 1 to 6.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910320214.XA CN110188258B (en) | 2019-04-19 | 2019-04-19 | Method and device for acquiring external data by using crawler |
PCT/CN2019/117722 WO2020211351A1 (en) | 2019-04-19 | 2019-11-12 | Method and device for obtaining external data by using crawler |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910320214.XA CN110188258B (en) | 2019-04-19 | 2019-04-19 | Method and device for acquiring external data by using crawler |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110188258A CN110188258A (en) | 2019-08-30 |
CN110188258B CN110188258B (en) | 2024-05-24 |
Family
ID=67714829
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910320214.XA Active CN110188258B (en) | 2019-04-19 | 2019-04-19 | Method and device for acquiring external data by using crawler |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110188258B (en) |
WO (1) | WO2020211351A1 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110188258B (en) * | 2019-04-19 | 2024-05-24 | 平安科技(深圳)有限公司 | Method and device for acquiring external data by using crawler |
CN113076457B (en) * | 2021-04-09 | 2024-08-16 | 航天信息(广东)有限公司 | Crawler action processing method and device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6697801B1 (en) * | 2000-08-31 | 2004-02-24 | Novell, Inc. | Methods of hierarchically parsing and indexing text |
CN107015986A (en) * | 2016-01-27 | 2017-08-04 | 北京国双科技有限公司 | A kind of reptile crawls the method and device of webpage |
CN107895009A (en) * | 2017-11-10 | 2018-04-10 | 北京国信宏数科技有限责任公司 | One kind is based on distributed internet data acquisition method and system |
JP2019053469A (en) * | 2017-09-14 | 2019-04-04 | ヤフー株式会社 | Database creating device, database creating method, and program |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070050445A1 (en) * | 2005-08-31 | 2007-03-01 | Hugh Hyndman | Internet content analysis |
US7693804B2 (en) * | 2005-11-28 | 2010-04-06 | Fatlens Inc. | Method, system and computer program product for identifying primary product objects |
US8229911B2 (en) * | 2008-05-13 | 2012-07-24 | Enpulz, Llc | Network search engine utilizing client browser activity information |
US8131753B2 (en) * | 2008-05-18 | 2012-03-06 | Rybak Ilya | Apparatus and method for accessing and indexing dynamic web pages |
CN101826110B (en) * | 2010-04-13 | 2011-12-21 | 北京大学 | Method for crawling BitTorrent torrent files |
US8799262B2 (en) * | 2011-04-11 | 2014-08-05 | Vistaprint Schweiz Gmbh | Configurable web crawler |
CN102890692A (en) * | 2011-07-22 | 2013-01-23 | 阿里巴巴集团控股有限公司 | Webpage information extraction method and webpage information extraction system |
CN105653599A (en) * | 2015-12-23 | 2016-06-08 | 浪潮软件集团有限公司 | Data acquisition method and device |
AU2017322114B8 (en) * | 2016-09-02 | 2022-09-08 | FutureVault Inc. | Real-time document filtering systems and methods |
CN108021369B (en) * | 2017-12-21 | 2020-10-16 | 马上消费金融股份有限公司 | Data integration processing method and related device |
CN109614539A (en) * | 2019-01-16 | 2019-04-12 | 重庆金融资产交易所有限责任公司 | Data grab method, device and computer readable storage medium |
CN110188258B (en) * | 2019-04-19 | 2024-05-24 | 平安科技(深圳)有限公司 | Method and device for acquiring external data by using crawler |
Also Published As
Publication number | Publication date |
---|---|
WO2020211351A1 (en) | 2020-10-22 |
CN110188258A (en) | 2019-08-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107729139B (en) | Method and device for concurrently acquiring resources | |
CN111552838A (en) | Data processing method and device, computer equipment and storage medium | |
CN110781180B (en) | Data screening method and data screening device | |
CN110188258B (en) | Method and device for acquiring external data by using crawler | |
CN113391901A (en) | RPA robot management method, device, equipment and storage medium | |
CN112162852A (en) | Multi-architecture CPU node management method, device and related components | |
CN113886069A (en) | Resource allocation method and device, electronic equipment and storage medium | |
CN109902028A (en) | Automated testing method, device, equipment and the storage medium of ACL characteristic | |
CN1783121A (en) | Method and system for executing design automation | |
US12035156B2 (en) | Communication method and apparatus for plurality of administrative domains | |
CN109257256A (en) | Apparatus monitoring method, device, computer equipment and storage medium | |
CN116483546B (en) | Distributed training task scheduling method, device, equipment and storage medium | |
CN111026945B (en) | Multi-platform crawler scheduling method, device and storage medium | |
US8042160B1 (en) | Identity management for application access | |
CN111190731A (en) | Cluster task scheduling system based on weight | |
CN115098252A (en) | Resource scheduling method, device and computer readable medium | |
CN114443293A (en) | Deployment system and method for big data platform | |
CN114816735A (en) | System and method for executing data analysis task based on Nacos distributed cluster | |
CN114564249A (en) | Recommendation scheduling engine, recommendation scheduling method, and computer-readable storage medium | |
CN114090201A (en) | Resource scheduling method, device, equipment and storage medium | |
CN113722141A (en) | Method and device for determining delay reason of data task, electronic equipment and medium | |
CN113296913A (en) | Data processing method, device and equipment based on single cluster and storage medium | |
CN103856359A (en) | Method and system for obtaining information | |
CN116954897A (en) | Asynchronous task execution method and device | |
CN116204531A (en) | Capacity expansion method and device of HIVE library, processor and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||