CN110147362A

CN110147362A - Event-driven official document data acquisition and processing system and method

Info

Publication number: CN110147362A
Application number: CN201910271964.2A
Authority: CN
Inventors: 马新凡; 王鹏; 刘福强; 李泽松
Original assignee: Division Big Data Research Institute Co Ltd
Current assignee: Division Big Data Research Institute Co Ltd
Priority date: 2019-04-04
Filing date: 2019-04-04
Publication date: 2019-08-20

Abstract

The invention discloses an event-driven official document data acquisition and processing system and method, belonging to the field of big data technology. The system comprises a data acquisition service module, a data cleaning service module, a data extraction and calculation service module, a data indexing service module, and a log module. In the acquisition and processing method, a distributed crawler, working in a distributed data acquisition mode, crawls official document data published on websites and passes it to the data acquisition service module for processing. The data cleaning service module then cleans the data, the data extraction and calculation service module performs the extraction calculation, and the data indexing service module stores the result in the database. The log module records the entire acquisition and processing flow. By using a distributed crawler to crawl each official document publishing website, the invention effectively addresses the timeliness of acquiring and processing massive multi-source official document data.

Description

Event-driven official document data acquisition and processing system and method

Technical field

The invention belongs to the field of big data technology, and in particular relates to an event-driven official document data acquisition and processing system and method.

Background art

In the 21st century, China has undergone large-scale informatization, and the internet has greatly changed how government information is disclosed. More and more government organizations publish public information online, and acquiring and processing this massive volume of official document data efficiently poses a challenge to existing information system architectures. In recent years the microservice architecture has become increasingly popular: it decouples an originally complex system by splitting it into small units, which frees up business processes, and these small units can be combined with a high degree of fit to deliver complex functions. In data acquisition, internet data currently arrives in large volumes within short periods, and a scheduled-task approach cannot clean and compute official document data in a timely and effective manner. In addition, the large amount of semi-structured text data increases the complexity of cleaning, and how to achieve incremental cleaning and calculation is also an open problem.

In conclusion traditional data acquisition and processing (DAP) mode based on timed task has been unable to meet complicated business scene Under data acquisition and processing (DAP), main problems faced has:

1, multi-source heterogeneous internet crawler data have the characteristics that measure in short-term big, and existing acquisition and tupe can not be fast Speed is acquired and handles to data.

2, data acquisition scenarios are changeable, and each process flow is complicated, can not accomplish the task execution in the case of N-free diet method, lead After causing the error of part process, calculates and cleaning expends system resource；

3, DOC DATA is related to acquiring, cleaning, a series of complex process such as extraction and training, volatile for partial data The characteristics of effect, can not timely update this corresponding operation system latest data.

Summary of the invention

The object of the invention is, in view of the above problems, to provide an event-driven official document data acquisition and processing system and method that addresses the efficiency and automation problems in acquiring and processing official document data.

To achieve the above object, the invention adopts the following technical solution:

An event-driven official document data acquisition and processing system, comprising:

A data acquisition service module, for acquiring and storing multi-source heterogeneous official document data and issuing a cleaning instruction;

A data cleaning service module, for receiving and parsing the cleaning instruction issued by the data acquisition service module, determining whether incremental cleaning or full cleaning is required, and issuing a calculation message body;

A data extraction and calculation service module, for receiving and parsing the calculation message body issued by the data cleaning service, performing the extraction calculation on the data, giving feedback, and sending a data index message body;

A data indexing service module, for receiving and parsing the data index message body issued by the data extraction and calculation service module, and determining whether the data is stored incrementally or in full;

A log module, for recording the whole process by which the above modules acquire and process official document data.

Further, the data acquisition service module acquires official document data using a distributed crawler.

Further, during acquisition the data acquisition module establishes a unique index from the title, URL, and publication time of the official document data.

Further, while acquiring official document data the data acquisition module also generates a temporary table, Crawler TMP, for storing incremental data.

Further, after the cleaning service module completes its work, it generates a temporary table, Clean TMP, for storing the incrementally cleaned data.

Further, after the extraction and calculation service module completes its work, it generates Calculate TMP to store the incremental data.
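The unique index and the Crawler TMP temporary table described above can be sketched as follows. This is a minimal in-memory illustration under stated assumptions, not the patent's implementation: `doc_key`, `CrawlStore`, the hashing scheme, and the field names are all hypothetical, and a real deployment would presumably rely on a unique index in the underlying database instead of a Python dictionary.

```python
import hashlib
from dataclasses import dataclass, field


def doc_key(title: str, url: str, published: str) -> str:
    """Derive one key from title, URL and publication time (hypothetical scheme)."""
    # A separator that cannot appear in normal text keeps the fields distinct.
    raw = "\x1f".join((title, url, published))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


@dataclass
class CrawlStore:
    """In-memory stand-in for the main collection plus the Crawler TMP table."""
    main: dict = field(default_factory=dict)
    crawler_tmp: dict = field(default_factory=dict)

    def save(self, title: str, url: str, published: str, body: str) -> bool:
        """Store a document once; return False if the unique key already exists."""
        key = doc_key(title, url, published)
        if key in self.main:
            return False
        record = {"title": title, "url": url, "published": published, "body": body}
        self.main[key] = record
        self.crawler_tmp[key] = record  # only new (incremental) documents land here
        return True
```

Because only newly accepted documents are copied into `crawler_tmp`, a later cleaning step can process that table alone instead of rescanning the whole collection.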

An event-driven official document data acquisition and processing method, comprising the following steps:

S1. The data acquisition service module uses a distributed crawler, working in a distributed data acquisition mode, to crawl official document data published on websites and stores it in a distributed unstructured database. A unique index is established from the title, URL, and publication time; the URLs already crawled are recorded in a Bloom filter; and the crawled official document data is stored in the temporary table Crawler TMP. After the distributed crawler's periodic crawl event ends, the module builds either a message body naming the database and collection where the incremental data resides, or a full data cleaning message body, and sends it as a cleaning instruction. A Bloom filter can test whether an element belongs to a set, so recording URLs in it avoids the network bandwidth consumed by large-scale repeated crawling.

S2. The data cleaning service module receives the cleaning instruction from step S1, parses it, and determines whether full cleaning or incremental cleaning is required. After the cleaning event completes, it sends the corresponding calculation message body;

S3. The data extraction and calculation service module receives and parses the calculation message body sent in step S2, determines whether the extraction calculation is a full calculation or an incremental calculation, and finally sends a data index message body;

S4. The data indexing service module receives and parses the data index message body sent in step S3, determines whether the data is stored incrementally or in full, and stores it in the Elasticsearch index database;

S5. The sequence of events generated by the process of steps S1-S5 is recorded and stored in the log module in the form of logs.
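The event chain in steps S1 to S5 can be illustrated with a small in-process dispatcher. This is only a sketch under stated assumptions: the source does not name a message broker or a message schema, so the `type` and `mode` keys, the handler names, and the use of a `deque` in place of a distributed message queue are all hypothetical.

```python
import logging
from collections import deque

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")  # stands in for the log module (S5)


def clean(msg):
    """S2: consume a cleaning instruction, emit a calculation message body."""
    return {"type": "calculate", "mode": msg["mode"], "source": msg.get("source")}


def calculate(msg):
    """S3: consume a calculation message body, emit a data index message body."""
    return {"type": "index", "mode": msg["mode"], "source": msg.get("source")}


def index(msg):
    """S4: consume a data index message body; the chain ends here."""
    return None


HANDLERS = {"clean": clean, "calculate": calculate, "index": index}


def run(first_message):
    """Dispatch messages until the chain ends; return the ordered event trail."""
    trail = []
    pending = deque([first_message])
    while pending:
        msg = pending.popleft()
        log.info("event %s (%s)", msg["type"], msg["mode"])
        trail.append((msg["type"], msg["mode"]))
        nxt = HANDLERS[msg["type"]](msg)
        if nxt is not None:
            pending.append(nxt)
    return trail
```

Each service reacts only to the message it receives and knows nothing about the scheduler that started the crawl, which is the property that distinguishes this design from the scheduled-task approach criticized in the background section.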

Further, the data cleaning service module in step S2 processes data in the following specific steps:

(1) Incremental cleaning: parse the cleaning instruction in the database-and-collection message body and trigger the incremental data cleaning service. After incremental cleaning completes, delete the data in the temporary table Crawler TMP in the distributed non-relational database that holds the original distributed crawler data, and reply to the distributed crawler microservice that the message has been consumed. If cleaning fails, do not delete the data in Crawler TMP, and reply to the distributed crawler so that it resends the cleaning message. The incrementally cleaned data is stored in the temporary table Clean TMP, and finally the calculation message body is sent to step S3;

(2) Full cleaning: parse the cleaning instruction in the full data cleaning message body, trigger the full data cleaning service, and finally send the calculation message body to step S3;
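The two cleaning branches can be sketched as one handler. This is a hedged illustration, not the patented implementation: the function name, the `mode` key, the reply strings `"consumed"` and `"resend"`, and the use of plain lists for Crawler TMP and Clean TMP are assumptions, and the reply returned on the full-cleaning branch is not specified in the source.

```python
def handle_clean_instruction(msg, crawler_tmp, clean_tmp, clean_fn, all_docs=None):
    """Process one cleaning instruction.

    Returns (reply_to_crawler, calculation_message_or_None).
    """
    if msg["mode"] == "full":
        # Full cleaning: clean every stored document, then request a full calculation.
        cleaned = [clean_fn(doc) for doc in (all_docs or [])]
        return "consumed", {"type": "calculate", "mode": "full", "count": len(cleaned)}

    try:
        # Build the result list first so a failure leaves both tables untouched.
        cleaned = [clean_fn(doc) for doc in crawler_tmp]
    except Exception:
        # Keep Crawler TMP as it is so the crawler can resend the instruction.
        return "resend", None

    clean_tmp.extend(cleaned)  # incrementally cleaned data goes to Clean TMP
    crawler_tmp.clear()        # temporary crawl data is deleted only after success
    return "consumed", {"type": "calculate", "mode": "incremental", "count": len(cleaned)}
```

Deleting Crawler TMP only after the whole batch has been cleaned is what makes the resend safe: a retried instruction sees exactly the same input as the failed attempt.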

Further, the data extraction and calculation service module in step S3 processes data in the following specific steps:

A. After the extraction and calculation service module receives a full extraction calculation message body, it first consumes the message and triggers the extraction calculation:

If the extracted fields are not involved in the retrieval service, it sends the data index message body directly so that the data is indexed into storage, and at the same time sends a feedback message to the data cleaning service module;

If the extracted fields are involved in the retrieval service, it does not send the data index message body yet; it performs the extraction calculation and, only after the calculation completes, sends the full data index message body to step S4;

B. After the extraction and calculation service module receives an incremental extraction calculation message body, it first parses the message to obtain the database and collection name where the incremental data resides, and triggers the extraction calculation;

If an exception occurs during the extraction calculation, it sends feedback to the data cleaning service so that the extraction calculation message is resent. After the extraction and calculation service completes, it stores the incremental data in Calculate TMP, deletes the Clean TMP table generated in step (1), and finally sends the incremental data index message body to step S4.
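The branching in the extraction and calculation step can be sketched as follows. The names here are hypothetical: `used_by_search` stands for "the extracted fields are involved in the retrieval service", the feedback strings are invented for illustration, and plain lists stand in for the Clean TMP and Calculate TMP tables. Where the source does not specify any feedback, the sketch returns `None`.

```python
def handle_calculation_message(msg, clean_tmp, calculate_tmp, extract_fn,
                               used_by_search=True):
    """Process one calculation message body.

    Returns (feedback_to_cleaning_service_or_None, index_message_or_None).
    """
    if msg["mode"] == "full":
        if not used_by_search:
            # Extracted fields do not affect retrieval: index immediately
            # and report back to the cleaning service.
            return "feedback", {"type": "index", "mode": "full"}
        for doc in msg.get("docs", []):
            extract_fn(doc)  # finish extraction before any indexing starts
        return None, {"type": "index", "mode": "full"}

    try:
        results = [extract_fn(doc) for doc in clean_tmp]
    except Exception:
        # Ask the cleaning service to resend; Clean TMP is left untouched.
        return "resend", None

    calculate_tmp.extend(results)  # incremental results go to Calculate TMP
    clean_tmp.clear()              # Clean TMP is deleted after a successful run
    index_msg = {
        "type": "index",
        "mode": "incremental",
        "database": msg["database"],
        "collection": msg["collection"],
    }
    return None, index_msg
```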

Further, the data indexing service module in step S4 operates in the following specific steps:

A. Incremental storage: parse the database and collection where the incremental data resides from the message body, insert or update the data in the existing index, and after indexing completes, delete the temporary Calculate TMP generated in step B of step S3.

B. Full storage: parse the full data cleaning instruction in the message body, create a new index, and index the data in full.
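The difference between incremental and full storage can be sketched with dictionaries standing in for search indexes. This deliberately avoids any real Elasticsearch client call; the `index` and `new_index` message keys, the `id` field, and the function name are assumptions made for illustration only.

```python
def handle_index_message(msg, indexes, calculate_tmp, all_docs=None):
    """Apply one data index message body.

    `indexes` maps an index name to {doc_id: doc}. Returns the index name used.
    """
    if msg["mode"] == "incremental":
        name = msg["index"]
        target = indexes.setdefault(name, {})
        for doc in calculate_tmp:
            target[doc["id"]] = doc  # insert new ids, overwrite existing ones
        calculate_tmp.clear()        # Calculate TMP is deleted after indexing
        return name

    # Full storage: build a brand-new index and leave the old one in place.
    name = msg["new_index"]
    indexes[name] = {doc["id"]: doc for doc in (all_docs or [])}
    return name
```

Writing a full rebuild into a new index rather than the existing one means readers can keep querying the old index until the new one is complete; how the switch-over happens is not described in the source.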

The basic working principle of the invention is as follows:

Based on event-driven processing, an internet crawler automatically crawls each domestic official document publishing website. A distributed crawler captures the relevant fields, body text, pictures, and attachments of the official documents, and a Bloom filter makes incremental updates possible. The distributed unstructured database MongoDB provides targeted deduplication and mass storage. Distributed messaging addresses real-time processing and calculation. A graph database and a distributed search engine support searching and displaying feature content.

The beneficial effects of the invention are:

1. The invention uses a distributed crawler to crawl each official document publishing website, which effectively addresses the timeliness of acquiring and processing massive multi-source heterogeneous official document data.

2. The invention promptly cleans up the various temporary data tables generated during acquisition and processing, which avoids system failures, wasted server resources, dirty reads, and similar problems caused by unknown exceptions in the acquisition or processing flow.

3. The invention effectively solves the problem of storing official document data in a timely and effective manner.

Brief description of the drawings

Fig. 1 is a structural schematic diagram of the event-driven official document data acquisition and processing system of the invention.

Fig. 2 is a workflow diagram of the invention.

Fig. 3 is a detailed workflow diagram of the invention.

Detailed description of the embodiments

The technical solution of the invention is further described below, but the claimed scope is not limited to what is described.

Embodiment 1:

As shown in Fig. 1, an event-driven official document data acquisition and processing system comprises:

The data acquisition service module acquires official document data using a distributed crawler.

During acquisition, the data acquisition module establishes a unique index from the title, URL, and publication time of the official document data.

While acquiring official document data, the data acquisition module also generates a temporary table, Crawler TMP, for storing incremental data.

After the cleaning service module completes its work, it generates a temporary table, Clean TMP, for storing the incrementally cleaned data.

After the extraction and calculation service module completes its work, it generates Calculate TMP to store the incremental data.

As shown in Fig. 2, an event-driven official document data acquisition and processing method comprises the following steps:

S1. The data acquisition service module uses a distributed crawler, working in a distributed data acquisition mode, to crawl official document data published on websites and stores it in a distributed unstructured database. A unique index is established from the title, URL, and publication time; the URLs already crawled are recorded in a Bloom filter; and the crawled official document data is stored in the temporary table Crawler TMP. After the distributed crawler's periodic crawl event ends, the module builds either a message body naming the database and collection where the incremental data resides, or a full data cleaning message body, and sends it as a cleaning instruction. A Bloom filter can test whether an element belongs to a set, so recording URLs in it avoids the network bandwidth consumed by large-scale repeated crawling.
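The URL check in step S1 relies on a Bloom filter. The source does not give the filter's parameters or hash functions, so the following is a minimal sketch under assumptions: the bit-array size, the number of hashes, the SHA-256 based double hashing, and the `should_fetch` helper are all illustrative choices.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, a small rate of false positives."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k positions from two 64-bit parts of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def should_fetch(url: str, seen: BloomFilter) -> bool:
    """Return True the first time a URL is offered, then remember it."""
    if url in seen:
        return False
    seen.add(url)
    return True
```

A URL that has been added is always reported as seen, so no page is downloaded twice. The trade-off is that an unseen URL may occasionally be reported as seen and skipped; a larger bit array or a different number of hashes lowers that rate.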

The data cleaning service module in step S2 processes data in the following specific steps:

The data extraction and calculation service module in step S3 processes data in the following specific steps:

The data indexing service module in step S4 operates in the following specific steps:

The working principle of this embodiment is as follows: based on event-driven processing, an internet crawler automatically crawls each domestic official document publishing website. A distributed crawler captures the relevant fields, body text, pictures, and attachments of the official documents, and a Bloom filter makes incremental updates possible. The distributed unstructured database MongoDB provides targeted deduplication and mass storage. Distributed messaging addresses real-time processing and calculation. A graph database and a distributed search engine support searching and displaying feature content.

Claims

1. An event-driven official document data acquisition and processing system, characterized by comprising:

A data acquisition service module, for acquiring and storing multi-source heterogeneous official document data and issuing a cleaning instruction;

A data cleaning service module, for receiving and parsing the cleaning instruction issued by the data acquisition service module, determining whether incremental cleaning or full cleaning is required, and issuing a calculation message body;

A data extraction and calculation service module, for receiving and parsing the calculation message body issued by the data cleaning service, performing the extraction calculation on the data, giving feedback, and sending a data index message body;

A data indexing service module, for receiving and parsing the data index message body issued by the data extraction and calculation service module, and determining whether the data is stored incrementally or in full.

2. The event-driven official document data acquisition and processing system according to claim 1, characterized in that the data acquisition service module acquires official document data using a distributed crawler.

3. The event-driven official document data acquisition and processing system according to claim 1, characterized in that, during acquisition, the data acquisition module establishes a unique index from the title, URL, and publication time of the official document data.

4. The event-driven official document data acquisition and processing system according to claim 1, characterized in that, while acquiring official document data, the data acquisition module also generates a temporary table, Crawler TMP, for storing incremental data.

5. The event-driven official document data acquisition and processing system according to claim 1, characterized in that, after the cleaning service module completes its work, it generates a temporary table, Clean TMP, for storing the incrementally cleaned data.

6. The event-driven official document data acquisition and processing system according to claim 1, characterized in that, after the extraction and calculation service module completes its work, it generates Calculate TMP to store the incremental data.

7. An event-driven official document data acquisition and processing method based on the system of claim 1, characterized by comprising the following steps:

S1. The data acquisition service module uses a distributed crawler, working in a distributed data acquisition mode, to crawl official document data published on websites and stores it in a distributed unstructured database; a unique index is established from the title, URL, and publication time; the URLs already crawled are recorded in a Bloom filter; the crawled official document data is stored in the temporary table Crawler TMP; after the distributed crawler's periodic crawl event ends, the module builds either a message body naming the database and collection where the incremental data resides, or a full data cleaning message body, and sends it as a cleaning instruction;

S2. The data cleaning service module receives the cleaning instruction from step S1, parses it, and determines whether full cleaning or incremental cleaning is required; after the cleaning event completes, it sends the corresponding calculation message body;

S3. The data extraction and calculation service module receives and parses the calculation message body sent in step S2, determines whether the extraction calculation is a full calculation or an incremental calculation, and finally sends a data index message body;

S4. The data indexing service module receives and parses the data index message body sent in step S3, determines whether the data is stored incrementally or in full, and stores it in the Elasticsearch index database.

8. The official document data acquisition and processing method according to claim 7, characterized in that the data cleaning service module in step S2 processes data in the following specific steps:

(1) Incremental cleaning: parse the cleaning instruction in the database-and-collection message body and trigger the incremental data cleaning service; after incremental cleaning completes, delete the data in the temporary table Crawler TMP in the distributed non-relational database that holds the original distributed crawler data, and reply to the distributed crawler microservice that the message has been consumed; if cleaning fails, do not delete the data in Crawler TMP, and reply to the distributed crawler so that it resends the cleaning message; store the incrementally cleaned data in the temporary table Clean TMP, and finally send the calculation message body to step S3;

(2) Full cleaning: parse the cleaning instruction in the full data cleaning message body, trigger the full data cleaning service, and finally send the calculation message body to step S3.

9. The official document data acquisition and processing method according to claim 8, characterized in that the data extraction and calculation service module in step S3 processes data in the following specific steps:

A. After the extraction and calculation service module receives a full extraction calculation message body, it first consumes the message and triggers the extraction calculation:

If the extracted fields are not involved in the retrieval service, it sends the data index message body directly so that the data is indexed into storage, and at the same time sends a feedback message to the data cleaning service module;

If the extracted fields are involved in the retrieval service, it does not send the data index message body yet; it performs the extraction calculation and, only after the calculation completes, sends the full data index message body to step S4;

B. After the extraction and calculation service module receives an incremental extraction calculation message body, it first parses the message to obtain the database and collection name where the incremental data resides, and triggers the extraction calculation;

If an exception occurs during the extraction calculation, it sends feedback to the data cleaning service so that the extraction calculation message is resent; after the extraction and calculation service completes, it stores the incremental data in Calculate TMP, deletes the Clean TMP table generated in step (1), and finally sends the incremental data index message body to step S4.

10. The official document data acquisition and processing method according to claim 9, characterized in that the data indexing service module in step S4 operates in the following specific steps:

A. Incremental storage: parse the database and collection where the incremental data resides from the message body, insert or update the data in the existing index, and after indexing completes, delete the temporary Calculate TMP generated in step B of step S3.

B. Full storage: parse the full data cleaning instruction in the message body, create a new index, and index the data in full.