CN110147362A - An event-driven official document data acquisition and processing system and method - Google Patents
An event-driven official document data acquisition and processing system and method
- Publication number
- CN110147362A
- Authority
- CN
- China
- Prior art keywords
- data
- service module
- acquisition
- doc
- message body
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Computer And Data Communications (AREA)
Abstract
The invention discloses an event-driven official document data acquisition and processing system and method, belonging to the field of big data technology. The system comprises a data acquisition service module, a data cleaning service module, a data extraction and computation service module, a data indexing service module, and a log module. In the acquisition and processing method, distributed crawlers capture official document data published on websites in a distributed acquisition mode and pass it to the data acquisition service module for processing; the data extraction and computation service module performs extraction and computation; the data indexing service module then stores the results in a database; and the log module records the entire acquisition and processing flow. By crawling each official document publishing website with distributed crawlers, the invention effectively addresses the timeliness of acquiring and processing massive multi-source official document data.
Description
Technical field
The invention belongs to the field of big data technology, and in particular relates to an event-driven official document data acquisition and processing system and method.
Background
In the 21st century China has undergone large-scale informatization, and the internet has greatly changed how government information is disclosed. More and more government organizations publish public information online, and acquiring and processing this massive volume of official document data efficiently challenges existing information system architectures. In recent years the microservice architecture has become increasingly popular: it decouples an originally complex system by splitting it into small units, which frees up business processes, and this small-unit architecture pattern can deliver complex functionality with a relatively high degree of fit. In data acquisition, internet data currently arrives in large volumes within short periods, so scheduled tasks cannot clean and compute official document data in a timely and effective manner. In addition, large amounts of semi-structured text data increase the complexity of cleaning, and how to achieve incremental cleaning and computation is also an open problem.
In summary, traditional data acquisition and processing based on scheduled tasks can no longer meet the needs of complex business scenarios. The main problems are:
1. Multi-source heterogeneous internet crawler data arrives in large volumes within short periods, and existing acquisition and processing modes cannot acquire and process it quickly.
2. Acquisition scenarios vary and each processing flow is complex, so tasks cannot run unattended; when part of a flow fails, computation and cleaning waste system resources.
3. Official document data goes through a series of complex steps such as acquisition, cleaning, extraction, and training; because some data goes out of date quickly, the latest data cannot be updated in the corresponding business systems in time.
Summary of the invention
The object of the invention is, in view of the above problems, to provide an event-driven official document data acquisition and processing system and method that address the efficiency and automation problems in acquiring and processing official document data.
To achieve the above object, the invention adopts the following technical solution:
An event-driven official document data acquisition and processing system, comprising:
A data acquisition service module, for acquiring and storing multi-source heterogeneous official document data and, at the same time, issuing a cleaning instruction;
A data cleaning service module, for receiving and parsing the cleaning instruction issued by the data acquisition service module, determining whether incremental cleaning or full cleaning is required, and issuing a computation message body;
A data extraction and computation service module, for receiving and parsing the computation message body issued by the data cleaning service, performing extraction and computation on the data, giving feedback, and sending a data index message body;
A data indexing service module, for receiving and parsing the data index message body sent by the data extraction and computation service module, and determining whether storage is incremental or full;
A log module, for recording the entire process of acquisition and processing of official document data by the above modules.
Further, the data acquisition service module acquires official document data using distributed crawlers.
Further, during acquisition the data acquisition module establishes a unique index from the title, URL, and publication time of the official document data.
Further, while acquiring official document data the data acquisition module also generates a temporary table, Crawler TMP, for storing incremental data.
Further, after completing its work the data cleaning service module generates a temporary table, Clean TMP, for storing the incrementally cleaned data.
Further, after completing its work the extraction and computation service module generates Calculate TMP to store the incremental data.
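The unique index built from title, URL, and publication time can be illustrated with a small in-memory sketch. This is not the patent's implementation: the names `document_key` and `CrawlStore`, the SHA-256 hashing, and the dictionary-based storage are assumptions made for illustration; a real deployment would presumably rely on a unique index in the unstructured database itself.

```python
import hashlib


def document_key(title: str, url: str, published: str) -> str:
    """Derive a stable key from title, URL and publication time (hypothetical helper)."""
    # The unit separator keeps ("ab", "c") distinct from ("a", "bc").
    raw = "\x1f".join((title.strip(), url.strip(), published.strip()))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class CrawlStore:
    """In-memory stand-in for the document collection and the Crawler TMP table."""

    def __init__(self) -> None:
        self.documents = {}    # key -> document, plays the role of the unique index
        self.crawler_tmp = []  # incremental documents waiting for cleaning

    def save(self, doc: dict) -> bool:
        """Store a crawled document; return False if it was already stored."""
        key = document_key(doc["title"], doc["url"], doc["published"])
        if key in self.documents:
            return False
        self.documents[key] = doc
        self.crawler_tmp.append(doc)
        return True
```

Only documents that pass the uniqueness check reach `crawler_tmp`, so the temporary table holds exactly the increment of the current crawl.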
An event-driven official document data acquisition and processing method, comprising the following steps:
S1. The data acquisition service module uses distributed crawlers, in a distributed acquisition mode, to capture official document data published on websites and stores it in a distributed unstructured database. It establishes a unique index from the title, URL, and publication time, records crawled URLs in a Bloom filter, and stores the captured official document data in the temporary table Crawler TMP. After a periodic crawl event of the distributed crawler ends, it builds either a message body identifying the database and collection that hold the incremental data or a full data cleaning message body, forming the cleaning instruction, and sends it. A Bloom filter can be used to test whether an element is in a set, which avoids the network bandwidth consumed by large-scale repeated crawling.
S2. The data cleaning service module receives the cleaning instruction from step S1, parses it, and determines whether full cleaning or incremental cleaning is required; after the cleaning event completes, it sends the corresponding computation message body.
S3. The data extraction and computation service module receives and parses the computation message body sent in step S2, determines whether the extraction computation is full or incremental, and finally sends a data index message body.
S4. The data indexing service module receives and parses the data index message body sent in step S3, determines whether storage is incremental or full, and stores the data in the Elasticsearch index database.
S5. The sequence of events generated in steps S1-S5 is recorded and stored in the log module in the form of logs.
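Step S1 relies on a Bloom filter to remember which URLs have already been crawled. The following is a minimal sketch of that data structure, assuming a fixed bit-array size and double hashing derived from SHA-256; the class name, parameters, and hashing scheme are illustrative choices, not details taken from the patent.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, a small chance of false positives."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        # Double hashing: derive num_hashes bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def should_crawl(url: str, seen: BloomFilter) -> bool:
    """Return True and record the URL if it has (probably) not been crawled yet."""
    if url in seen:
        return False
    seen.add(url)
    return True
```

Because a Bloom filter can report a false positive, a crawler built this way may occasionally skip a URL it has not actually seen; the trade-off is constant memory regardless of how many URLs are recorded.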
Further, the data cleaning service module in step S2 processes data as follows:
1. Incremental cleaning: parse the cleaning instruction whose message body identifies the database and collection, and trigger the incremental data cleaning service. After incremental cleaning completes, delete the data in the temporary table Crawler TMP from the distributed non-relational database holding the original crawler data, and reply to the distributed crawler microservice that the message has been consumed. If cleaning fails, do not delete the data in Crawler TMP, and reply to the distributed crawler asking it to resend the cleaning message. Finally, store the incrementally cleaned data in the temporary table Clean TMP and send the computation message body to step S3;
2. Full cleaning: parse the cleaning instruction in the full data cleaning message body, trigger the full data cleaning service, and finally send the computation message body to step S3;
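The cleaning behaviour described for step S2 can be sketched as a single handler. Everything below is an assumption for illustration: the `CleanInstruction` fields, the table names used as dictionary keys, the whitespace-normalising `clean_text`, and the reply strings `"consumed"` and `"resend"` are hypothetical, and a plain dictionary stands in for the distributed database.

```python
from dataclasses import dataclass


@dataclass
class CleanInstruction:
    mode: str              # "incremental" or "full"
    database: str = ""     # where the incremental data lives
    collection: str = ""


def clean_text(doc: dict) -> dict:
    """Hypothetical cleaning rule: collapse runs of whitespace in the body."""
    return {**doc, "body": " ".join(doc["body"].split())}


def handle_clean(instruction: CleanInstruction, store: dict):
    """Return (reply to the crawler, computation message or None)."""
    if instruction.mode == "full":
        store["cleaned"] = [clean_text(d) for d in store["documents"]]
        return "consumed", {"mode": "full"}
    try:
        cleaned = [clean_text(d) for d in store["crawler_tmp"]]
    except Exception:
        # Cleaning failed: keep Crawler TMP and ask the crawler to resend.
        return "resend", None
    store["clean_tmp"] = cleaned   # Clean TMP holds the cleaned increment
    store["crawler_tmp"] = []      # Crawler TMP is emptied only on success
    message = {
        "mode": "incremental",
        "database": instruction.database,
        "collection": "clean_tmp",
    }
    return "consumed", message
```

Deleting the temporary data only after a successful run means a failed cleaning pass can simply be retried against the same Crawler TMP contents.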
Further, the data extraction and computation service module in step S3 processes data as follows:
a. When the extraction and computation service module receives a full extraction computation message body, it first consumes the message and triggers extraction computation:
If the extracted fields do not involve the retrieval service, it sends the data index message body directly so that the data is indexed into the index store, and sends a feedback message to the data cleaning service module;
If the extracted fields involve the retrieval service, it does not send the data index message body yet; it performs the extraction computation and, only after that completes, sends the full data index message body to step S4;
b. When the extraction and computation service module receives an incremental extraction computation message body, it first parses the message to obtain the database and collection name where the incremental data resides, and triggers extraction computation;
If an exception occurs during extraction computation, it sends feedback asking the data cleaning service to resend the extraction computation message. After the extraction computation service completes, it stores the incremental data in Calculate TMP, deletes the Clean TMP table generated in step 1, and finally sends the incremental data index message body to step S4.
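The branching in step S3 can be summarised in a short sketch. The message shapes, the `touches_search` flag, the word-count "extraction", and the table names are all assumptions introduced here; in particular, the sketch assumes that extraction still runs after the early index message in the full case, which the text does not state explicitly.

```python
def extract_fields(doc: dict) -> dict:
    """Hypothetical extraction step: derive a word count from the cleaned body."""
    return {**doc, "word_count": len(doc["body"].split())}


def handle_compute(message: dict, store: dict, touches_search: bool = True) -> list:
    """Return the outgoing (recipient, payload) messages for one computation message."""
    if message["mode"] == "full":
        outbox = []
        if not touches_search:
            # Fields unrelated to retrieval: indexing can start right away.
            outbox.append(("index", {"mode": "full"}))
            outbox.append(("cleaning", "consumed"))
        store["computed"] = [extract_fields(d) for d in store["cleaned"]]
        if touches_search:
            # Fields used by retrieval: index only after extraction has finished.
            outbox.append(("index", {"mode": "full"}))
        return outbox

    source = message["collection"]  # e.g. the Clean TMP table
    try:
        computed = [extract_fields(d) for d in store[source]]
    except Exception:
        # Extraction failed: keep Clean TMP and ask cleaning to resend.
        return [("cleaning", "resend")]
    store["calculate_tmp"] = computed  # Calculate TMP holds the computed increment
    del store[source]                  # Clean TMP is no longer needed
    return [("index", {"mode": "incremental", "collection": "calculate_tmp"})]
```

As in the cleaning step, the upstream temporary table is removed only once the downstream one has been written, so a failure leaves the input intact for a retry.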
Further, the data indexing service module in step S4 works as follows:
A. Incremental storage: parse the database and collection where the incremental data resides from the message body, insert or update the data in the existing index, and after indexing completes delete the temporary Calculate TMP generated in step b.
B. Full storage: parse the full data cleaning instruction in the message body, create a new index, and build a full index of the data.
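The difference between incremental and full storage in step S4 is sketched below with an in-memory index. The class `IndexService`, the versioned index names, and the dictionary-based store are hypothetical stand-ins; the patent stores data in Elasticsearch, whose client API is deliberately not used here.

```python
class IndexService:
    """In-memory stand-in for the search index used in step S4."""

    def __init__(self) -> None:
        self.indexes = {}   # index name -> {document id -> document}
        self.active = None  # name of the index currently served
        self._version = 0

    def handle(self, message: dict, store: dict) -> str:
        """Apply one data index message and return the name of the active index."""
        if message["mode"] == "full":
            # Full storage: build a brand-new index and switch over to it.
            self._version += 1
            name = f"documents_v{self._version}"
            self.indexes[name] = {d["id"]: d for d in store["computed"]}
            self.active = name
            return name
        if self.active is None:
            raise RuntimeError("incremental storage needs an existing index")
        index = self.indexes[self.active]
        for doc in store.get("calculate_tmp", []):
            index[doc["id"]] = doc          # insert a new id or update an existing one
        store.pop("calculate_tmp", None)    # drop Calculate TMP once indexed
        return self.active
```

Building the full index under a new name leaves the previous index untouched until the switch, while incremental messages only touch the documents listed in Calculate TMP.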
The basic working principle of the invention is as follows:
Driven by events, internet crawlers automatically crawl domestic official document publishing websites. Distributed crawlers capture the relevant fields, body text, pictures, and attachments of official documents, and a Bloom filter enables incremental updates. The distributed unstructured database MongoDB provides targeted deduplication and mass storage. Distributed messaging addresses real-time processing and computation. A graph database and a distributed search engine support searching and presenting feature content.
The beneficial effects of the invention are:
1. The invention crawls each official document publishing website with distributed crawlers, effectively addressing the timeliness of acquiring and processing massive multi-source heterogeneous official document data.
2. The invention promptly cleans up the temporary data tables generated during document acquisition and processing, avoiding system failures, wasted server resources, dirty reads, and similar problems caused by unknown exceptions in the acquisition or processing flow.
3. The invention effectively addresses the timely and effective storage of official document data.
Brief description of the drawings
Fig. 1 is a structural schematic diagram of the event-driven official document data acquisition and processing system of the invention.
Fig. 2 is a workflow diagram of the invention.
Fig. 3 is a detailed workflow diagram of the invention.
Detailed description of the embodiments
The technical solution of the invention is further described below, but the claimed scope is not limited to what is described.
Embodiment 1:
As shown in Fig. 1, an event-driven official document data acquisition and processing system comprises:
A data acquisition service module, for acquiring and storing multi-source heterogeneous official document data and, at the same time, issuing a cleaning instruction;
A data cleaning service module, for receiving and parsing the cleaning instruction issued by the data acquisition service module, determining whether incremental cleaning or full cleaning is required, and issuing a computation message body;
A data extraction and computation service module, for receiving and parsing the computation message body issued by the data cleaning service, performing extraction and computation on the data, giving feedback, and sending a data index message body;
A data indexing service module, for receiving and parsing the data index message body sent by the data extraction and computation service module, and determining whether storage is incremental or full;
A log module, for recording the entire process of acquisition and processing of official document data by the above modules.
The data acquisition service module acquires official document data using distributed crawlers.
During acquisition the data acquisition module establishes a unique index from the title, URL, and publication time of the official document data.
While acquiring official document data the data acquisition module also generates a temporary table, Crawler TMP, for storing incremental data.
After completing its work the data cleaning service module generates a temporary table, Clean TMP, for storing the incrementally cleaned data.
After completing its work the extraction and computation service module generates Calculate TMP to store the incremental data.
As shown in Fig. 2, an event-driven official document data acquisition and processing method comprises the following steps:
S1. The data acquisition service module uses distributed crawlers, in a distributed acquisition mode, to capture official document data published on websites and stores it in a distributed unstructured database. It establishes a unique index from the title, URL, and publication time, records crawled URLs in a Bloom filter, and stores the captured official document data in the temporary table Crawler TMP. After a periodic crawl event of the distributed crawler ends, it builds either a message body identifying the database and collection that hold the incremental data or a full data cleaning message body, forming the cleaning instruction, and sends it. A Bloom filter can be used to test whether an element is in a set, which avoids the network bandwidth consumed by large-scale repeated crawling.
S2. The data cleaning service module receives the cleaning instruction from step S1, parses it, and determines whether full cleaning or incremental cleaning is required; after the cleaning event completes, it sends the corresponding computation message body.
S3. The data extraction and computation service module receives and parses the computation message body sent in step S2, determines whether the extraction computation is full or incremental, and finally sends a data index message body.
S4. The data indexing service module receives and parses the data index message body sent in step S3, determines whether storage is incremental or full, and stores the data in the Elasticsearch index database.
S5. The sequence of events generated in steps S1-S5 is recorded and stored in the log module in the form of logs.
The data cleaning service module in step S2 processes data as follows:
1. Incremental cleaning: parse the cleaning instruction whose message body identifies the database and collection, and trigger the incremental data cleaning service. After incremental cleaning completes, delete the data in the temporary table Crawler TMP from the distributed non-relational database holding the original crawler data, and reply to the distributed crawler microservice that the message has been consumed. If cleaning fails, do not delete the data in Crawler TMP, and reply to the distributed crawler asking it to resend the cleaning message. Finally, store the incrementally cleaned data in the temporary table Clean TMP and send the computation message body to step S3;
2. Full cleaning: parse the cleaning instruction in the full data cleaning message body, trigger the full data cleaning service, and finally send the computation message body to step S3;
The data extraction and computation service module in step S3 processes data as follows:
a. When the extraction and computation service module receives a full extraction computation message body, it first consumes the message and triggers extraction computation:
If the extracted fields do not involve the retrieval service, it sends the data index message body directly so that the data is indexed into the index store, and sends a feedback message to the data cleaning service module;
If the extracted fields involve the retrieval service, it does not send the data index message body yet; it performs the extraction computation and, only after that completes, sends the full data index message body to step S4;
b. When the extraction and computation service module receives an incremental extraction computation message body, it first parses the message to obtain the database and collection name where the incremental data resides, and triggers extraction computation;
If an exception occurs during extraction computation, it sends feedback asking the data cleaning service to resend the extraction computation message. After the extraction computation service completes, it stores the incremental data in Calculate TMP, deletes the Clean TMP table generated in step 1, and finally sends the incremental data index message body to step S4.
The data indexing service module in step S4 works as follows:
A. Incremental storage: parse the database and collection where the incremental data resides from the message body, insert or update the data in the existing index, and after indexing completes delete the temporary Calculate TMP generated in step b.
B. Full storage: parse the full data cleaning instruction in the message body, create a new index, and build a full index of the data.
The working principle of this embodiment is as follows: driven by events, internet crawlers automatically crawl domestic official document publishing websites. Distributed crawlers capture the relevant fields, body text, pictures, and attachments of official documents, and a Bloom filter enables incremental updates. The distributed unstructured database MongoDB provides targeted deduplication and mass storage. Distributed messaging addresses real-time processing and computation. A graph database and a distributed search engine support searching and presenting feature content.
Claims (10)
1. An event-driven official document data acquisition and processing system, characterized by comprising:
A data acquisition service module, for acquiring and storing multi-source heterogeneous official document data and, at the same time, issuing a cleaning instruction;
A data cleaning service module, for receiving and parsing the cleaning instruction issued by the data acquisition service module, determining whether incremental cleaning or full cleaning is required, and issuing a computation message body;
A data extraction and computation service module, for receiving and parsing the computation message body issued by the data cleaning service, performing extraction and computation on the data, giving feedback, and sending a data index message body;
A data indexing service module, for receiving and parsing the data index message body sent by the data extraction and computation service module, and determining whether storage is incremental or full;
A log module, for recording the entire process of acquisition and processing of official document data by the above modules.
2. The event-driven official document data acquisition and processing system according to claim 1, characterized in that: the data acquisition service module acquires official document data using distributed crawlers.
3. The event-driven official document data acquisition and processing system according to claim 1, characterized in that: during acquisition the data acquisition module establishes a unique index from the title, URL, and publication time of the official document data.
4. The event-driven official document data acquisition and processing system according to claim 1, characterized in that: while acquiring official document data the data acquisition module also generates a temporary table, Crawler TMP, for storing incremental data.
5. The event-driven official document data acquisition and processing system according to claim 1, characterized in that: after completing its work the data cleaning service module generates a temporary table, Clean TMP, for storing the incrementally cleaned data.
6. The event-driven official document data acquisition and processing system according to claim 1, characterized in that: after completing its work the extraction and computation service module generates Calculate TMP to store the incremental data.
7. An event-driven official document data acquisition and processing method based on the system of claim 1, characterized by comprising the following steps:
S1. The data acquisition service module uses distributed crawlers, in a distributed acquisition mode, to capture official document data published on websites and stores it in a distributed unstructured database; it establishes a unique index from the title, URL, and publication time, records crawled URLs in a Bloom filter, and stores the captured official document data in the temporary table Crawler TMP; after a periodic crawl event of the distributed crawler ends, it builds either a message body identifying the database and collection that hold the incremental data or a full data cleaning message body, forming the cleaning instruction, and sends it;
S2. The data cleaning service module receives the cleaning instruction from step S1, parses it, and determines whether full cleaning or incremental cleaning is required; after the cleaning event completes, it sends the corresponding computation message body;
S3. The data extraction and computation service module receives and parses the computation message body sent in step S2, determines whether the extraction computation is full or incremental, and finally sends a data index message body;
S4. The data indexing service module receives and parses the data index message body sent in step S3, determines whether storage is incremental or full, and stores the data in the Elasticsearch index database;
S5. The sequence of events generated in steps S1-S5 is recorded and stored in the log module in the form of logs.
8. The official document data acquisition and processing method according to claim 7, characterized in that the data cleaning service module in step S2 processes data as follows:
1. Incremental cleaning: parse the cleaning instruction whose message body identifies the database and collection, and trigger the incremental data cleaning service; after incremental cleaning completes, delete the data in the temporary table Crawler TMP from the distributed non-relational database holding the original crawler data, and reply to the distributed crawler microservice that the message has been consumed; if cleaning fails, do not delete the data in Crawler TMP, and reply to the distributed crawler asking it to resend the cleaning message; finally, store the incrementally cleaned data in the temporary table Clean TMP and send the computation message body to step S3;
2. Full cleaning: parse the cleaning instruction in the full data cleaning message body, trigger the full data cleaning service, and finally send the computation message body to step S3.
9. The official document data acquisition and processing method according to claim 8, characterized in that the data extraction and computation service module in step S3 processes data as follows:
a. When the extraction and computation service module receives a full extraction computation message body, it first consumes the message and triggers extraction computation:
If the extracted fields do not involve the retrieval service, it sends the data index message body directly so that the data is indexed into the index store, and sends a feedback message to the data cleaning service module;
If the extracted fields involve the retrieval service, it does not send the data index message body yet; it performs the extraction computation and, only after that completes, sends the full data index message body to step S4;
b. When the extraction and computation service module receives an incremental extraction computation message body, it first parses the message to obtain the database and collection name where the incremental data resides, and triggers extraction computation;
If an exception occurs during extraction computation, it sends feedback asking the data cleaning service to resend the extraction computation message; after the extraction computation service completes, it stores the incremental data in Calculate TMP, deletes the Clean TMP table generated in step 1, and finally sends the incremental data index message body to step S4.
10. The official document data acquisition and processing method according to claim 9, characterized in that the data indexing service module in step S4 works as follows:
A. Incremental storage: parse the database and collection where the incremental data resides from the message body, insert or update the data in the existing index, and after indexing completes delete the temporary Calculate TMP generated in step b.
B. Full storage: parse the full data cleaning instruction in the message body, create a new index, and build a full index of the data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910271964.2A CN110147362A (en) | 2019-04-04 | 2019-04-04 | An event-driven official document data acquisition and processing system and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910271964.2A CN110147362A (en) | 2019-04-04 | 2019-04-04 | An event-driven official document data acquisition and processing system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110147362A true CN110147362A (en) | 2019-08-20 |
Family
ID=67589343
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910271964.2A Pending CN110147362A (en) | 2019-04-04 | 2019-04-04 | An event-driven official document data acquisition and processing system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110147362A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112650865A (en) * | 2021-01-27 | 2021-04-13 | 南威软件股份有限公司 | Method and system for solving multi-region license data conflict based on flexible rule |
CN113821573A (en) * | 2021-08-27 | 2021-12-21 | 济南浪潮数据技术有限公司 | Mass data rapid retrieval service construction method, system, terminal and storage medium |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102411569A (en) * | 2010-09-20 | 2012-04-11 | 上海众融信息技术有限公司 | Database conversion and cleaning information processing method |
CN103617174A (en) * | 2013-11-04 | 2014-03-05 | 同济大学 | Distributed searching method based on cloud computing |
CN104102737A (en) * | 2014-07-28 | 2014-10-15 | 中国农业银行股份有限公司 | Historical data storage method and system |
CN104951512A (en) * | 2015-05-27 | 2015-09-30 | 中国科学院信息工程研究所 | Public sentiment data collection method and system based on Internet |
CN105488187A (en) * | 2015-12-02 | 2016-04-13 | 北京四达时代软件技术股份有限公司 | Method and device for extracting multi-source heterogeneous data increment |
CN106682153A (en) * | 2016-12-23 | 2017-05-17 | 山东浪潮商用系统有限公司 | Data extraction tool on basis of data modeling and data increment implementation |
CN106776951A (en) * | 2016-12-02 | 2017-05-31 | 航天星图科技(北京)有限公司 | One kind cleaning contrast storage method |
CN107103067A (en) * | 2017-04-18 | 2017-08-29 | 北京思特奇信息技术股份有限公司 | A kind of method of data synchronization and system based on search engine |
CN107480858A (en) * | 2017-07-10 | 2017-12-15 | 武汉楚鼎信息技术有限公司 | A kind of Aided intelligent decision-making and method based on the analysis of stock big data |
CN107895009A (en) * | 2017-11-10 | 2018-04-10 | 北京国信宏数科技有限责任公司 | One kind is based on distributed internet data acquisition method and system |
CN107943991A (en) * | 2017-12-01 | 2018-04-20 | 成都嗨翻屋文化传播有限公司 | A distributed crawler framework and implementation method based on an in-memory database |
CN108121706A (en) * | 2016-11-28 | 2018-06-05 | 央视国际网络无锡有限公司 | An optimization method for a distributed crawler |
CN108228815A (en) * | 2017-12-29 | 2018-06-29 | 安徽迈普德康信息科技有限公司 | A kind of real estate data integrated system and method |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112650865A (en) * | 2021-01-27 | 2021-04-13 | 南威软件股份有限公司 | Method and system for solving multi-region license data conflict based on flexible rule |
CN112650865B (en) * | 2021-01-27 | 2021-11-09 | 南威软件股份有限公司 | Method and system for solving multi-region license data conflict based on flexible rule |
CN113821573A (en) * | 2021-08-27 | 2021-12-21 | 济南浪潮数据技术有限公司 | Mass data rapid retrieval service construction method, system, terminal and storage medium |
Similar Documents
Publication | Title |
---|---|
CN109543086B (en) | Network data acquisition and display method oriented to multiple data sources |
CN100471121C (en) | Decoding method and decoder |
CN103294732B (en) | Webpage capture method and crawler |
US9552435B2 (en) | Method and system for incremental collection of forum replies |
CN107895009A (en) | One kind is based on distributed internet data acquisition method and system |
CN105320740B (en) | Acquisition method and acquisition system for WeChat articles and public platforms |
CN102760151B (en) | Implementation method of open source software acquisition and searching system |
CN105956175A (en) | Webpage content crawling method and device |
Qiu et al. | Analysis of user web traffic with a focus on search activities. |
CN110147362A (en) | One kind is based on the acquisition of event driven DOC DATA and processing system and its method |
CN103177380A (en) | Method and device for optimizing advertisement delivery effect by combining user groups and pre-delivery |
CN101727486A (en) | Web forum information extraction system |
CN108924199A (en) | Method, apparatus, computer storage medium and terminal device for crawlers to automatically obtain network proxy servers |
CN108959539B (en) | Rule-configurable webpage data analysis method |
CN108133041A (en) | Data collecting system and method based on web crawlers and data transfer technology |
CN109150585A (en) | A kind of network O&M failure solution, system, device and storage medium |
CN107766234A (en) | A kind of assessment method, the apparatus and system of the webpage health degree based on mobile device |
CN110417873A (en) | A kind of network information extraction system for realizing record webpage interactive operation |
CN113259467A (en) | Webpage asset fingerprint tag identification and discovery method based on big data |
CN110059085A (en) | A JSON data parsing and modeling method oriented to Web 2.0 |
CN113157521A (en) | Monitoring method and monitoring system for whole life cycle of block chain |
CN105825399A (en) | Internet based B2B e-commerce information collecting method |
CN113760878A (en) | Micro-service architecture log analysis method and system based on domestic CPU and operating system |
CN103279527B (en) | A kind of user interest network address method for digging and device |
CN103902707B (en) | Expert system URL cleans "rubbish" content filtering method of knowledge base |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190820 |