CN110147362A - Event-driven official document data acquisition and processing system and method - Google Patents

Info

Publication number
CN110147362A
Authority
CN
China
Prior art keywords
data
service module
acquisition
doc
message body
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910271964.2A
Other languages
Chinese (zh)
Inventor
马新凡
王鹏
刘福强
李泽松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Division Big Data Research Institute Co Ltd
Original Assignee
Division Big Data Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Division Big Data Research Institute Co Ltd
Priority to CN201910271964.2A
Publication of CN110147362A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention discloses an event-driven official document data acquisition and processing system and method, belonging to the field of big data technology. The system comprises a data acquisition service module, a data cleaning service module, a data extraction and calculation service module, a data indexing service module, and a log module. The acquisition and processing method comprises: a distributed crawler, working in a distributed data acquisition mode, crawls official document data published on websites and sends it to the data acquisition service module for processing; the data extraction and calculation service module performs extraction and calculation; the data indexing service module then stores the results in a database; and the log module records the entire acquisition and processing procedure. The invention uses a distributed crawler to crawl each official document publishing website, effectively solving the timeliness problem of acquiring and processing massive multi-source official document data.

Description

Event-driven official document data acquisition and processing system and method
Technical field
The invention belongs to the field of big data technology, and more particularly relates to an event-driven official document data acquisition and processing system and method.
Background
In the 21st century, China has undergone large-scale informatization, and the Internet has brought great changes to government information disclosure. More and more government organizations publish public information over the Internet, and acquiring and processing this massive volume of official document data efficiently challenges existing information system architectures. In recent years the microservice architecture has become increasingly popular: it decouples an originally complex system by splitting it into small units, which frees up business processes, and this small-unit architecture pattern can accomplish complex functions with a relatively high degree of fit. In data acquisition, Internet data currently arrives in large volumes over short periods, and timed tasks cannot clean and calculate official document data promptly and effectively. In addition, large amounts of semi-structured text data increase the complexity of cleaning, and how to achieve incremental cleaning and calculation is also an open problem.
In summary, the traditional data acquisition and processing mode based on timed tasks can no longer meet the needs of data acquisition and processing in complex business scenarios. The main problems are:
1. Multi-source heterogeneous Internet crawler data arrives in large volumes over short periods, and existing acquisition and processing modes cannot acquire and process it quickly.
2. Data acquisition scenarios vary and each processing flow is complex, so tasks cannot run unattended; when part of a flow fails, calculation and cleaning waste system resources.
3. Official document data goes through a series of complex steps such as acquisition, cleaning, extraction, and training; because some data becomes outdated quickly, the latest data cannot be updated to the corresponding business system in time.
Summary of the invention
The object of the invention is: in view of the above problems, to provide an event-driven official document data acquisition and processing system and method that solve the efficiency and automation problems in acquiring and processing official document data.
To achieve the above object, the invention adopts the following technical solution:
An event-driven official document data acquisition and processing system, comprising:
a data acquisition service module, for acquiring and storing multi-source heterogeneous official document data and, at the same time, issuing a cleaning instruction;
a data cleaning service module, for receiving and parsing the cleaning instruction issued by the data acquisition service module, determining whether incremental cleaning or full cleaning is required, and issuing a calculation message body;
a data extraction and calculation service module, for receiving and parsing the calculation message body issued by the data cleaning service module, performing extraction and calculation on the data, returning feedback, and sending a data index message body;
a data indexing service module, for receiving and parsing the data index message body sent by the data extraction and calculation service module, and determining whether incremental storage or full storage is required;
a log module, for recording the entire process by which the above modules acquire and process the official document data.
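The chain of modules above can be illustrated with a minimal Python sketch. It is an assumption-laden illustration, not the patented implementation: `MessageBus`, the topic names (`clean`, `calculate`, `index`), and the message fields are all hypothetical, and an in-memory queue stands in for whatever distributed messaging system a real deployment would use.

```python
from collections import defaultdict, deque


class MessageBus:
    """Tiny in-memory stand-in for a distributed message queue (assumption)."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.queue = deque()
        self.log = []  # plays the role of the log module

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def publish(self, topic, body):
        self.log.append((topic, body["mode"]))  # record every event
        self.queue.append((topic, body))

    def run(self):
        # Deliver messages until no service has anything left to emit.
        while self.queue:
            topic, body = self.queue.popleft()
            for handler in self.handlers[topic]:
                handler(body)


bus = MessageBus()
indexed = []


def cleaning_service(body):
    # Parse the cleaning instruction, then emit a calculation message body.
    docs = [d.strip() for d in body["docs"]]
    bus.publish("calculate", {"mode": body["mode"], "docs": docs})


def extraction_service(body):
    # Extract a field from each document, then emit a data index message body.
    docs = [{"text": d, "length": len(d)} for d in body["docs"]]
    bus.publish("index", {"mode": body["mode"], "docs": docs})


def index_service(body):
    indexed.extend(body["docs"])


bus.subscribe("clean", cleaning_service)
bus.subscribe("calculate", extraction_service)
bus.subscribe("index", index_service)

# The acquisition service ends a crawl by emitting a cleaning instruction.
bus.publish("clean", {"mode": "incremental", "docs": ["  Notice A ", "Notice B"]})
bus.run()
```

Each service only reacts to the message it receives and emits the next one, so no stage needs a timer or direct knowledge of the stage after it.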
Further, the data acquisition service module acquires official document data using a distributed crawler.
Further, during acquisition the data acquisition module establishes a unique index from the title, URL, and publication time of the official document data.
Further, while acquiring official document data the data acquisition module also generates a temporary table Crawler TMP for storing incremental data.
Further, the data cleaning service module generates a temporary table Clean TMP after completing its work, for storing the incrementally cleaned data.
Further, the extraction and calculation service module generates Calculate TMP after completing its work, for storing the incremental data.
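The unique index built from title, URL, and publication time can be sketched as a composite key. This is only an illustration under assumptions: the hashing scheme, the `DocumentStore` class, and the field names are hypothetical, and a real deployment would more likely declare a compound unique index in the database itself.

```python
import hashlib


def unique_key(title, url, published):
    """Derive one key from the three fields named in the text."""
    # A separator that cannot appear in normal text keeps fields from blending.
    raw = "\x1f".join((title.strip(), url.strip(), published.strip()))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class DocumentStore:
    """In-memory stand-in for a collection with a unique index (assumption)."""

    def __init__(self):
        self.docs = {}

    def insert(self, doc):
        key = unique_key(doc["title"], doc["url"], doc["published"])
        if key in self.docs:
            return False  # duplicate: the unique index rejects it
        self.docs[key] = doc
        return True


store = DocumentStore()
doc = {"title": "Notice A", "url": "https://example.gov/a", "published": "2019-04-04"}
first = store.insert(doc)
second = store.insert(dict(doc))  # same title, URL, and publication time
```

Here `first` is accepted and `second` is rejected, which is the behavior a unique index enforces at write time.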
An event-driven official document data acquisition and processing method, comprising the following steps:
S1. The data acquisition service module crawls official document data published on websites using a distributed crawler in a distributed data acquisition mode and stores it in a distributed unstructured database. It then establishes a unique index from the title, URL, and publication time, records each crawled URL in a Bloom filter, and stores the crawled official document data in the temporary table Crawler TMP. After a periodic crawl event of the distributed crawler ends, it builds either a message body naming the database and collection that hold the newly added data, or a full data cleaning message body, and sends it as a cleaning instruction. A Bloom filter can test whether an element is in a set, so recording URLs in it avoids the network bandwidth consumed by large-scale repeated crawling.
S2. The data cleaning service module receives the cleaning instruction from step S1, parses it, and determines whether full cleaning or incremental cleaning is required; after the cleaning event completes, it sends the corresponding calculation message body.
S3. The data extraction and calculation service module receives and parses the calculation message body sent in step S2, determines whether the extraction and calculation is full or incremental, and finally sends a data index message body.
S4. The data indexing service module receives and parses the data index message body sent in step S3, determines whether incremental storage or full storage is required, and stores the data in an Elasticsearch index database.
S5. The sequence of events generated in steps S1-S5 is recorded and stored in the log module in the form of logs.
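The Bloom filter record used in step S1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the bit-array size, the number of hash functions, the salted SHA-256 hashing, and the `crawl_new` helper are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, rare false positives."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def crawl_new(urls, seen):
    """Return only URLs not yet recorded, recording them as a side effect."""
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


seen = BloomFilter()
first_pass = crawl_new(["https://example.gov/a", "https://example.gov/b"], seen)
second_pass = crawl_new(["https://example.gov/a", "https://example.gov/c"], seen)
```

On the second crawl only the unseen URL is fetched. The trade-off is that a Bloom filter can occasionally report an unseen URL as seen, so a document may be skipped with small probability; the filter size and hash count control that rate.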
Further, the data cleaning service module in step S2 processes data as follows:
① Incremental cleaning: parse the cleaning instruction carrying the database-and-collection message body and trigger the incremental data cleaning service. After incremental cleaning completes, delete the data in the temporary table Crawler TMP from the distributed non-relational database that holds the original distributed crawler data, and reply to the distributed crawler microservice that the message has been consumed. If cleaning fails, do not delete the data in the temporary table Crawler TMP, and reply to the distributed crawler asking it to resend the cleaning message. Finally, store the incrementally cleaned data in the temporary table Clean TMP and send the calculation message body to step S3;
② Full cleaning: parse the cleaning instruction carrying the full data cleaning message body, trigger the full data cleaning service, and finally send the calculation message body to step S3.
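The two cleaning branches can be sketched as a single message handler. Everything here is a hypothetical illustration: the message fields (`mode`, `collection`), the table names, the whitespace-collapsing rule in `clean_text`, and the dictionary standing in for the database are assumptions, not details from the source.

```python
def clean_text(raw):
    """Hypothetical cleaning rule: collapse whitespace; reject empty text."""
    text = " ".join(raw.split())
    if not text:
        raise ValueError("empty document")
    return text


def handle_clean_message(message, db):
    """Process one cleaning instruction; `db` maps table names to lists."""
    if message["mode"] == "full":
        cleaned = [clean_text(d) for d in db["documents"]]
        return {"type": "calculate", "mode": "full", "docs": cleaned}
    source = message["collection"]
    try:
        cleaned = [clean_text(d) for d in db[source]]
    except ValueError:
        # Keep the crawler's temporary table and ask for the message again.
        return {"type": "resend_clean", "collection": source}
    db["clean_tmp"] = cleaned  # store the incrementally cleaned data
    del db[source]             # drop the crawler's temporary table
    return {"type": "calculate", "mode": "incremental", "collection": "clean_tmp"}


db = {"documents": [], "crawler_tmp": ["  Notice   A ", "Notice B"]}
ok = handle_clean_message({"mode": "incremental", "collection": "crawler_tmp"}, db)

bad_db = {"documents": [], "crawler_tmp": ["   "]}
retry = handle_clean_message({"mode": "incremental", "collection": "crawler_tmp"}, bad_db)
```

The temporary table is deleted only after cleaning succeeds, so a failed run leaves the crawled data in place for the resent message to pick up.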
Further, the data extraction and calculation service module in step S3 processes data as follows:
a. When the extraction and calculation service module receives a full extraction and calculation message body, it first consumes the message and triggers extraction and calculation:
If the extracted fields do not involve the retrieval service, it sends the data index message body directly so that the data is indexed and stored, and at the same time sends a feedback message to the data cleaning service module;
If the extracted fields involve the retrieval service, it does not send the data index message body yet; it performs extraction and calculation and, only after that completes, sends the full data index message body to step S4;
b. When the extraction and calculation service module receives an incremental extraction and calculation message body, it first parses the message to obtain the names of the database and collection that hold the incremental data, and triggers extraction and calculation;
If an exception occurs during extraction and calculation, it sends feedback asking the data cleaning service to resend the extraction and calculation message. After the extraction and calculation service completes, it stores the incremental data in Calculate TMP, deletes the Clean TMP table generated in step ①, and finally sends the incremental data index message body to step S4.
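The decision logic of the extraction and calculation service can be sketched as a function that returns the messages it would emit. The message fields (`mode`, `affects_search`, `collection`), the word-count extraction in `extract_fields`, and the table names are hypothetical placeholders; the source does not specify what is extracted or how messages are encoded.

```python
def extract_fields(text):
    """Hypothetical extraction: derive a word count from the cleaned text."""
    if not isinstance(text, str):
        raise TypeError("document text must be a string")
    return {"text": text, "words": len(text.split())}


def handle_calculate_message(message, db):
    """Return the list of messages the extraction service would emit."""
    if message["mode"] == "full":
        if not message["affects_search"]:
            # Indexing need not wait: send the index message immediately.
            return [
                {"type": "index", "mode": "full"},
                {"type": "feedback", "to": "cleaning"},
            ]
        # Fields feed the retrieval service: finish extraction first.
        db["calculated"] = [extract_fields(d) for d in db["documents"]]
        return [{"type": "index", "mode": "full"}]
    source = message["collection"]
    try:
        results = [extract_fields(d) for d in db[source]]
    except TypeError:
        # Ask the cleaning service to resend the calculation message.
        return [{"type": "resend_calculate", "to": "cleaning"}]
    db["calculate_tmp"] = results
    del db[source]  # drop the Clean TMP table produced by cleaning
    return [{"type": "index", "mode": "incremental", "collection": "calculate_tmp"}]


db = {"clean_tmp": ["Notice A", "Notice B two"]}
out = handle_calculate_message({"mode": "incremental", "collection": "clean_tmp"}, db)
fast = handle_calculate_message({"mode": "full", "affects_search": False}, {})
failed = handle_calculate_message(
    {"mode": "incremental", "collection": "clean_tmp"}, {"clean_tmp": [None]}
)
```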
Further, the data indexing service module in step S4 operates as follows:
A. Incremental storage: parse the database and collection that hold the incremental data from the message body, insert or update the data in the existing index, and, after indexing completes, delete the temporary Calculate TMP generated in step b.
B. Full storage: parse the full data cleaning instruction in the message body, create a new index, and index all the data.
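The difference between incremental and full storage can be sketched with an in-memory index. This is not Elasticsearch client code: `SearchIndex`, the versioned index names, and the switch of the active index after a full build are assumptions made for illustration.

```python
class SearchIndex:
    """In-memory stand-in for a search index database (assumption)."""

    def __init__(self):
        self.indices = {"docs_v1": {}}
        self.active = "docs_v1"


def handle_index_message(message, db, search):
    """Apply one data index message and return the name of the active index."""
    if message["mode"] == "incremental":
        source = message["collection"]
        for doc in db[source]:
            # Insert new documents or overwrite existing ones by ID.
            search.indices[search.active][doc["id"]] = doc
        del db[source]  # drop the Calculate TMP table once indexed
    else:
        # Full storage: build a new index, then make it the active one
        # (switching the active index is an assumption of this sketch).
        name = f"docs_v{len(search.indices) + 1}"
        search.indices[name] = {doc["id"]: doc for doc in db["calculated"]}
        search.active = name
    return search.active


search = SearchIndex()
db = {
    "calculated": [{"id": 1, "title": "A"}],
    "calculate_tmp": [{"id": 1, "title": "A2"}, {"id": 2, "title": "B"}],
}
after_incremental = handle_index_message(
    {"mode": "incremental", "collection": "calculate_tmp"}, db, search
)
after_full = handle_index_message({"mode": "full"}, db, search)
```

Incremental storage touches only the changed documents in the existing index, while full storage leaves the old index untouched until the new one is complete.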
The basic working principle of the invention is:
Based on an event-driven approach, Internet crawlers automatically crawl domestic official document publishing websites. Distributed crawlers fetch the relevant fields, body text, pictures, and attachments of each official document, and a Bloom filter enables incremental updates. The distributed unstructured database MongoDB provides targeted deduplication and mass storage. Distributed messaging solves the problem of real-time processing and calculation. A graph database and a distributed search engine support searching and displaying featured content.
The beneficial effects of the invention are:
1. The invention uses distributed crawlers to crawl each official document publishing website, effectively solving the timeliness problem of acquiring and processing massive multi-source heterogeneous official document data.
2. The invention promptly cleans up the various temporary data tables generated during acquisition and processing, solving problems such as system faults, wasted server resources, and dirty reads caused by unknown exceptions in the acquisition or processing flow.
3. The invention effectively solves the problem of storing official document data promptly and effectively.
Brief description of the drawings
Fig. 1 is a structural schematic diagram of the event-driven official document data acquisition and processing system of the invention.
Fig. 2 is a work flow diagram of the invention.
Fig. 3 is a detailed work flow diagram of the invention.
Detailed description of the embodiments
The technical solution of the invention is further described below, but the claimed scope is not limited to what is described.
Embodiment 1:
As shown in Fig. 1, an event-driven official document data acquisition and processing system comprises:
a data acquisition service module, for acquiring and storing multi-source heterogeneous official document data and, at the same time, issuing a cleaning instruction;
a data cleaning service module, for receiving and parsing the cleaning instruction issued by the data acquisition service module, determining whether incremental cleaning or full cleaning is required, and issuing a calculation message body;
a data extraction and calculation service module, for receiving and parsing the calculation message body issued by the data cleaning service module, performing extraction and calculation on the data, returning feedback, and sending a data index message body;
a data indexing service module, for receiving and parsing the data index message body sent by the data extraction and calculation service module, and determining whether incremental storage or full storage is required;
a log module, for recording the entire process by which the above modules acquire and process the official document data.
The data acquisition service module acquires official document data using a distributed crawler.
During acquisition the data acquisition module establishes a unique index from the title, URL, and publication time of the official document data.
While acquiring official document data the data acquisition module also generates a temporary table Crawler TMP for storing incremental data.
The data cleaning service module generates a temporary table Clean TMP after completing its work, for storing the incrementally cleaned data.
The extraction and calculation service module generates Calculate TMP after completing its work, for storing the incremental data.
As shown in Fig. 2, an event-driven official document data acquisition and processing method comprises the following steps:
S1. The data acquisition service module crawls official document data published on websites using a distributed crawler in a distributed data acquisition mode and stores it in a distributed unstructured database. It then establishes a unique index from the title, URL, and publication time, records each crawled URL in a Bloom filter, and stores the crawled official document data in the temporary table Crawler TMP. After a periodic crawl event of the distributed crawler ends, it builds either a message body naming the database and collection that hold the newly added data, or a full data cleaning message body, and sends it as a cleaning instruction. A Bloom filter can test whether an element is in a set, so recording URLs in it avoids the network bandwidth consumed by large-scale repeated crawling.
S2. The data cleaning service module receives the cleaning instruction from step S1, parses it, and determines whether full cleaning or incremental cleaning is required; after the cleaning event completes, it sends the corresponding calculation message body.
S3. The data extraction and calculation service module receives and parses the calculation message body sent in step S2, determines whether the extraction and calculation is full or incremental, and finally sends a data index message body.
S4. The data indexing service module receives and parses the data index message body sent in step S3, determines whether incremental storage or full storage is required, and stores the data in an Elasticsearch index database.
S5. The sequence of events generated in steps S1-S5 is recorded and stored in the log module in the form of logs.
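Step S5 can be illustrated with a small append-only event log. The `EventLog` class, the JSON-lines encoding, and the event names below are assumptions for illustration; the source only states that the event sequence is stored as logs.

```python
import json


class EventLog:
    """Append-only log of pipeline events, one JSON object per line."""

    def __init__(self):
        self.lines = []
        self._seq = 0

    def record(self, step, event, **details):
        # A sequence number preserves the order in which events occurred.
        self._seq += 1
        entry = {"seq": self._seq, "step": step, "event": event, **details}
        self.lines.append(json.dumps(entry, sort_keys=True))
        return entry

    def replay(self, step=None):
        """Return logged events in order, optionally filtered by step."""
        entries = [json.loads(line) for line in self.lines]
        return [e for e in entries if step is None or e["step"] == step]


log = EventLog()
log.record("S1", "crawl_finished", new_docs=2)
log.record("S2", "clean_finished", mode="incremental")
log.record("S2", "calculate_sent", mode="incremental")
```

Replaying the log by step, for example `log.replay("S2")`, shows how far a given batch progressed, which helps when deciding whether a message needs to be resent.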
The data cleaning service module in step S2 processes data as follows:
① Incremental cleaning: parse the cleaning instruction carrying the database-and-collection message body and trigger the incremental data cleaning service. After incremental cleaning completes, delete the data in the temporary table Crawler TMP from the distributed non-relational database that holds the original distributed crawler data, and reply to the distributed crawler microservice that the message has been consumed. If cleaning fails, do not delete the data in the temporary table Crawler TMP, and reply to the distributed crawler asking it to resend the cleaning message. Finally, store the incrementally cleaned data in the temporary table Clean TMP and send the calculation message body to step S3;
② Full cleaning: parse the cleaning instruction carrying the full data cleaning message body, trigger the full data cleaning service, and finally send the calculation message body to step S3.
The data extraction and calculation service module in step S3 processes data as follows:
a. When the extraction and calculation service module receives a full extraction and calculation message body, it first consumes the message and triggers extraction and calculation:
If the extracted fields do not involve the retrieval service, it sends the data index message body directly so that the data is indexed and stored, and at the same time sends a feedback message to the data cleaning service module;
If the extracted fields involve the retrieval service, it does not send the data index message body yet; it performs extraction and calculation and, only after that completes, sends the full data index message body to step S4;
b. When the extraction and calculation service module receives an incremental extraction and calculation message body, it first parses the message to obtain the names of the database and collection that hold the incremental data, and triggers extraction and calculation;
If an exception occurs during extraction and calculation, it sends feedback asking the data cleaning service to resend the extraction and calculation message. After the extraction and calculation service completes, it stores the incremental data in Calculate TMP, deletes the Clean TMP table generated in step ①, and finally sends the incremental data index message body to step S4.
The data indexing service module in step S4 operates as follows:
A. Incremental storage: parse the database and collection that hold the incremental data from the message body, insert or update the data in the existing index, and, after indexing completes, delete the temporary Calculate TMP generated in step b.
B. Full storage: parse the full data cleaning instruction in the message body, create a new index, and index all the data.
The working principle of this embodiment is as follows. Based on an event-driven approach, Internet crawlers automatically crawl domestic official document publishing websites. Distributed crawlers fetch the relevant fields, body text, pictures, and attachments of each official document, and a Bloom filter enables incremental updates. The distributed unstructured database MongoDB provides targeted deduplication and mass storage. Distributed messaging solves the problem of real-time processing and calculation. A graph database and a distributed search engine support searching and displaying featured content.

Claims (10)

1. An event-driven official document data acquisition and processing system, characterized by comprising:
a data acquisition service module, for acquiring and storing multi-source heterogeneous official document data and, at the same time, issuing a cleaning instruction;
a data cleaning service module, for receiving and parsing the cleaning instruction issued by the data acquisition service module, determining whether incremental cleaning or full cleaning is required, and issuing a calculation message body;
a data extraction and calculation service module, for receiving and parsing the calculation message body issued by the data cleaning service module, performing extraction and calculation on the data, returning feedback, and sending a data index message body;
a data indexing service module, for receiving and parsing the data index message body sent by the data extraction and calculation service module, and determining whether incremental storage or full storage is required;
a log module, for recording the entire process by which the above modules acquire and process the official document data.
2. The event-driven official document data acquisition and processing system according to claim 1, characterized in that: the data acquisition service module acquires official document data using a distributed crawler.
3. The event-driven official document data acquisition and processing system according to claim 1, characterized in that: during acquisition the data acquisition module establishes a unique index from the title, URL, and publication time of the official document data.
4. The event-driven official document data acquisition and processing system according to claim 1, characterized in that: while acquiring official document data the data acquisition module also generates a temporary table Crawler TMP for storing incremental data.
5. The event-driven official document data acquisition and processing system according to claim 1, characterized in that: the data cleaning service module generates a temporary table Clean TMP after completing its work, for storing the incrementally cleaned data.
6. The event-driven official document data acquisition and processing system according to claim 1, characterized in that: the extraction and calculation service module generates Calculate TMP after completing its work, for storing the incremental data.
7. An event-driven official document data acquisition and processing method based on the system of claim 1, characterized by comprising the following steps:
S1. The data acquisition service module crawls official document data published on websites using a distributed crawler in a distributed data acquisition mode and stores it in a distributed unstructured database; it then establishes a unique index from the title, URL, and publication time, records each crawled URL in a Bloom filter, and stores the crawled official document data in the temporary table Crawler TMP; after a periodic crawl event of the distributed crawler ends, it builds either a message body naming the database and collection that hold the newly added data, or a full data cleaning message body, and sends it as a cleaning instruction;
S2. The data cleaning service module receives the cleaning instruction from step S1, parses it, and determines whether full cleaning or incremental cleaning is required; after the cleaning event completes, it sends the corresponding calculation message body;
S3. The data extraction and calculation service module receives and parses the calculation message body sent in step S2, determines whether the extraction and calculation is full or incremental, and finally sends a data index message body;
S4. The data indexing service module receives and parses the data index message body sent in step S3, determines whether incremental storage or full storage is required, and stores the data in an Elasticsearch index database;
S5. The sequence of events generated in steps S1-S5 is recorded and stored in the log module in the form of logs.
8. The official document data acquisition and processing method according to claim 7, characterized in that the data cleaning service module in step S2 processes data as follows:
① Incremental cleaning: parse the cleaning instruction carrying the database-and-collection message body and trigger the incremental data cleaning service; after incremental cleaning completes, delete the data in the temporary table Crawler TMP from the distributed non-relational database that holds the original distributed crawler data, and reply to the distributed crawler microservice that the message has been consumed; if cleaning fails, do not delete the data in the temporary table Crawler TMP, and reply to the distributed crawler asking it to resend the cleaning message; finally, store the incrementally cleaned data in the temporary table Clean TMP and send the calculation message body to step S3;
② Full cleaning: parse the cleaning instruction carrying the full data cleaning message body, trigger the full data cleaning service, and finally send the calculation message body to step S3.
9. The official document data acquisition and processing method according to claim 8, characterized in that the data extraction and calculation service module in step S3 processes data as follows:
a. When the extraction and calculation service module receives a full extraction and calculation message body, it first consumes the message and triggers extraction and calculation:
If the extracted fields do not involve the retrieval service, it sends the data index message body directly so that the data is indexed and stored, and at the same time sends a feedback message to the data cleaning service module;
If the extracted fields involve the retrieval service, it does not send the data index message body yet; it performs extraction and calculation and, only after that completes, sends the full data index message body to step S4;
b. When the extraction and calculation service module receives an incremental extraction and calculation message body, it first parses the message to obtain the names of the database and collection that hold the incremental data, and triggers extraction and calculation;
If an exception occurs during extraction and calculation, it sends feedback asking the data cleaning service to resend the extraction and calculation message; after the extraction and calculation service completes, it stores the incremental data in Calculate TMP, deletes the Clean TMP table generated in step ①, and finally sends the incremental data index message body to step S4.
10. The official document data acquisition and processing method according to claim 9, characterized in that the data indexing service module in step S4 operates as follows:
A. Incremental storage: parse the database and collection that hold the incremental data from the message body, insert or update the data in the existing index, and, after indexing completes, delete the temporary Calculate TMP generated in step b.
B. Full storage: parse the full data cleaning instruction in the message body, create a new index, and index all the data.
CN201910271964.2A 2019-04-04 2019-04-04 Event-driven official document data acquisition and processing system and method Pending CN110147362A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910271964.2A CN110147362A (en) 2019-04-04 2019-04-04 Event-driven official document data acquisition and processing system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910271964.2A CN110147362A (en) 2019-04-04 2019-04-04 Event-driven official document data acquisition and processing system and method

Publications (1)

Publication Number Publication Date
CN110147362A true CN110147362A (en) 2019-08-20

Family

ID=67589343

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910271964.2A Pending CN110147362A (en) 2019-04-04 2019-04-04 Event-driven official document data acquisition and processing system and method

Country Status (1)

Country Link
CN (1) CN110147362A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112650865A (en) * 2021-01-27 2021-04-13 南威软件股份有限公司 Method and system for solving multi-region license data conflict based on flexible rule
CN113821573A (en) * 2021-08-27 2021-12-21 济南浪潮数据技术有限公司 Mass data rapid retrieval service construction method, system, terminal and storage medium

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102411569A (en) * 2010-09-20 2012-04-11 上海众融信息技术有限公司 Database conversion and cleaning information processing method
CN103617174A (en) * 2013-11-04 2014-03-05 同济大学 Distributed searching method based on cloud computing
CN104102737A (en) * 2014-07-28 2014-10-15 中国农业银行股份有限公司 Historical data storage method and system
CN104951512A (en) * 2015-05-27 2015-09-30 中国科学院信息工程研究所 Public sentiment data collection method and system based on Internet
CN105488187A (en) * 2015-12-02 2016-04-13 北京四达时代软件技术股份有限公司 Method and device for extracting multi-source heterogeneous data increment
CN106682153A (en) * 2016-12-23 2017-05-17 山东浪潮商用系统有限公司 Data extraction tool on basis of data modeling and data increment implementation
CN106776951A (en) * 2016-12-02 2017-05-31 航天星图科技(北京)有限公司 A cleaning, comparison and storage method
CN107103067A (en) * 2017-04-18 2017-08-29 北京思特奇信息技术股份有限公司 A data synchronization method and system based on a search engine
CN107480858A (en) * 2017-07-10 2017-12-15 武汉楚鼎信息技术有限公司 A kind of Aided intelligent decision-making and method based on the analysis of stock big data
CN107895009A (en) * 2017-11-10 2018-04-10 北京国信宏数科技有限责任公司 A distributed internet data acquisition method and system
CN107943991A (en) * 2017-12-01 2018-04-20 成都嗨翻屋文化传播有限公司 A distributed crawler framework and implementation method based on an in-memory database
CN108121706A (en) * 2016-11-28 2018-06-05 央视国际网络无锡有限公司 An optimization method for a distributed crawler
CN108228815A (en) * 2017-12-29 2018-06-29 安徽迈普德康信息科技有限公司 A real estate data integration system and method

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112650865A (en) * 2021-01-27 2021-04-13 南威软件股份有限公司 Method and system for solving multi-region license data conflict based on flexible rule
CN112650865B (en) * 2021-01-27 2021-11-09 南威软件股份有限公司 Method and system for solving multi-region license data conflict based on flexible rule
CN113821573A (en) * 2021-08-27 2021-12-21 济南浪潮数据技术有限公司 Mass data rapid retrieval service construction method, system, terminal and storage medium

Similar Documents

Publication Title
CN109543086B (en) Network data acquisition and display method oriented to multiple data sources
CN100471121C (en) Decoding method and decoder
CN103294732B (en) Webpage capture method and crawler
US9552435B2 (en) Method and system for incremental collection of forum replies
CN107895009A (en) A distributed internet data acquisition method and system
CN105320740B (en) Method and system for acquiring WeChat articles and official accounts
CN102760151B (en) Implementation method of open source software acquisition and searching system
CN105956175A (en) Webpage content crawling method and device
Qiu et al. Analysis of user web traffic with a focus on search activities.
CN110147362A (en) An event-driven document data acquisition and processing system and method
CN103177380A (en) Method and device for optimizing advertisement delivery effect by combining user groups and pre-delivery
CN101727486A (en) Web forum information extraction system
CN108924199A (en) Method, apparatus, computer storage medium and terminal device for a crawler to automatically obtain network proxy servers
CN108959539B (en) Rule-configurable webpage data analysis method
CN108133041A (en) Data collecting system and method based on web crawlers and data transfer technology
CN109150585A (en) A network O&M fault resolution method, system, device and storage medium
CN107766234A (en) Method, apparatus and system for assessing webpage health based on mobile devices
CN110417873A (en) A network information extraction system that records webpage interactive operations
CN113259467A (en) Webpage asset fingerprint tag identification and discovery method based on big data
CN110059085A (en) A Web 2.0-oriented JSON data parsing and modeling method
CN113157521A (en) Monitoring method and monitoring system for whole life cycle of block chain
CN105825399A (en) Internet based B2B e-commerce information collecting method
CN113760878A (en) Micro-service architecture log analysis method and system based on domestic CPU and operating system
CN103279527B (en) A kind of user interest network address method for digging and device
CN103902707B (en) Method for filtering "junk" content in an expert-system URL cleaning knowledge base

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190820
