Summary of the invention
In view of this, the present invention provides an ETL-based multithreaded data processing method to solve the problems of low hardware resource utilization, low data throughput and low speed in the prior art. The technical solution is as follows:
An ETL-based multithreaded data processing method comprises: dividing the ETL data extraction process into three distinct stages, namely extraction, sending and synchronization, and using separate, independent threads to execute the following four steps in parallel:
Step 10: the data extraction unit starts an extraction thread, which extracts source table data in real time according to a rule, encapsulates the data and stores it in a to-be-sent message queue; if an error occurs during extraction, the erroneous data is sent to an error data message queue;
Step 11: the data sending unit starts a sending thread, which polls the to-be-sent message queue in a loop; when the queue contains data to be sent, the data is sent to a to-be-synchronized message queue; if an error occurs during sending, the erroneous data is sent to the error data message queue;
Step 12: the data synchronization unit starts a synchronization thread, which polls the to-be-synchronized message queue in a loop; when the queue contains data to be synchronized, the data is parsed and the destination table data is synchronized; if an error occurs during synchronization, the erroneous data is sent to the error data message queue;
Step 13: the error data persistence unit starts a persistence thread, which polls the error data message queue in a loop; when the queue contains erroneous data, the erroneous data is saved according to the cause of the error, namely an extraction error, a sending error or a synchronization error.
Preferably, in the above method, extracting the source table data in real time according to a rule in step 10 is specifically:
extracting the source table data in real time by means of a Structured Query Language (SQL) batch query statement.
Preferably, in the above method, sending the data to the to-be-synchronized message queue when the to-be-sent message queue contains data to be sent, in step 11, is specifically:
sending the data to be sent to the to-be-synchronized message queue over the HTTP protocol or the Transmission Control Protocol (TCP).
Preferably, in the above method, parsing the data and synchronizing the destination table data when the to-be-synchronized message queue contains data to be synchronized, in step 12, is specifically:
parsing the data to be synchronized and synchronizing the destination table data in SQL batch mode.
Preferably, in the above method, the specific operations for synchronizing the destination table data comprise:
inserting, updating and deleting data.
Preferably, in the above method, after the erroneous data is saved in step 13 according to the cause of the error, namely an extraction error, a sending error or a synchronization error, the method further comprises:
synchronizing the erroneous data according to a preset data synchronization mode.
It can be seen from the above technical solution that the present invention divides the ETL data extraction process into three distinct stages, namely extraction, sending and synchronization, and uses separate, independent threads to carry out in parallel the extraction, sending and synchronization of data and the handling of erroneous data. This significantly improves data throughput and extraction speed, as well as hardware resource utilization. In addition, by handling the erroneous data produced during extraction, sending and synchronization, the invention improves fault tolerance and reduces the probability that an error during data extraction paralyzes the entire ETL process.
Detailed description of the embodiments
The embodiment of the invention discloses an ETL-based multithreaded data processing method, comprising: dividing the ETL data extraction process into three distinct stages, namely extraction, sending and synchronization, and using separate, independent threads to execute the following four steps in parallel:
Step 10: the data extraction unit starts an extraction thread, which extracts source table data in real time according to a rule, encapsulates the data and stores it in a to-be-sent message queue; if an error occurs during extraction, the erroneous data is sent to an error data message queue;
Step 11: the data sending unit starts a sending thread, which polls the to-be-sent message queue in a loop; when the queue contains data to be sent, the data is sent to a to-be-synchronized message queue; if an error occurs during sending, the erroneous data is sent to the error data message queue;
Step 12: the data synchronization unit starts a synchronization thread, which polls the to-be-synchronized message queue in a loop; when the queue contains data to be synchronized, the data is parsed and the destination table data is synchronized; if an error occurs during synchronization, the erroneous data is sent to the error data message queue;
Step 13: the error data persistence unit starts a persistence thread, which polls the error data message queue in a loop; when the queue contains erroneous data, the erroneous data is saved according to the cause of the error, namely an extraction error, a sending error or a synchronization error.
The present invention divides the ETL data extraction process into three distinct stages, namely extraction, sending and synchronization, and uses separate, independent threads to carry out in parallel the extraction, sending and synchronization of data and the saving of erroneous data. This significantly improves data throughput and extraction speed, as well as hardware resource utilization. In addition, by handling the erroneous data produced during extraction, sending and synchronization, the invention improves fault tolerance and reduces the probability that an error during data extraction paralyzes the entire ETL process.
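As an illustrative sketch only, and not the implementation of the invention, the four steps could be arranged in Python roughly as follows. The names run_pipeline, send, apply_row and the STOP sentinel are assumptions introduced for this example. In-process queue.Queue objects stand in for the three message queues, and each thread blocks on its queue instead of polling it.

```python
import queue
import threading

STOP = object()  # hypothetical sentinel, used only so this sketch can terminate


def run_pipeline(rows, send=lambda packet: packet, apply_row=lambda packet: None):
    """Run the extract, send, sync and error-persistence steps on four threads."""
    to_send, to_sync, errors = queue.Queue(), queue.Queue(), queue.Queue()
    synced, saved_errors = [], []

    def extract():  # step 10
        for row in rows:
            try:
                to_send.put({"payload": dict(row)})  # encapsulate the row
            except Exception as exc:
                errors.put(("extract", row, str(exc)))
        to_send.put(STOP)

    def sender():  # step 11
        while (item := to_send.get()) is not STOP:
            try:
                to_sync.put(send(item))
            except Exception as exc:
                errors.put(("send", item, str(exc)))
        to_sync.put(STOP)

    def syncer():  # step 12
        while (item := to_sync.get()) is not STOP:
            try:
                apply_row(item)
                synced.append(item)
            except Exception as exc:
                errors.put(("sync", item, str(exc)))
        # The sentinel chain guarantees all earlier stages have finished here.
        errors.put(STOP)

    def persist_errors():  # step 13
        while (item := errors.get()) is not STOP:
            saved_errors.append(item)  # grouped by cause in item[0]

    threads = [threading.Thread(target=fn)
               for fn in (extract, sender, syncer, persist_errors)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return synced, saved_errors
```

The STOP sentinel exists only so that the example ends; the method as described keeps each thread looping over its queue for as long as the ETL process runs.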
To enable those skilled in the art to better understand and implement the present invention, the technical solution of the embodiment is described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of the architecture of the ETL-based multithreaded data processing method provided by the invention. The invention parallelizes the ETL data extraction process by means of the data extraction unit, data sending unit, data synchronization unit and error data persistence unit in the architecture. The specific process is as follows:
The ETL data extraction process is divided into three distinct stages, namely extraction, sending and synchronization, and separate, independent threads execute the following four steps in parallel:
Step 10: the data extraction unit starts an extraction thread, which extracts source table data in real time according to a rule, encapsulates the data and stores it in the to-be-sent message queue; if an error occurs during extraction, the erroneous data is sent to the error data message queue.
The data extraction unit reads, in a loop, the data in the source table that conforms to the rule, encapsulates the data into a data packet and sends the packet to the to-be-sent message queue; if an error occurs during encapsulation, the packet is stored in the error data message queue. Specifically, an SQL batch query statement may be used to read the conforming data from the source table.
The data extraction unit can guarantee that extraction is real-time: extracted data is encapsulated and saved to the message queue as soon as it is read, and the extraction process is not affected by the sending and synchronization processes.
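For illustration, reading conforming rows with a batch query might look like the following Python sketch. It assumes a SQLite source table named source_table with columns id and name, and treats "name IS NOT NULL" as the extraction rule. The table, the columns, the rule and the function name extract_batches are all hypothetical and do not come from the invention.

```python
import queue
import sqlite3


def extract_batches(conn: sqlite3.Connection,
                    out_queue: queue.Queue,
                    error_queue: queue.Queue,
                    batch_size: int = 2) -> None:
    """Read rule-conforming source rows in batches, one packet per batch."""
    # Hypothetical rule: only rows with a non-null name are extracted.
    cursor = conn.execute(
        "SELECT id, name FROM source_table WHERE name IS NOT NULL ORDER BY id"
    )
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        try:
            # Encapsulate the batch into a packet and queue it for sending.
            packet = {
                "table": "source_table",
                "rows": [{"id": r[0], "name": r[1]} for r in rows],
            }
            out_queue.put(packet)
        except Exception as exc:
            # Encapsulation failed: route the batch to the error queue.
            error_queue.put(
                {"stage": "extract", "rows": rows, "reason": str(exc)}
            )
```

Because the function only writes to queues, it does not wait for the sending or synchronization stages, which is the decoupling the paragraph above describes.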
Step 11: the data sending unit starts a sending thread, which polls the to-be-sent message queue in a loop; when the queue contains data to be sent, the data is sent to the to-be-synchronized message queue; if an error occurs during sending, the erroneous data is sent to the error data message queue.
The data sending unit reads, in a loop, the data in the to-be-sent message queue and transmits it to the to-be-synchronized message queue; if an error occurs during sending, the erroneous data is stored in the error data message queue. Specifically, the data may be sent to the to-be-synchronized message queue over the HTTP protocol or the TCP protocol.
The data sending unit only needs to move data from the to-be-sent message queue to the to-be-synchronized message queue, which guarantees that data transmission is real-time.
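One possible way to carry packets over TCP is to frame each packet with a length prefix, as in the Python sketch below. This is an example under stated assumptions: the JSON encoding, the 4-byte big-endian length header and the function names send_packet, recv_packet and drain_and_send are not taken from the invention, which only states that HTTP or TCP may be used.

```python
import json
import queue
import socket
import struct


def send_packet(sock: socket.socket, packet: dict) -> None:
    """Send one packet as a 4-byte big-endian length followed by JSON bytes."""
    body = json.dumps(packet).encode("utf-8")
    sock.sendall(struct.pack(">I", len(body)) + body)


def _recv_exact(sock: socket.socket, size: int) -> bytes:
    """Read exactly `size` bytes, since recv may return fewer."""
    buf = b""
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf


def recv_packet(sock: socket.socket) -> dict:
    """Receive one length-prefixed JSON packet on the synchronization side."""
    (length,) = struct.unpack(">I", _recv_exact(sock, 4))
    return json.loads(_recv_exact(sock, length).decode("utf-8"))


def drain_and_send(to_send: queue.Queue, sock: socket.socket,
                   error_queue: queue.Queue) -> None:
    """Send every queued packet; route failures to the error queue."""
    while True:
        try:
            packet = to_send.get_nowait()
        except queue.Empty:
            return
        try:
            send_packet(sock, packet)
        except (OSError, TypeError, ValueError) as exc:
            error_queue.put(
                {"stage": "send", "packet": packet, "reason": str(exc)}
            )
```

On the receiving side, each packet returned by recv_packet would be placed on the to-be-synchronized message queue. A sending thread would call drain_and_send repeatedly inside its loop.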
Step 12: the data synchronization unit starts a synchronization thread, which polls the to-be-synchronized message queue in a loop; when the queue contains data to be synchronized, the data is parsed and the destination table data is synchronized; if an error occurs during synchronization, the erroneous data is sent to the error data message queue.
The data synchronization unit reads, in a loop, the data in the to-be-synchronized message queue; when the queue contains data to be synchronized, the data is parsed and the destination table data is synchronized; if an error occurs during synchronization, the erroneous data is sent to the error data message queue. Specifically, the data to be synchronized may be parsed and the destination table data synchronized in SQL batch mode. The specific operations for synchronizing the destination table data include inserting, updating and deleting data.
The data synchronization unit writes the data to be synchronized into the data table of the destination database, which guarantees that data synchronization is real-time.
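The insert, update and delete operations in SQL batch mode can be sketched in Python as follows. The sketch assumes a SQLite destination table named dest_table with columns id and name, and a packet layout with "insert", "update" and "delete" lists. The table, the packet layout and the function names apply_packet and sync_packet are hypothetical.

```python
import queue
import sqlite3


def apply_packet(conn: sqlite3.Connection, packet: dict) -> None:
    """Apply a packet of insert/update/delete rows in a single transaction."""
    inserts = [(r["id"], r["name"]) for r in packet.get("insert", [])]
    updates = [(r["name"], r["id"]) for r in packet.get("update", [])]
    deletes = [(r["id"],) for r in packet.get("delete", [])]
    with conn:  # commits on success, rolls back if any statement fails
        conn.executemany(
            "INSERT INTO dest_table (id, name) VALUES (?, ?)", inserts)
        conn.executemany(
            "UPDATE dest_table SET name = ? WHERE id = ?", updates)
        conn.executemany(
            "DELETE FROM dest_table WHERE id = ?", deletes)


def sync_packet(conn: sqlite3.Connection, packet: dict,
                error_queue: queue.Queue) -> None:
    """Synchronize one packet; on failure, send it to the error queue."""
    try:
        apply_packet(conn, packet)
    except (sqlite3.Error, KeyError) as exc:
        error_queue.put(
            {"stage": "sync", "packet": packet, "reason": str(exc)}
        )
```

Running each packet in one transaction is a design choice of this sketch: a packet that fails part-way leaves the destination table unchanged, so the whole packet can be retried later from the saved error record.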
Step 13: the error data persistence unit starts a persistence thread, which polls the error data message queue in a loop; when the queue contains erroneous data, the erroneous data is saved according to the cause of the error, namely an extraction error, a sending error or a synchronization error.
The error data persistence unit reads, in a loop, the data in the error data message queue and saves it separately according to the type of error. Specifically, the data may be saved as files or in a database. After the erroneous data is saved, it may also be synchronized according to a preset data synchronization mode, for example manually.
The error data persistence unit allows data that failed during multithreaded synchronization to be synchronized by other means, which safeguards the security of the ETL data conversion and the integrity of the data.
It can be seen from the above embodiment that the embodiment of the invention uses a multithreaded processing architecture to parallelize the ETL data extraction process. Specifically, the ETL data extraction process is divided into three distinct stages, namely extraction, sending and synchronization, and separate, independent threads carry out in parallel the extraction, sending and synchronization of data and the handling of erroneous data. This significantly improves data throughput and extraction speed, as well as hardware resource utilization. In addition, by handling the erroneous data produced during extraction, sending and synchronization, the embodiment improves fault tolerance and reduces the probability that an error during data extraction paralyzes the entire ETL process.
From the description of the above method embodiment, those skilled in the art can clearly understand that the present invention may be implemented by software together with a necessary general-purpose hardware platform, or alternatively by hardware, although the former is the better implementation in many cases. Based on this understanding, the essence of the technical solution of the invention, or the part that contributes over the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform all or part of the steps of the methods described in the embodiments of the invention. The storage medium includes any medium capable of storing program code, such as read-only memory (ROM), random-access memory (RAM), a magnetic disk or an optical disc.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.