CN104281636A - Concurrent distributed processing method for mass report data

Info

Publication number: CN104281636A
Application number: CN201410187511.9A
Authority: CN
Inventors: 谭映忠; 张克慧; 刘新宇; 刘畅; 关丹凤; 王亮; 陈璇; 郭磊
Original assignee: Shenhua Group Corp Ltd
Current assignee: China Energy Investment Corp Ltd
Priority date: 2014-05-05
Filing date: 2014-05-05
Publication date: 2015-01-14
Anticipated expiration: 2034-05-05
Also published as: CN104281636B

Abstract

The invention discloses a concurrent distributed processing method for mass report data. The method comprises: acquiring report data; generating a report data formula set and splitting it by lines into a plurality of formula set fragments, each of which comprises multiple lines of report data formulas; pushing the report data to each computer node in a computer cluster; allocating the operations on the formula set fragments to a plurality of computer nodes in the computer cluster for processing; saving a state snapshot of the processing on the computer nodes; and, when the operation on any formula set fragment is interrupted, restoring the pre-interruption operation state from the state snapshot and resuming the interrupted operation. Because the report data formula set is split into fragments, the formula set can be processed on different processing nodes fragment by fragment. Each processing node handles only part of the formula set fragments and the corresponding report data, which greatly increases the processing efficiency of the report data.

Description

Concurrent distributed processing method for mass report data

Technical field

The present invention relates to the field of data processing and, in particular, to a concurrent distributed processing method for mass report data.

Background art

At present, report data is generally processed with traditional non-distributed computing techniques. Such techniques are suitable only for small amounts of report data. As the volume of report data keeps growing and reaches a huge or even massive scale, processing it in the traditional way exposes several drawbacks. First, traditional non-distributed computing places high demands on both the software and the hardware platform, which puts heavy cost pressure on the user. Second, even if the user is willing to bear the high cost, in most cases the processing speed and efficiency of traditional non-distributed computing are low. Processing some report data can take hours or even days to complete.

Currently, a small number of domestic enterprises also use traditional cluster computing to process report data. That is, several computing nodes (servers) with high-performance computing capability are combined into a computing cluster, and the computing nodes in the cluster share the computational load of the whole system.

Traditional cluster computing can partly improve the processing speed and efficiency for report data. However, its processing principle is to push the entire report data to every computing node for computation. This places high hardware demands on each computing node in the cluster and cannot make full use of each node's computing capability. Moreover, once the report data grows to a certain scale (huge or massive), processing efficiency and speed again hit a bottleneck: with massive data, adding high-performance computing nodes to the cluster does not improve the processing efficiency of the whole system.

The prior art offers no good solution to the above problems.

Summary of the invention

The object of the present invention is to provide a method by which mass report data can be processed quickly.

To achieve this object, the invention provides a concurrent distributed processing method for mass report data, the method comprising: acquiring report data; generating a report data formula set and splitting the generated formula set by lines into a plurality of formula set fragments, wherein each formula set fragment comprises multiple lines of report data formulas; pushing the report data to each computer node in a computer cluster; allocating the operations on the formula set fragments to a plurality of computer nodes in the computer cluster for processing; saving a state snapshot of the processing on the plurality of computer nodes; and, when the operation on any formula set fragment is interrupted, restoring the pre-interruption operation state from the state snapshot and continuing the interrupted operation.

Further, the processing by the plurality of computer nodes comprises formula operations and a first-level merge of the operation results, to obtain a plurality of first-level merge results; and the method further comprises: performing a second-level merge on the plurality of first-level merge results; and outputting the final data result of the second-level merge to a target application.

Further, the step of generating the report data formula set comprises generating a report data check formula set, and generating a report data conversion formula set for the report data that has passed the check.

Further, the method also comprises: performing heartbeat detection on the plurality of computer nodes; and reallocating the operations assigned to a computer node that does not respond to heartbeat detection to other computer nodes.

Further, the method also comprises: saving the processing results of the plurality of computer nodes in a shared memory connected to all computer nodes in the computer cluster.

Further, the method also comprises: after all computer nodes operating on the current formula set fragment have completed processing, allocating the operation on the next formula set fragment.

Further, the method also comprises: allocating the operations on the formula set fragments according to a greedy algorithm.

Further, the method also comprises: after the operation on the last formula set fragment is completed, outputting the operation result.

Further, the computer cluster consists of computer nodes on which a cloud computing platform is deployed.

Further, the cloud computing platform is the HADOOP cloud computing platform.

Further, the computer nodes are LINUX system servers.

With the above technical scheme, the report data formula set is split into formula set fragments so that the formula set is processed on different processing nodes in the form of fragments. Each processing node handles only part of the formula set fragments and the corresponding report data, which greatly improves the processing efficiency of the report data.

Other features and advantages of the invention are described in detail in the detailed description that follows.

Brief description of the drawings

The accompanying drawings provide a further understanding of the invention and form part of the specification. Together with the embodiments below they serve to explain the invention, but they do not limit it. In the drawings:

Fig. 1 is a flow chart of the concurrent distributed processing method for mass report data according to an embodiment of the invention;

Fig. 2 is a flow chart of the concurrent distributed processing method for mass report data according to an exemplary embodiment of the invention;

Fig. 3 is a flow chart of the concurrent distributed processing method for mass report data according to an exemplary embodiment of the invention;

Fig. 4 is a flow chart of the concurrent distributed processing method for mass report data according to an exemplary embodiment of the invention;

Fig. 5, Fig. 6 and Fig. 7 illustrate an example run of the concurrent distributed processing method for mass report data according to an exemplary embodiment of the invention.

Detailed description

Specific embodiments of the invention are described in detail below with reference to the drawings. It should be understood that the embodiments described here only illustrate and explain the invention and do not limit it.

Report data processing mainly takes two forms: report data checking and report data conversion. Report data checking means examining the accuracy and rigor of the report data and finding doubtful data, so as to guarantee the accuracy of the report data. Report data conversion means extracting specified data from certain specific reports, then computing and processing that data accordingly to convert it into data of a specific format. Both checking and conversion involve a large number of complex calculations.

To improve the efficiency of these complex calculations, the invention provides a concurrent distributed processing method for mass report data. As shown in Fig. 1, the method comprises: S101, acquiring report data; S102, generating a report data formula set and splitting it by lines into a plurality of formula set fragments, wherein each formula set fragment comprises multiple lines of report data formulas; S103, pushing the report data to each computer node in a computer cluster; S104, allocating the operations on the formula set fragments to a plurality of computer nodes in the computer cluster for processing; S105, saving a state snapshot of the processing on the plurality of computer nodes; and S106, when the operation on any formula set fragment is interrupted, restoring the pre-interruption operation state from the state snapshot and continuing the interrupted operation.
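Step S102 can be illustrated with a minimal Python sketch. It assumes the formula set is simply a list with one formula per line; the function name `split_formula_set`, the parameter `lines_per_fragment`, and the placeholder formula strings are hypothetical and are not taken from the patent.

```python
from typing import List


def split_formula_set(formulas: List[str], lines_per_fragment: int) -> List[List[str]]:
    """Cut a formula set into fragments of consecutive lines (hypothetical helper)."""
    if lines_per_fragment < 1:
        raise ValueError("lines_per_fragment must be at least 1")
    return [
        formulas[start:start + lines_per_fragment]
        for start in range(0, len(formulas), lines_per_fragment)
    ]


# Nine placeholder formula lines, cut into fragments of three lines each.
formula_set = [f"formula_{n}" for n in range(1, 10)]
fragments = split_formula_set(formula_set, 3)
print(fragments)
```

Each fragment keeps its lines in their original order, so concatenating the fragments gives back the original formula set.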

With the above technical scheme, the report data formula set is split into formula set fragments so that the formula set is processed on different processing nodes in the form of fragments. Each processing node handles only part of the formula set fragments and the corresponding report data, which greatly improves the processing efficiency of the report data. When the operation on a formula set fragment fails, execution can continue from the failed fragment without restarting from the initial state and without repeating the operations on the fragments that did not fail, which improves task processing efficiency and reduces wasted resources. In an embodiment, the report data can be pushed to each computer node in the computer cluster as soon as it has been acquired, so that once the operations on the formula set fragments are allocated, a computer node can start computing immediately without waiting for data.

In an embodiment, the operation results of each computer node can be returned over the network to be gathered and merged. However, when the volume of operation results is very large, the network may be unable to transfer it quickly and becomes the bottleneck. Therefore, in a preferred embodiment, the processing by the plurality of computer nodes can comprise formula operations and a first-level merge of the operation results, to obtain a plurality of first-level merge results; and the method can further comprise: performing a second-level merge on the plurality of first-level merge results; and outputting the final data result of the second-level merge (for example, credential data) to a target application. Because the formula set is processed on different processing nodes, the results of that processing need to be merged. Since one processing node may produce several results that can be merged, the merging can include gathering and merging within one processing node as well as gathering and merging the results of several processing nodes. The merging can be carried out by one or more designated or idle computers selected from the computer cluster.
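The two-level merge can be sketched as follows. This is only an illustration under the assumption that each partial result is an already sorted list of values; the patent does not specify the merge operation, and the names `first_level_merge` and `second_level_merge` are hypothetical.

```python
import heapq
from typing import Iterable, List


def first_level_merge(partial_results: Iterable[List[int]]) -> List[int]:
    """Merge the sorted partial results produced on a single node."""
    return list(heapq.merge(*partial_results))


def second_level_merge(node_results: Iterable[List[int]]) -> List[int]:
    """Merge the first-level results of several nodes into the final result."""
    return list(heapq.merge(*node_results))


# Each node merges its own partial results locally (first level) ...
node_a = first_level_merge([[1, 4], [2, 9]])
node_b = first_level_merge([[3, 8], [5, 7]])
# ... and only the merged per-node results are combined (second level).
final_result = second_level_merge([node_a, node_b])
print(final_result)
```

Merging locally first means each node sends one combined result instead of many small ones, which is the motivation the text gives for the first-level merge.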

In an embodiment, the method can further comprise: performing heartbeat detection on the plurality of computer nodes; and reallocating the operations assigned to a computer node that does not respond to heartbeat detection to other computer nodes. Heartbeat detection makes it possible to determine the working state of the computer nodes that are performing operations. To ensure that the operations assigned to every computer node can be completed, when a computer node does not respond to heartbeat detection, the operations assigned to it can be reallocated to other computer nodes whose heartbeat is normal. Preferably, the node that receives the reallocated operations is one that has already completed its own assigned operations, so that idle computing resources are fully used.
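A minimal sketch of heartbeat-based reallocation is given below. It assumes that the controller records the time of the last heartbeat reply per node and treats a node as unresponsive after a fixed timeout; the timeout rule, the function name `reassign_unresponsive`, and the node and fragment names are all illustrative assumptions rather than details from the patent.

```python
from typing import Dict, List


def reassign_unresponsive(
    assignments: Dict[str, List[str]],
    last_heartbeat: Dict[str, float],
    now: float,
    timeout: float,
) -> Dict[str, List[str]]:
    """Move work from nodes whose heartbeat is overdue to the least loaded live nodes."""
    alive = [
        node for node in assignments
        if now - last_heartbeat.get(node, float("-inf")) <= timeout
    ]
    if not alive:
        raise RuntimeError("no responsive node left")
    result = {node: list(assignments[node]) for node in alive}
    for node, tasks in assignments.items():
        if node in result:
            continue
        for task in tasks:
            # Prefer the live node with the fewest pending operations (idle nodes first).
            target = min(alive, key=lambda n: len(result[n]))
            result[target].append(task)
    return result


assignments = {"node1": ["frag1"], "node2": ["frag2", "frag3"], "node3": []}
last_heartbeat = {"node1": 100.0, "node2": 80.0, "node3": 99.0}
new_plan = reassign_unresponsive(assignments, last_heartbeat, now=101.0, timeout=10.0)
print(new_plan)
```

Here `node2` last replied 21 time units ago, so its two fragments move to the live nodes, starting with the idle `node3`.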

To make it easier to call formulas and/or formula sets and to gather operation results, in an embodiment the method can further comprise: saving the processing results of the plurality of computer nodes in a shared memory connected to all computer nodes in the computer cluster.

In an embodiment, the method can further comprise: after all computer nodes operating on the current formula set fragment have completed processing, allocating the operation on the next formula set fragment, so that all computer node resources are used to process each formula set fragment and processing speed improves.
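The fragment-by-fragment scheduling described above can be sketched with a thread pool standing in for the computer cluster. This is an assumption made for illustration only: worker threads play the role of computer nodes, and `evaluate` is a hypothetical stand-in for a real formula operation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def evaluate(formula: str) -> int:
    """Hypothetical stand-in for a formula operation: returns the formula length."""
    return len(formula)


def run_fragments(fragments: List[List[str]], workers: int = 4) -> List[List[int]]:
    """Process one fragment at a time, using all workers for each fragment."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for fragment in fragments:
            # list() drains the iterator, so the loop only moves on to the next
            # fragment after every formula of the current one has finished.
            results.append(list(pool.map(evaluate, fragment)))
    return results


fragment_results = run_fragments([["a", "bb"], ["ccc"]])
print(fragment_results)
```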

Fig. 2 is a flow chart of the concurrent distributed processing method for mass report data according to an exemplary embodiment of the invention. As shown in Fig. 2, the method according to this exemplary embodiment can comprise the following. According to the form of report data processing, the processing is divided into different processing types, for example "data check" and "data conversion". Grouping rules for the data are set according to these two processing types, namely a "data check grouping rule" and a "data conversion grouping rule", that is, conditions for checking or converting the data. Preferably, it can be arranged that data which has passed the check is converted. The data can then be split into a plurality of data fragments according to the configured grouping rule. (Under the "data check grouping rule", the configured "data check formulas" are split; a data check formula can be built according to the way financial data is verified, to determine whether the data is accurate. Under the "data conversion grouping rule", the configured "data conversion formulas" are split.) The data fragments produced by the split, together with the original report data, are allocated to a plurality of computer nodes in the computer cluster for distributed parallel computation, which forms intermediate result sets. The intermediate result sets are then gathered according to the respective grouping rule (for example, under the "data check grouping rule" the information on failed checks can be gathered; under the "data conversion grouping rule" the converted data sets can be gathered) to form the final data result. Finally, the final data result is output in a unified way to a target application, which uses it to produce reports or data error reports.

In the embodiment shown in Fig. 2, viewed from the data set as a whole, the data is divided by grouping rule into data that needs checking and data that needs conversion; viewed from an individual data item, the item is first checked and, once it passes the check, is then converted. Therefore, in an embodiment, the step of generating the report data formula set can comprise generating a report data check formula set, and generating a report data conversion formula set for the report data that has passed the check.
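The check-then-convert order can be sketched as follows. The row layout (`debit`, `credit`), the rule name `balanced`, and the conversion function are invented for illustration; the patent does not give concrete check or conversion formulas.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, int]


def check_then_convert(
    rows: List[Row],
    checks: Dict[str, Callable[[Row], bool]],
    convert: Callable[[Row], Row],
) -> Tuple[List[Row], List[Tuple[Row, List[str]]]]:
    """Convert only the rows that pass every check; collect the failures separately."""
    converted, failures = [], []
    for row in rows:
        failed = [name for name, rule in checks.items() if not rule(row)]
        if failed:
            failures.append((row, failed))  # feeds a data error report
        else:
            converted.append(convert(row))
    return converted, failures


rows = [{"debit": 100, "credit": 100}, {"debit": 50, "credit": 40}]
checks = {"balanced": lambda r: r["debit"] == r["credit"]}
converted, failures = check_then_convert(rows, checks, lambda r: {"amount": r["debit"]})
print(converted, failures)
```

The second row fails the hypothetical `balanced` check, so it is reported rather than converted.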

For the same data, the method provided by the invention can be carried out step by step in the form of a task chain. The data set may be massive, and the amount of computation needed to check and/or convert it may be equally massive. Such a workload is hard for ordinary systems and hardware to complete and generally requires high-performance hardware, which inevitably requires high investment. Embodiments of the invention therefore provide the following method to resolve the conflict between large data volumes and hardware of modest performance.

Fig. 3 is a flow chart of the concurrent distributed processing method for mass report data according to an exemplary embodiment of the invention. As shown in Fig. 3, the method provided by an embodiment of the invention can comprise: S301, dividing each step into a plurality of subtask nodes; S302, allocating the operations required to complete the subtask nodes to a plurality of computer nodes in the computer cluster for processing; S303, saving a state snapshot of the processing on the plurality of computer nodes; and S304, when a subtask node is interrupted, restoring the pre-interruption state of the subtask node from the state snapshot and continuing to execute the interrupted subtask node.

With the above technical scheme, the operations required to complete one step of the task chain are allocated to a plurality of computer nodes, so the workload is broken into parts and the task runs faster. Because a state snapshot of the processing is saved, a task chain step that is interrupted for any reason can be restored to its pre-interruption state and the task chain can continue from that state. Therefore, when a task chain step or node fails, the task chain can continue from the failed step or node without restarting from the initial state, which improves task processing efficiency and reduces wasted resources.
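The snapshot-and-resume behaviour of S303 and S304 can be sketched as below. This sketch assumes that the snapshot is a small JSON file holding the index of the next formula and the results so far; the file layout, the function `run_with_snapshot`, and the `fail_at` parameter (used only to simulate an interruption) are hypothetical.

```python
import json
import os
import tempfile


def run_with_snapshot(formulas, snapshot_path, evaluate, fail_at=None):
    """Evaluate formulas in order, saving a snapshot after each one."""
    state = {"next_index": 0, "results": []}
    if os.path.exists(snapshot_path):
        # Restore the state that was saved before the interruption.
        with open(snapshot_path) as f:
            state = json.load(f)
    for i in range(state["next_index"], len(formulas)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated interruption")
        state["results"].append(evaluate(formulas[i]))
        state["next_index"] = i + 1
        with open(snapshot_path, "w") as f:
            json.dump(state, f)
    return state["results"]


calls = []


def evaluate(formula):
    """Hypothetical formula operation that records which formulas were computed."""
    calls.append(formula)
    return formula.upper()


path = os.path.join(tempfile.mkdtemp(), "fragment_0.json")
formulas = ["a", "b", "c"]
try:
    run_with_snapshot(formulas, path, evaluate, fail_at=2)  # interrupted at the third formula
except RuntimeError:
    pass
results = run_with_snapshot(formulas, path, evaluate)  # resumes from the snapshot
print(results, calls)
```

After the resume, `calls` shows that each formula was evaluated exactly once, so the work completed before the interruption is not repeated.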

In a preferred embodiment, the above method can further comprise: performing heartbeat detection on the plurality of computer nodes; and reallocating the operations assigned to a computer node that does not respond to heartbeat detection to other computer nodes. Heartbeat detection makes it possible to determine the working state of the computer nodes that are performing operations. To ensure that the operations assigned to every computer node can be completed, when a computer node does not respond to heartbeat detection, the operations assigned to it can be reallocated to other computer nodes whose heartbeat is normal. Preferably, the node that receives the reallocated operations is one that has already completed its own assigned operations, so that idle computing resources are fully used.

To enable the computer nodes that perform the operations of each task node or subtask node in the task chain to obtain the data they need, in an embodiment the method further comprises: saving the processing results of the plurality of computer nodes in a shared memory connected to all computer nodes in the computer cluster. In this way, every computer node can fetch its operation data from the shared memory when an operation starts and store the operation result in the shared memory when the operation completes. Note that the state snapshots of the processing on the computer nodes can also be kept in the shared memory, or a separate snapshot store can be provided. Examples of memory include, but are not limited to, read-only memory (ROM), random access memory (RAM), registers, cache memory, semiconductor memory devices, and the like.

In various embodiments, the method can further comprise: after all computer nodes executing the current subtask node have completed processing, allocating the operations required to complete the next subtask node. Several situations call for this step. For example, when a heavy operation requires all computer nodes in the computer cluster to take part in one subtask node, the operation of the next subtask node can be allocated only after all computer nodes have finished processing. As another example, when the operation of the next subtask node can be allocated only after all operation results of the previous subtask node are available, the method also needs to include this step. Of course, in an embodiment the operations of different subtask nodes can also be allocated at the same time to different groups of computer nodes in the computer cluster.

The processing procedure for mass report data is described below with reference to Fig. 4. As shown in Fig. 4, a task chain can have several task nodes (or mesh processing nodes) between its start and its end. Each task node can contain several subtask nodes (for example in a fully connected topology), and the operation of each subtask node can be allocated to one or more computer nodes (not shown) in the computer cluster. The operations to be performed by each computer node can be coordinated by a single master control unit. This master control unit can be one of the computer nodes in the computer cluster and is also responsible for heartbeat detection of all other computer nodes in the cluster. Data shared in the context of an operation (for example, the processing results of the computer nodes) can be saved to the shared memory connected to all computer nodes in the computer cluster. After the operation of the last mesh processing node is completed, the operation result can be output, for example to a target application. Note that Fig. 4 shows a task chain with three mesh processing nodes, but the drawing is only an example and does not limit the length of the task chain. For example, the method according to the invention can use two mesh processing nodes that handle data checking and data conversion respectively.

In a preferred embodiment, the operations required to complete the subtask nodes can be allocated according to a greedy algorithm. That is, the most complex operations can be given to the computer node with the strongest computing capability, which safeguards the processing speed of the whole computation.
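One possible greedy allocation is sketched below. The patent only states that the most complex operations go to the most capable node; the specific rule used here (take operations in order of decreasing cost and give each to the node that would finish it earliest) is an assumption, as are the names `greedy_assign`, `costs`, and `capacities`.

```python
from typing import Dict, List


def greedy_assign(costs: Dict[str, float], capacities: Dict[str, float]) -> Dict[str, List[str]]:
    """Assign operations, most expensive first, to the node that would finish each earliest."""
    finish_time = {node: 0.0 for node in capacities}
    plan: Dict[str, List[str]] = {node: [] for node in capacities}
    for task, cost in sorted(costs.items(), key=lambda item: item[1], reverse=True):
        # Estimated completion time on a node = its current load + cost / capacity.
        node = min(capacities, key=lambda n: finish_time[n] + cost / capacities[n])
        plan[node].append(task)
        finish_time[node] += cost / capacities[node]
    return plan


costs = {"frag_a": 8, "frag_b": 4, "frag_c": 2}  # hypothetical operation costs
capacities = {"fast": 2.0, "slow": 1.0}          # hypothetical relative node speeds
plan = greedy_assign(costs, capacities)
print(plan)
```

With these made-up numbers the most expensive operation, `frag_a`, lands on the `fast` node, which matches the behaviour described in the text.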

In a preferred embodiment, the computer cluster can consist of computer nodes on which a cloud computing platform is deployed, so that the resource advantages of the cloud computing platform can be exploited and the hardware demands that heavy computation places on the computer nodes are reduced. For example, the HADOOP cloud computing platform can be used, and the computer nodes can be LINUX system servers on which a cloud computing platform (for example HADOOP) is deployed. An implementation of the method according to the invention on the HADOOP cloud computing platform is described below.

One exemplary embodiment of the invention uses the HADOOP cloud computing platform to implement the concurrent distributed processing method for mass report data. The specific implementation is as follows:

(1) Choose 5 to 10 ordinary servers (LINUX operating system) as report data computing nodes;

(2) Deploy the HADOOP platform on these ordinary servers and combine the machines into a distributed computing cluster;

(3) Initialize the distributed file system (HDFS) on the distributed computing cluster;

(4) According to the form of report data processing, the system groups the mass report data to be processed by the "data check grouping rule" and the "data conversion grouping rule" (steps (5) to (14) below are described using "data conversion" processing as the example);

(5) For "data conversion" processing, the system pushes the content of the mass report data and the content of the data conversion formula set to the distributed file system (HDFS);

(6) The system splits the content of the data conversion formula set pushed to the distributed file system (HDFS) according to the number of machines in the distributed computing cluster and their computing capability, dividing it into 5 to 10 data fragments (the number can equal the number of participating processing nodes);

(7) The system pushes the 5 to 10 data fragments, together with the mass report data, to the computing nodes in the distributed computing cluster (the entire report data is pushed to every computing node so that each node can use it as its formula set operations require);

(8) The system starts the task allocation and scheduling program, which calls each computing node in the distributed computing cluster to operate simultaneously on the data fragment allocated to that node and on the mass report data;

(9) The system calls each computing node in the distributed computing cluster to gather and merge, on a small scale, the operation results on that node, producing an intermediate result set on each computing node;

(10) The system calls the task allocation and scheduling program to observe the idle state of each computing node in the distributed computing cluster and enables some of the more idle computing nodes (2 to 3 computing nodes) as merge nodes, in preparation for the merge work;

(11) The system calls the task allocation and scheduling program to start the merge program on the selected merge nodes;

(12) The system calls the merge nodes to collect the intermediate result sets produced in the previous steps and to sort and merge them on a large scale;

(13) The system collects the result sets produced by each merge node, merges and gathers them once more on a large scale, pushes the resulting data conversion result set to the HDFS file system for storage, and stops all redundant parallel computation tasks that are still running; and

(14) The system fetches the final data conversion result stored in the HDFS file system, performs a last format conversion on it, and writes the converted result in a unified way to a traditional relational database (ORACLE) for use by other applications.
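Step (6), splitting the formula set according to the number of machines and their computing capability, can be sketched as a capacity-weighted split. The text does not say how capability is measured or how the split is computed, so the proportional rule, the function name `split_by_capacity`, and the node names and weights below are assumptions made purely for illustration.

```python
from typing import Dict, List


def split_by_capacity(formulas: List[str], capacities: Dict[str, float]) -> Dict[str, List[str]]:
    """Give each node a contiguous share of formula lines proportional to its capacity."""
    total = sum(capacities.values())
    fragments: Dict[str, List[str]] = {}
    start = 0
    cumulative = 0.0
    for node, capacity in capacities.items():
        cumulative += capacity
        # Cumulative boundaries avoid rounding drift; the last boundary is len(formulas).
        end = round(len(formulas) * cumulative / total)
        fragments[node] = formulas[start:end]
        start = end
    return fragments


conversion_formulas = [f"rule_{n}" for n in range(12)]  # placeholder formula lines
by_node = split_by_capacity(conversion_formulas, {"n1": 2, "n2": 1, "n3": 1})
print({node: len(lines) for node, lines in by_node.items()})
```

In this example the node with twice the capacity receives twice as many formula lines as each of the other two.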

The implementation of "data check" is similar to that of "data conversion".

The above method enables fast processing of mass report data.

To make the principle of the invention easier to understand, an example is given below with reference to Figs. 5 to 7:

Suppose there is a report that needs data conversion, named "construction work in progress detail list (one)". The data of this report is shown in Fig. 5.

Suppose the following set of data conversion rules is configured in the system. The rule set is structured as a 1 × 72 (rows × columns) matrix. In this rule set, only two conversion rules are configured, in the 69th and 70th columns: AA [one, construction project # two, technological transformation project] and AB [one, construction project # two, technological transformation project]. The rule set is shown in Fig. 6.

Explanation of the data conversion rules:

The rule "AA [one, construction project # two, technological transformation project]" means: from "construction work in progress detail list (one)", take the data of the 17 cells in column AA from row 10 to row 26 in Fig. 5, and fill the data of these 17 cells in order into the 68th column of the data conversion rule set.

The rule "AB [one, construction project # two, technological transformation project]" means: from "construction work in progress detail list (one)", take the data of the 17 cells in column AB from row 10 to row 26 in Fig. 5, and fill the data of these 17 cells in order into the 70th column of the data conversion rule set.

The final result of the conversion is shown in Fig. 7.
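The cell extraction that the two rules describe can be sketched as follows. The sheet is modelled here as a dictionary keyed by (column letters, row number) and filled with made-up values; this representation and the helper names `column_index` and `extract_cells` are assumptions, since the actual report and rule formats appear only in Figs. 5 to 7.

```python
from typing import Dict, List, Optional, Tuple

Sheet = Dict[Tuple[str, int], float]


def column_index(letters: str) -> int:
    """Convert spreadsheet column letters to a 1-based index (A=1, Z=26, AA=27)."""
    index = 0
    for ch in letters:
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def extract_cells(sheet: Sheet, column: str, first_row: int, last_row: int) -> List[Optional[float]]:
    """Read the cells of one column from first_row to last_row, inclusive."""
    return [sheet.get((column, row)) for row in range(first_row, last_row + 1)]


# Made-up values for column AA, rows 10 to 26 of the report.
sheet = {("AA", row): row * 1.5 for row in range(10, 27)}
cells = extract_cells(sheet, "AA", 10, 26)
print(len(cells), column_index("AA"), column_index("AB"))
```

Rows 10 to 26 inclusive give the 17 cells mentioned in the rule explanation; the extracted list would then be written in order to the target column of the result.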

The preferred embodiments of the invention have been described in detail above with reference to the drawings. However, the invention is not limited to the specific details of the above embodiments. Within the scope of the technical concept of the invention, various simple variations can be made to its technical scheme, and all such simple variations fall within the protection scope of the invention. For example, a computer node can be replaced by a computing node or a computing unit.

It should also be noted that the specific technical features described in the above embodiments can be combined in any suitable manner, provided they do not contradict one another. To avoid unnecessary repetition, the invention does not describe the various possible combinations separately.

In addition, the various embodiments of the invention can also be combined with one another in any way. As long as such a combination does not depart from the idea of the invention, it should likewise be regarded as content disclosed by the invention.

Claims

1. A concurrent distributed processing method for mass report data, characterized in that the method comprises:

acquiring report data;

generating a report data formula set, and splitting the generated report data formula set by lines into a plurality of formula set fragments, wherein each formula set fragment comprises multiple lines of report data formulas; and

pushing the report data to each computer node in a computer cluster;

allocating the operations on the formula set fragments to a plurality of computer nodes in the computer cluster for processing;

saving a state snapshot of the processing on the plurality of computer nodes; and

when the operation on any formula set fragment is interrupted, restoring the pre-interruption operation state from the state snapshot, and continuing the interrupted operation.

2. The method according to claim 1, characterized in that the processing by the plurality of computer nodes comprises formula operations and a first-level merge of the operation results, to obtain a plurality of first-level merge results; and

the method further comprises: performing a second-level merge on the plurality of first-level merge results; and outputting the final data result of the second-level merge to a target application.

3. The method according to claim 1, characterized in that the step of generating the report data formula set comprises generating a report data check formula set, and generating a report data conversion formula set for the report data that has passed the check.

4. The method according to claim 1, characterized in that the method further comprises:

performing heartbeat detection on the plurality of computer nodes; and

reallocating the operations assigned to a computer node that does not respond to heartbeat detection to other computer nodes.

5. The method according to claim 1, characterized in that the method further comprises:

saving the processing results of the plurality of computer nodes in a shared memory connected to all computer nodes in the computer cluster.

6. The method according to claim 1, characterized in that the method further comprises:

after all computer nodes operating on the current formula set fragment have completed processing, allocating the operation on the next formula set fragment.

7. The method according to claim 1, characterized in that the method further comprises:

allocating the operations on the formula set fragments according to a greedy algorithm.

8. The method according to claim 1, characterized in that the method further comprises:

after the operation on the last formula set fragment is completed, outputting the operation result.

9. The method according to claim 1, characterized in that the computer cluster consists of computer nodes on which a cloud computing platform is deployed.

10. The method according to claim 9, characterized in that the cloud computing platform is the HADOOP cloud computing platform.

11. The method according to claim 9, characterized in that the computer nodes are LINUX system servers.