CN103136217B

CN103136217B - Distributed data stream processing method and system

Info

Publication number: CN103136217B
Application number: CN201110378247.3A
Authority: CN
Inventors: 张旭; 杨志雄; 徐家; 邓中华
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Filing date: 2011-11-24
Publication date: 2016-12-14
Anticipated expiration: 2031-11-24

Abstract

The present application provides a distributed data stream processing method, the method comprising: dividing an original data stream into a real-time data stream and a historical data stream; processing the real-time data stream and the historical data stream in parallel and generating a respective processing result for each; and integrating the generated processing results. The present application further provides a distributed data stream processing apparatus, the apparatus comprising: a data identification module for dividing an original data stream into a real-time data stream and a historical data stream; a parallel processing module for processing the real-time data stream and the historical data stream in parallel and generating a respective processing result for each; and a data integration module for integrating the generated processing results. The application makes real-time computation on large data volumes possible: the operations on the real-time data stream can be processed in a distributed, parallel manner to the greatest extent, so that large-volume processing and high real-time performance are ensured at the same time and the response speed of the system is improved.

Description

Distributed data stream processing method and system

Technical Field

The present application relates to distributed data processing, and in particular, to a distributed data stream processing method and system for processing a large amount of data.

Background

At present, data stream processing has become a main mode of data mining and data analysis. For example, a weblog is a data stream with a large amount of data. As further examples, an e-commerce website continuously accumulates commodity release information, short message sending records, and the like. Such a data stream has the following characteristics: (1) the data volume is large; (2) each piece of information has an ID (identifier) of a feature to be analyzed; (3) each piece of information has a time attribute, that is, timeliness.

Data stream analysis is typically required to be real-time and fast, so that the system can respond in real time according to the current behavior of a particular user. For example, real-time analysis of the log captures the current state and recent access behavior of the user, and can effectively improve recommendation accuracy or prevent cheating in real time. How to analyze data streams rapidly enough to meet real-time requirements, especially when the data volume is large, is a technical difficulty.

Generally, the basic principle of the existing distributed data stream processing system is as shown in fig. 1: a raw data stream S is distributed to a plurality of functional modules F. The functional modules F process the data simultaneously and send the processed results to the data integration module I, which integrates the data and outputs the integrated data. However, the existing distributed data stream processing system has the following disadvantages:

(1) When the amount of data in a data stream is very large, data processing and data analysis become very time-consuming. The existing distributed data stream processing system generally adopts a shared storage mode: data interaction among different modules, especially between upstream and downstream modules, takes place by module A putting its result into storage (a database, a file, and the like) and module B then reading the data from the storage. When speed becomes the bottleneck, most existing processing technologies cannot keep up with the growth rate of the real-time data stream, and the data delay is large. Data analysis can then only be performed offline, which delays data analysis and data mining and makes it impossible to react to the current or recent behavior of the user.

(2) Distributed parallel computing has become a trend for processing large data volumes. However, the existing parallel computing system is basically limited to a structure of function replication: all operation modules have the same function and run the same program, and each merely operates on a different part of the data. Therefore, finer-grained parallelism cannot be implemented, modularization and hot plugging of modules cannot be implemented, and maintenance is difficult.

Disclosure of Invention

The application provides a distributed data stream processing method, which comprises the following steps: dividing an original data stream into a real-time data stream and a historical data stream; processing the real-time data stream and the historical data stream in parallel and respectively generating respective processing results; and integrating the generated processing results.

Preferably, in the step of processing the real-time data stream, the real-time data stream is divided according to dimensions and processed in parallel.

Preferably, the step of processing the real-time data stream comprises: slicing the real-time data stream into a plurality of data blocks; cutting each of the plurality of data blocks into a plurality of data units in parallel, and then respectively sending the plurality of data units to a plurality of different functional modules for parallel processing; and summarizing the results of the parallel processing.

Preferably, in the step of processing the historical data stream, the historical data stream is cut by dimension and processed in parallel.

The present application further provides a distributed data stream processing apparatus, the apparatus comprising: a data identification module for dividing the original data stream into a real-time data stream and a historical data stream; a parallel processing module for processing the real-time data stream and the historical data stream in parallel and generating a respective processing result for each; and a data integration module for integrating the generated processing results.

Preferably, when the parallel processing module processes the real-time data stream, the real-time data stream is segmented according to the dimension and is processed in parallel.

Preferably, the real-time data processing system that processes the real-time data stream comprises: a horizontal slicing module for slicing the real-time data stream into a plurality of data blocks; a plurality of vertical slicing modules for slicing each of the plurality of data blocks into a plurality of data units in parallel and then respectively sending the plurality of data units to a plurality of different functional modules for parallel processing; and a result summarizing module for summarizing the results of the parallel processing.

Preferably, when processing the historical data stream, the parallel processing module segments the historical data stream by dimension and processes it in parallel.

According to the distributed data stream processing method of the present application, the data stream is segmented multiple times according to time sequence and dimension: a multilayer structure uses the time sequence to process data by time period, and a new distributed architecture vertically segments the information stream by different dimensions. This makes real-time computation on large data volumes possible. The operations on the real-time data stream can be processed in a distributed, parallel manner to the greatest extent, so that large-volume processing and high real-time performance are both ensured and the response speed of the system is improved.

Drawings

Embodiments of the present application will be described below with reference to the accompanying drawings, in which:

FIG. 1 illustrates a schematic diagram of a prior art distributed data stream processing system;

FIG. 2 illustrates a schematic diagram of one embodiment of a large data volume distributed data stream processing system of the present application;

FIG. 3 is a flow chart illustrating a large data volume distributed data stream processing method of the present application corresponding to the large data volume distributed data stream processing system of FIG. 2;

FIG. 4 illustrates a schematic diagram of one embodiment of real-time processing system 30 of FIG. 2; and

FIG. 5 is a flowchart illustrating a real-time processing method of the present application corresponding to the real-time processing system 30 in FIG. 4.

Detailed Description

The above-described spirit and substance of the present application will be described in detail with reference to fig. 2 to 5.

Although the embodiments of the system and method of the present application are described below by taking website log data stream as an example, it can be understood that the present application may also be used to process data streams of systems such as personalized recommendation, real-time anti-cheating, commodity release, mobile phone short message sending, and scientific computing.

Taking website log data stream as an example, fig. 2 illustrates a schematic diagram of an embodiment of the large data volume distributed data stream processing system of the present application.

The large data volume distributed data stream processing system in fig. 2 includes: a data identification module 10; a before-30-days data processing system 20; a real-time data processing system 30; a within-30-days data processing system 40; and a data integration module 50. It will be appreciated that the modules may be implemented by a computer or similar device having computing or processing capabilities, by a network of multiple such devices, or by a portion of the hardware or software of such devices.

Fig. 3 is a flowchart illustrating a large data volume distributed data stream processing method according to the present application corresponding to the large data volume distributed data stream processing system in fig. 2. One embodiment of the present application is described below in conjunction with fig. 2 and 3.

In step S100, the original data stream 100 is acquired.

In step S101, after the original data stream 100 is acquired by the data identification module 10, the data identification module 10 identifies whether the data in the original data stream 100 is real-time data, data within 30 days, or data from before 30 days, so as to divide the original data stream 100, in time sequence, into the before-30-days data stream 200, the real-time data stream 300, and the within-30-days data stream 400. The before-30-days data stream 200 is sent to the before-30-days data processing system 20, the real-time data stream 300 is sent to the real-time data processing system 30, and the within-30-days data stream 400 is sent to the within-30-days data processing system 40.

In step S102, the before-30-days data processing system 20 processes the data from before 30 days and sends the processing result to the data integration module 50. In step S103, the real-time data processing system 30 processes the real-time data and sends the processing result to the data integration module 50. In step S104, the within-30-days data processing system 40 processes the data within 30 days and sends the processing result to the data integration module 50. Step S102, step S103, and step S104 are executed in parallel.
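The time-based split of step S101 can be sketched as follows. This is a minimal illustration, not the patented implementation: the record layout (a dict with a `timestamp` field), the function name `identify`, and the one-day "real-time" window are all assumptions made for the example; the application only fixes the 30-day boundary and leaves the real-time range to actual needs.

```python
from datetime import datetime, timedelta


def identify(records, now,
             realtime_window=timedelta(days=1),   # assumed "real-time" range
             history_limit=timedelta(days=30)):   # 30-day boundary from the text
    """Split time-stamped records into three streams by age."""
    before_30_days, real_time, within_30_days = [], [], []
    for record in records:
        age = now - record["timestamp"]
        if age <= realtime_window:
            real_time.append(record)        # goes to real-time system 30
        elif age <= history_limit:
            within_30_days.append(record)   # goes to within-30-days system 40
        else:
            before_30_days.append(record)   # goes to before-30-days system 20
    return before_30_days, real_time, within_30_days
```

Changing `realtime_window` or `history_limit`, or adding further boundaries, corresponds to choosing other time limits for the split.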

In step S105, the data integration module 50 integrates the received processing results and outputs the integrated data.
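Steps S102 to S105 can be sketched as below, assuming each data processing system is reduced to a single function and the "processing result" is a per-user record count. Both are illustrative assumptions (the names `count_by_user` and `process_and_integrate` are hypothetical), and a thread pool stands in for what would be separate machines or clusters in a real deployment.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_by_user(stream):
    """Hypothetical per-stream processing: count records per user ID."""
    return Counter(record["user"] for record in stream)


def process_and_integrate(streams):
    """Process every stream concurrently (S102-S104), then merge (S105)."""
    with ThreadPoolExecutor(max_workers=max(1, len(streams))) as pool:
        partial_results = list(pool.map(count_by_user, streams))
    integrated = Counter()
    for partial in partial_results:
        integrated.update(partial)  # integration here is a simple sum of counts
    return integrated
```

Integration is a plain sum only because counts are additive; other analyses would need their own merge rule.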

It is to be understood that although the original data stream 100 is divided by the data identification module 10 into three portions distinguished by time limits, namely the before-30-days data stream 200, the real-time data stream 300, and the within-30-days data stream 400, those skilled in the art can divide the original data stream 100 by other time limits depending on the actual circumstances. For example, the original data stream 100 may be divided into fewer or more time periods (and accordingly, the large data volume distributed data stream processing system includes fewer or more data processing systems), time limits other than 30 days may be employed, or the time range considered "real-time" may be defined according to actual needs.

As can be seen from the above embodiments, the large data amount distributed data stream processing method of the present application is basically divided into three stages of time-sequence segmentation, data processing, and data integration.

In the chronological segmentation stage, since the system log is appended to continuously, the real-time data stream 300 is first distributed to the real-time processing system 30 by the data identification module 10; historical data (e.g., the before-30-days data stream 200 and the within-30-days data stream 400) has already been stored as files, so it is sent to the historical file processing systems (e.g., the before-30-days data processing system 20 and the within-30-days data processing system 40).

In the data processing phase, the history processing system and the real-time processing system process data of different periods in parallel.

In the data integration stage, the results of the parallel processing of the data in different time periods are all sent to the data integration module 50, and after the results are integrated, the results can be output to provide external services.

In this embodiment, the system and the data are divided according to the time sequence, which is well suited to processing time-ordered data streams with large data volumes and is a basis for processing massive data.

Assuming that every piece of information in the data stream is time-stamped, the full data stream runs from the first piece of data to the current data (and is still growing). If a certain time point is defined as a separation point, the full data can be divided into historical data and real-time data. For the full data stream, the historical data before the separation point already exists at that point in time. For example, data from more than one day ago does not need to be computed in real time, so it can be computed offline, and only the calculation result needs to be integrated with the results of other modules (such as a real-time processing module).

The historical data and the real-time data are processed separately. Computing the historical data offline greatly reduces the pressure on real-time computation, so that real-time data can be computed more quickly. Meanwhile, historical data can be computed in finer detail.
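To illustrate why offline computation relieves the real-time path, the sketch below assumes the historical result has already been computed offline and is simply passed in; only the records after the separation point are processed at query time. The function name `integrate`, the per-user count, and the record layout are assumptions for the example.

```python
from collections import Counter


def integrate(historical_result, realtime_records):
    """Combine a precomputed historical result with a fresh real-time result.

    historical_result: Counter produced offline (not recomputed here).
    realtime_records:  records newer than the separation point.
    """
    realtime_result = Counter(record["user"] for record in realtime_records)
    return historical_result + realtime_result
```

The cost of a call depends only on the number of real-time records, however large the historical data behind `historical_result` is.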

According to the method and the device, data are segmented according to the time sequence, so that data processing in different time periods can be performed in parallel, and high response performance of real-time data is guaranteed.

In order to further improve the performance of the real-time data processing system, the present application also proposes to further split the information units (i.e. data blocks) of the data by dimension and distribute them to various function modules (i.e. different types of function modules). In the present application, the term "dimension" is used to distinguish data of different attributes or types; that is, data of different dimensions are processed by different types of function modules. The real-time data processing system 30 is described below as an example.

FIG. 4 illustrates a schematic diagram of one embodiment of real-time processing system 30 of FIG. 2.

As shown in fig. 4, the real-time processing system 30 includes: a horizontal slicing module 400; a plurality (N) of vertical slicing modules 500; a plurality (N) of functional module groups 600, wherein each functional module group 600 comprises a plurality (M) of functional modules; and a result summarization module 700. In this application, the terms "horizontal" and "vertical" are merely used to identify the respective levels of slicing and do not denote direction.

Fig. 5 is a flowchart illustrating a real-time processing method of the present application corresponding to the real-time processing system 30 in fig. 4. One embodiment of the real-time processing system of the present application is described below in conjunction with fig. 4 and 5.

In step S200, a real-time data stream 300 is acquired.

In step S201, the horizontal slicing module 400 slices the acquired real-time data stream 300 into a plurality of data blocks (1, 2, 3, ..., N) (the slicing in this step is the so-called horizontal slicing), and sends the data blocks to a plurality (N) of vertical slicing modules 500, respectively. As shown in fig. 4, the 1st data block is sent to the 1st vertical slicing module 500, the 2nd data block is sent to the 2nd vertical slicing module 500, and so on, until the N-th data block is sent to the N-th vertical slicing module 500. It is understood that, because the data stream is unbounded but flowing, each of the plurality (N) of vertical slicing modules 500 can be reused after processing one data block, and the number of vertical slicing modules 500 can therefore be set according to the flow rate of the data stream.

In step S202, each vertical slicing module 500 slices a received data block into a plurality of (up to M, as the case may be) data units (the slicing in this step is the so-called vertical slicing by dimension), and sends the data units respectively to a plurality of (up to M, corresponding to the number of data units) different function modules in one function module group 600.
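Horizontal slicing in step S201 can be sketched as below. The application does not say how block boundaries are chosen, so a fixed number of records per block is an assumption here, as is the round-robin reuse of the N vertical slicing modules; the names `horizontal_slice` and `assign_blocks` are hypothetical.

```python
from itertools import islice


def horizontal_slice(stream, block_size):
    """Yield consecutive data blocks of at most block_size records."""
    iterator = iter(stream)
    while True:
        block = list(islice(iterator, block_size))
        if not block:
            return
        yield block


def assign_blocks(stream, block_size, num_slicers):
    """Pair each block with the index of the vertical slicing module that
    receives it, reusing the modules in round-robin order."""
    for block_index, block in enumerate(horizontal_slice(stream, block_size)):
        yield block_index % num_slicers, block
```

Both functions are generators, so they work on an unbounded stream without buffering more than one block at a time.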

As shown in fig. 4, the 1st vertical slicing module 500 slices data block 1 into M data units, sends the 1st data unit to the 1st functional module of the 1st functional module group 600, sends the 2nd data unit to the 2nd functional module of the 1st functional module group 600, and so on, until the M-th data unit is sent to the M-th functional module of the 1st functional module group 600.

By analogy, if the data traffic of the real-time data stream 300 is large enough, the 2nd vertical slicing module 500 slices data block 2 into M data units, sends the 1st data unit to the 1st functional module of the 2nd functional module group 600, sends the 2nd data unit to the 2nd functional module of the 2nd functional module group 600, and so on, until the M-th data unit is sent to the M-th functional module of the 2nd functional module group 600.

By analogy, if the data traffic of the real-time data stream 300 is large enough, there may be more data blocks, vertical slicing modules 500, functional module groups 600, and functional modules. It is understood that the number of vertical slicing modules 500, functional module groups 600, and functional modules in each functional module group 600 may be set as required.
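Vertical slicing by dimension (step S202) can be sketched as follows, under the assumption that a data block is a list of dict records and that a "dimension" is a named set of fields. The field names and the `dimensions` mapping are hypothetical; the application does not prescribe a data layout.

```python
def vertical_slice(block, dimensions):
    """Split one data block into one data unit per dimension.

    block:      list of records (dicts).
    dimensions: mapping of dimension name -> list of field names.
    Each data unit keeps, for every record, only that dimension's fields.
    """
    return {
        name: [
            {field: record[field] for field in fields if field in record}
            for record in block
        ]
        for name, fields in dimensions.items()
    }
```

With M dimensions in the mapping, one block yields M data units, one for each functional module in the group.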

Step S202 and step S203 are executed in parallel.

In step S203, each functional module processes the received data unit and sends the processed result to the result summarizing module 700.

In step S204, the result summarizing module 700 summarizes the received results and outputs the summarized data.
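Steps S203 and S204 can be sketched as below. The two functional modules, their counting logic, and the `FUNCTION_MODULES` table are assumptions chosen for illustration; a thread pool stands in for independently deployed functional modules.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def product_module(unit):
    """Hypothetical functional module: count views per product."""
    return Counter(item["product"] for item in unit)


def path_module(unit):
    """Hypothetical functional module: count hits per access path."""
    return Counter(item["path"] for item in unit)


FUNCTION_MODULES = {"product": product_module, "path": path_module}


def process_units(units_per_block):
    """Run each data unit through its functional module in parallel (S203),
    then summarize the partial results per dimension (S204)."""
    tasks = [(dim, unit)
             for units in units_per_block
             for dim, unit in units.items()]
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        results = list(pool.map(
            lambda task: (task[0], FUNCTION_MODULES[task[0]](task[1])), tasks))
    summary = {dim: Counter() for dim in FUNCTION_MODULES}
    for dim, partial in results:
        summary[dim].update(partial)
    return summary
```

Here `units_per_block` holds one dict of data units per data block, keyed by dimension name.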

From the description of the embodiment, it can be seen that the real-time data stream is first sliced horizontally into data blocks, which are distributed to the processors (e.g., the vertical slicing modules 500); every processor has the same function. The processors work in parallel, which greatly improves processing speed.

Then, the vertical slicing module 500 slices the data blocks vertically by dimension, that is, it extracts data units of different dimensions from the data blocks; the data units of each dimension are then sent to the corresponding function processing modules (that is, function modules) and processed by them in parallel.

Taking a weblog data stream as the original data stream, the weblog data stream is first sliced horizontally into a plurality of log information data blocks, and each log information data block is allocated to a corresponding vertical slicing module 500. Each vertical slicing module 500 then slices its log information data block vertically by dimension; for example, it extracts commodity information from the log information data block and sends it to the commodity processing unit, and extracts keyword information and sends it to the keyword processing unit. In this way, each information unit is decomposed into finer-grained elements, which are distributed to the functional units and processed in parallel. For example, among the functional units for processing the real-time website log data stream, a commodity information analysis module analyzes commodity information and an access path module analyzes access paths, and these modules run in parallel. Then, the user and commodity information is sent to the recommendation function module, and the user and access path information is sent to the anti-cheating module, with all of these modules processing in parallel.

Finally, the results processed by the functional modules are sent to an integrator (e.g., the result summarizing module 700, optionally together with the data integration module 50), and the integrator integrates (summarizes) the results.

The present application describes the segmentation of data using a real-time data processing system as an example. It will be appreciated that a similar architecture may be employed for the historical data processing systems. By contrast, since historical data processing is performed periodically, a low-cost clustered distributed computing system can be used for it.

From the above description, it can be seen that the present application does not adopt a shared storage mode as in the existing distributed data stream processing system. Instead, the data stream is segmented multiple times according to time sequence and dimension: a multilayer structure uses the time sequence to process data by time period, and a new distributed architecture vertically segments the information stream by different dimensions. The architecture is not limited to function replication as in the existing parallel computing system; that is, parallel computing in the present invention is not achieved by having all operation modules perform the same function (run the same program) and merely operate on different parts of the data. Therefore, the invention can realize finer-grained parallelism, can also realize modularization and hot plugging of modules, and is easier to maintain.

The invention makes real-time computation on large data volumes possible. The operations on the real-time data stream can be processed in a distributed, parallel manner to the greatest extent, so that large-volume processing and high real-time performance are both ensured and the response speed of the system is improved.
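The application does not describe a mechanism for modularization or hot plugging. One possible reading, sketched here purely as an assumption, is that each dimension is bound to its own functional module at run time, so that a module can be added or removed without touching the others. The class `ModuleRegistry` and its methods are hypothetical names.

```python
class ModuleRegistry:
    """Hypothetical registry mapping a dimension name to a functional module."""

    def __init__(self):
        self._modules = {}

    def plug(self, dimension, module):
        """Attach (or replace) the functional module for one dimension."""
        self._modules[dimension] = module

    def unplug(self, dimension):
        """Detach the functional module for one dimension, if present."""
        self._modules.pop(dimension, None)

    def dispatch(self, units):
        """Process every data unit whose dimension has a plugged-in module;
        units of other dimensions are skipped."""
        return {dim: self._modules[dim](unit)
                for dim, unit in units.items()
                if dim in self._modules}
```

Under function replication, by contrast, every worker runs the same program, so changing one analysis means redeploying all workers.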

The large data volume distributed data stream processing method according to the present application may be implemented by one or more processing devices with arithmetic processing capability, such as one or more computers, running computer-executable instructions. A large data volume distributed data stream processing system according to the present application may be one or more processing devices, such as one or more computers, in which the individual modules or units are the device components that provide the corresponding functionality when the processing device executes the computer-executable instructions. According to an embodiment of the application, the large data volume distributed data stream processing method and system can be implemented on Linux, Windows, and other systems using languages such as Java and SQL.

While the present application has been described with reference to exemplary embodiments, it is understood that the terminology used is intended to be in the nature of words of description and illustration, rather than of limitation. As the present application may be embodied in several forms without departing from the spirit or essential characteristics thereof, it should also be understood that the above-described embodiments are not limited by any of the details of the foregoing description, but rather should be construed broadly within the spirit and scope defined in the appended claims. All changes and modifications that fall within the metes and bounds of the claims, or equivalents of such metes and bounds, are therefore intended to be embraced by the appended claims.

Claims

1. A method of distributed data stream processing, the method comprising:

dividing an original data stream into a real-time data stream and a historical data stream;

processing the real-time data stream and the historical data stream in parallel and respectively generating respective processing results; and

integrating the generated processing results; wherein offline computation is employed in the step of processing the historical data stream, and

in the step of processing the real-time data stream, the real-time data stream is segmented by dimension and processed in parallel, so that finer-grained parallel processing is realized.

2. The method of claim 1, wherein processing the real-time data stream comprises:

slicing the real-time data stream into a plurality of data blocks;

cutting each of the plurality of data blocks into a plurality of data units in parallel, and then respectively sending the plurality of data units to a plurality of different functional modules for parallel processing; and

summarizing the results of the parallel processing.

3. The method of claim 1, wherein,

in the step of processing the historical data stream, the historical data stream is segmented by dimension and processed in parallel.

4. A distributed data stream processing apparatus, the apparatus comprising:

a data identification module for dividing the original data stream into a real-time data stream and a historical data stream;

a parallel processing module for processing the real-time data stream and the historical data stream in parallel and generating a respective processing result for each; and

a data integration module for integrating the generated processing results; wherein,

the parallel processing module employs offline computation in processing the historical data stream, and

when the parallel processing module processes the real-time data stream, the real-time data stream is segmented by dimension and processed in parallel, so that finer-grained parallelism is realized.

5. The apparatus of claim 4, wherein a real-time data processing system that processes the real-time data stream comprises:

a horizontal slicing module for slicing the real-time data stream into a plurality of data blocks;

a plurality of vertical slicing modules for slicing each of the plurality of data blocks into a plurality of data units in parallel and then respectively sending the plurality of data units to a plurality of different functional modules for parallel processing; and

a result summarizing module for summarizing the results of the parallel processing.

6. The apparatus of claim 4, wherein,

when processing the historical data stream, the parallel processing module segments the historical data stream by dimension and processes it in parallel.