CN116028034A

CN116028034A - Big data preprocessing method, system, storage medium and terminal

Info

Publication number: CN116028034A
Application number: CN202211612865.4A
Authority: CN
Inventors: 李彬; 孙卫东; 温冬
Original assignee: Xi'an Huaxun Technology Co ltd
Current assignee: Xi'an Huaxun Technology Co ltd
Priority date: 2022-12-15
Filing date: 2022-12-15
Publication date: 2023-04-28

Abstract

The invention belongs to the technical field of data processing, and particularly discloses a big data preprocessing method, which comprises the following steps: determining the task type of data source preprocessing and the functional composition of a preprocessing model; determining, based on the functional composition, the drag track of component operators in a data modeling system to form a flow-visualized preprocessing model; preprocessing part of the data of the data source based on the preprocessing model; judging whether the preprocessing result meets the design requirement and, if so, saving the preprocessing model as a recorded macro model and saving the model meta-information, or, if not, rebuilding the preprocessing model; and importing the recorded macro model, generating the corresponding model meta-information, processing the data of the data source, and outputting a processing result. The invention performs drag-and-drop modeling based on components and flowcharts, models graphically in a zero-code manner, encapsulates specific functions in component operators, and processes data through componentization, so that configuration is flexible and the threshold for data processing is greatly reduced.

Description

Big data preprocessing method, system, storage medium and terminal

Technical Field

The invention belongs to the technical field of data processing, and particularly relates to a big data preprocessing method, a big data preprocessing system, a storage medium and a terminal.

Background

With the development of computers and sensors, the production scale of modern enterprises is larger and larger, and the automation level is higher and higher. In the production process, the sensor can monitor a large amount of data in real time, and the computer completes the work of fault detection, fault early warning and the like through data processing.

In data processing, researchers have developed various computational models to process data automatically and rapidly. To use them, modeling staff must work with data processing tools and machine learning tools and write customized code in a specific language, and the complete calculation can only be carried out after complex environment construction and configuration. More importantly, the metadata must be calculated and maintained while the data is being processed, so processing performance depends mainly on the hardware and on the developers' design, algorithms and code quality, which restricts the efficiency of big data processing.

Disclosure of Invention

The invention aims to overcome the defects in the prior art and provides a big data preprocessing method and system.

In a first aspect of the present invention, a big data preprocessing method is provided, including:

determining a task type of data source preprocessing, and determining a functional composition of a preprocessing model based on the task type;

determining, based on the functional composition, the drag track of component operators in the data modeling system to form a flow-visualized preprocessing model;

preprocessing part of the data of the data source based on the preprocessing model and outputting a preprocessing result;

judging whether the preprocessing result meets the design requirement; if so, saving the preprocessing model as a recorded macro model and saving the model meta-information; if not, rebuilding the preprocessing model;

and importing the recorded macro model, generating the corresponding model meta-information, processing the data of the data source, and outputting a processing result.

Further, the model meta-information is the processing action corresponding to each processing node of the recorded macro model.

Further, the processing actions at least comprise creating a database table structure and checking data.

Further, the component operator is a unit that encapsulates a processing function; different processing functions correspond to different processing nodes.

In a second aspect of the present invention, there is provided a big data preprocessing system comprising:

the task determining module is used for determining the task type of the data source preprocessing;

the function determining module is connected with the task determining module and is used for determining the function composition of the preprocessing model based on the task type;

the model construction module is connected with the function determining module and is used for determining, based on the functional composition, the drag track of component operators in the data modeling system to form a flow-visualized preprocessing model;

the data source screening unit is used for dividing the data source into data to be processed by the preprocessing model and data to be processed by the recorded macro model;

the verification module is used for verifying whether the preprocessing result meets the design requirement; if so, the preprocessing model is saved as a recorded macro model and the model meta-information is saved; if not, the preprocessing model is rebuilt;

and the storage module is used for saving the preprocessing model as a recorded macro model and saving the model meta-information.
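The division performed by the data source screening unit can be illustrated with a minimal Python sketch. The function name `screen_source` and the choice of taking the first records as the small batch are assumptions made for illustration; the application does not state how the part of the data is selected.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def screen_source(source: Iterable[T], sample_size: int) -> Tuple[List[T], Iterator[T]]:
    """Split a data source into a small batch and the remaining records.

    The small batch is intended for the preprocessing model; the remaining
    iterator is intended for the recorded macro model. Taking the first
    `sample_size` records is an assumption of this sketch.
    """
    iterator = iter(source)
    sample = list(islice(iterator, sample_size))  # consumes only the sample
    return sample, iterator                        # the rest stays lazy


sample_batch, remaining = screen_source(range(10), 3)
```

Because the remainder is returned as an iterator, a large source does not have to be loaded into memory in order to take the small batch.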

The invention also provides a storage medium which stores big data preprocessing instructions, and the big data preprocessing instructions realize the big data preprocessing method when being executed by a processor.

The invention also provides a terminal which comprises a memory and a processor, wherein the memory stores big data preprocessing instructions, and the processor loads the big data preprocessing instructions to execute the big data preprocessing method.

Compared with the prior art, the invention has the beneficial effects that:

(1) In the field of big data processing, non-professional developers would otherwise have to learn a programming language in order to write code, which sets a high threshold. The invention performs drag-and-drop modeling based on components and flowcharts, models graphically in a zero-code manner, encapsulates specific functions in component operators, and processes data through componentization, so that configuration is flexible and the threshold for data processing is greatly reduced;

(2) Although some modeling tools already integrate drag-and-drop modeling, they cannot support a big data environment; once a model has a defect, it is not discovered in time under big data, and the applicable business scenarios and modeling capabilities are limited. The invention encapsulates the data modeling logic in a big data scenario and thereby separates big data calculation from modeling; while the page modeling is carried out, the data structure is preprocessed and split in advance to form data objects, and various data acquisition modes are adapted, so that scenarios such as huge data volumes and complex transmission modes can be handled and the cost of back-end computing resources is reduced;

(3) Processing the data structure in advance, at the modeling stage, allows the data structure and types to be checked, which avoids the situation in which a single erroneous record during batch processing causes the whole task to fail and be recalculated, wasting resources.
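The early check of data structure and type described in effect (3) might look like the following Python sketch. The expected schema, the column names and the function `check_rows` are hypothetical and serve only to illustrate the idea of catching a bad record on a small batch before the full batch is run.

```python
from typing import Dict, List, Tuple

# Hypothetical expected structure: column name -> Python type.
EXPECTED = {"machine_id": str, "temperature": float}


def check_rows(rows: List[Dict[str, object]],
               expected: Dict[str, type]) -> List[Tuple[int, str, str]]:
    """Return (row_index, column, problem) tuples; an empty list means OK."""
    problems = []
    for index, row in enumerate(rows):
        for column, expected_type in expected.items():
            if column not in row:
                problems.append((index, column, "missing"))
            elif not isinstance(row[column], expected_type):
                found = type(row[column]).__name__
                problems.append(
                    (index, column, f"expected {expected_type.__name__}, got {found}")
                )
    return problems


small_batch = [
    {"machine_id": "m1", "temperature": 20.5},
    {"machine_id": "m2", "temperature": "hot"},  # wrong type
    {"machine_id": "m3"},                        # missing column
]
issues = check_rows(small_batch, EXPECTED)
```

Running such a check at the modeling stage reports the offending rows without starting the full calculation task.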

Drawings

The following drawings are illustrative of the invention and are not intended to limit the scope of the invention, in which:

Fig. 1 is a flow diagram of the big data preprocessing method;

Fig. 2 shows a flow-visualized preprocessing model disclosed in the application.

Detailed Description

The present invention will be further described in detail with reference to the following specific examples, which are given by way of illustration, in order to make the objects, technical solutions, design methods and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

As shown in fig. 1, the present invention provides a big data preprocessing method, which includes the following steps:

S1, determining the task type of data source preprocessing, and determining the functional composition of a preprocessing model based on the task type;

S2, determining, based on the functional composition, the drag track of component operators in the data modeling system to form a flow-visualized preprocessing model;

S3, preprocessing a small batch of data from the data source based on the preprocessing model and outputting a preprocessing result;

S4, judging whether the preprocessing result meets the design requirement; if so, saving the preprocessing model as a recorded macro model and saving the model meta-information; if not, repeating step S2;

S5, importing the recorded macro model, generating the corresponding model meta-information, processing the data of the data source, and outputting a processing result.

In step S2, drag-and-drop modeling may be performed using an existing open source tool, such as Kettle or NiFi.
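As a rough illustration only, steps S1 to S5 can be sketched in plain Python. The sketch does not use Kettle or NiFi; an ordered list of operator names stands in for the dragged flow, and all names (`TASK_FUNCTIONS`, `OPERATORS`, `preprocess`, the task type "cleaning") are assumptions that do not appear in the application.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Operator = Callable[[List[Row]], List[Row]]


def drop_empty(rows: List[Row]) -> List[Row]:
    """Remove rows in which every value is None."""
    return [r for r in rows if any(v is not None for v in r.values())]


def deduplicate(rows: List[Row]) -> List[Row]:
    """Keep the first occurrence of each distinct row."""
    seen, out = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out


# Hypothetical operator library and task-type mapping (assumptions).
OPERATORS: Dict[str, Operator] = {"drop_empty": drop_empty, "deduplicate": deduplicate}
TASK_FUNCTIONS: Dict[str, List[str]] = {"cleaning": ["drop_empty", "deduplicate"]}


def run_model(model: List[str], rows: List[Row]) -> List[Row]:
    """Apply the operators of the model in flow order."""
    for name in model:
        rows = OPERATORS[name](rows)
    return rows


def preprocess(task_type: str, source: List[Row], sample_size: int,
               meets_requirement: Callable[[List[Row]], bool]) -> List[Row]:
    model = list(TASK_FUNCTIONS[task_type])                  # S1 and S2
    sample_result = run_model(model, source[:sample_size])   # S3
    if not meets_requirement(sample_result):                 # S4: check
        raise ValueError("preprocessing model must be rebuilt (back to S2)")
    recorded_macro = {"model": model, "meta": {"steps": len(model)}}  # S4: save
    return run_model(recorded_macro["model"], source)        # S5


source = [{"id": 1}, {"id": 1}, {"id": None}, {"id": 2}]
result = preprocess("cleaning", source, 2, lambda rows: len(rows) >= 1)
```

In this sketch the rebuild branch of S4 is represented by an exception; in an interactive modeling tool the user would instead return to the canvas and adjust the flow.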

In this embodiment, the model meta-information is the processing action corresponding to each processing node of the recorded macro model. During big data processing, each processing node performs a different processing action; for example, after the data is acquired, rows or columns may need to be deleted or added, or a column name may need to be modified. For the data, the name, type, attributes and so on of a column all belong to the meta-information of the data, and during data processing a database table structure for holding the data must be created based on that meta-information. For step S5, the stored recorded macro model comprises the model, the data meta-information and the model meta-information; in this embodiment, the model meta-information may be the creation process of the database table structure. When the data source is processed based on the recorded macro model, the creation of the model meta-information is omitted and only the data itself is processed (for example, the data is classified and stored in the database table structure), which lowers the demands on system hardware and improves data processing efficiency. It should be appreciated that, within the same embodiment, the data processed by the preprocessing model and the data processed by the recorded macro model have the same meta-information even though they differ greatly in volume, so the model meta-information of the preprocessing model applies equally to the recorded macro model. Because drag-and-drop visual data modeling is adopted, the logic of calculation is separated from that of modeling: the data structure is preprocessed in advance during page modeling, only calculation tasks are performed during data calculation, and calculation performance is improved.
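The separation described above, in which the table structure is created once at the modeling stage and the bulk run only stores data, can be sketched with Python's standard `sqlite3` module. The table name, column names and helper functions are hypothetical; the application does not name a database or a schema.

```python
import sqlite3
from typing import Dict, List, Tuple

# Data meta-information (assumed example): column name -> SQL type.
DATA_META: Dict[str, str] = {"machine_id": "TEXT", "temperature": "REAL"}


def build_model_meta(table: str, data_meta: Dict[str, str]) -> Dict[str, str]:
    """Derive the table-creation and insert statements once.

    Table and column names are interpolated directly, so they must come
    from a trusted model definition, never from user input.
    """
    columns = ", ".join(f"{name} {sql_type}" for name, sql_type in data_meta.items())
    placeholders = ", ".join("?" for _ in data_meta)
    return {
        "table": table,
        "create_table": f"CREATE TABLE IF NOT EXISTS {table} ({columns})",
        "insert": f"INSERT INTO {table} VALUES ({placeholders})",
    }


def modeling_stage(conn: sqlite3.Connection, sample: List[Tuple]) -> Dict[str, str]:
    """Small-batch run: create the table structure and store the sample."""
    meta = build_model_meta("readings", DATA_META)
    conn.execute(meta["create_table"])
    conn.executemany(meta["insert"], sample)
    conn.commit()
    return meta


def processing_stage(conn: sqlite3.Connection, meta: Dict[str, str],
                     rows: List[Tuple]) -> None:
    """Bulk run: reuse the saved statements; no structure is derived here."""
    conn.executemany(meta["insert"], rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
meta = modeling_stage(conn, [("m1", 20.5)])
processing_stage(conn, meta, [("m2", 21.0), ("m3", 19.0)])
row_count = conn.execute(f"SELECT COUNT(*) FROM {meta['table']}").fetchone()[0]
```

The bulk stage receives the saved meta-information as an input and does nothing except store rows, which mirrors the statement that only the data itself is processed when the recorded macro model is used.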

In the above, the component operator is a unit that encapsulates a processing function; different processing functions correspond to different processing nodes. Since the data processing performed at each processing node is different, a different component operator must be dragged to each processing node, and that component operator should provide the function required by the node.
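One possible way to picture a component operator is as a named wrapper around a single processing function, with each processing node of the flow naming the operator it needs and its configuration. The following Python sketch is an assumption for illustration; `ComponentOperator`, `REGISTRY`, `FLOW` and the two example functions are not taken from the application or from any modeling tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass(frozen=True)
class ComponentOperator:
    """A unit that packages one processing function under a name."""
    name: str
    func: Callable[..., List[Row]]

    def __call__(self, rows: List[Row], **config) -> List[Row]:
        return self.func(rows, **config)


def fill_missing(rows: List[Row], column: str, default: object) -> List[Row]:
    """Replace a missing or None value in `column` with `default`."""
    return [{**r, column: default if r.get(column) is None else r[column]} for r in rows]


def select_columns(rows: List[Row], columns: List[str]) -> List[Row]:
    """Keep only the listed columns."""
    return [{k: r[k] for k in columns if k in r} for r in rows]


REGISTRY = {
    op.name: op
    for op in (
        ComponentOperator("fill_missing", fill_missing),
        ComponentOperator("select_columns", select_columns),
    )
}

# Each processing node names its operator and carries its own configuration.
FLOW = [
    {"node": "n1", "operator": "fill_missing", "config": {"column": "temp", "default": 0.0}},
    {"node": "n2", "operator": "select_columns", "config": {"columns": ["id", "temp"]}},
]


def run_flow(flow: List[dict], rows: List[Row]) -> List[Row]:
    for node in flow:
        rows = REGISTRY[node["operator"]](rows, **node["config"])
    return rows


output = run_flow(FLOW, [{"id": 1, "temp": None, "raw": "x"}])
```

Keeping the configuration in the node rather than in the operator lets the same operator be reused at several nodes with different settings.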

For example, in industrial manufacturing, workshop machine data may need to be processed daily, organized, and then loaded into a business system. Because the machines carry a great deal of data and are fitted with multiple sensors for real-time data acquisition, massive amounts of data are generated every day; the machine types also differ, so the data processing methods may differ as well. Ensuring the integrity and efficiency of the data processing flow requires the support of modeling tools. Although some modeling tools in the industry offer related data processing modes, they cannot support a big data environment, and their applicable business scenarios and modeling capabilities are limited. In one embodiment, the machine data is first pre-modeled by visual dragging in a modeling tool; after the modeling succeeds, the model is saved and the matching model meta-information is generated; finally, the model is imported and the machine data is fed in for calculation. During calculation, the corresponding model meta-information is used directly, so the work of maintaining the model information while calculating is omitted; the saved model can be reused, the calculation is lighter, and the calculation cost is reduced. The method thus handles scenarios that arise in machine data processing, such as huge data volumes, script-based data acquisition and low processing efficiency.

For another example, as shown in fig. 2, the data processing of personnel information needs to be completed according to service requirements; in this embodiment, the data processing method includes:

S01, acquiring the processing type of the personnel information, and then determining the functional composition of the preprocessing model based on the processing type, the functional composition including: collecting the personnel information to be processed, adding gender information, de-duplication, renaming, processing address information, information warehousing and the like;

S02, dragging the corresponding component operators into position in Kettle or NiFi according to the functional composition to form a flow-visualized preprocessing model;

S03, processing a small part of the personnel information through the preprocessing model, the data processing including: collecting the personnel information to be processed, adding a gender column to the personnel information according to conditions, removing duplicate names from the name information, renaming the name field according to service requirements, and removing the address column from the personnel information; once the data processing of the personnel information is complete, the data is warehoused and the processing result is output;

S04, if the processing result meets the design requirement, saving the preprocessing model as a recorded macro model and saving the model meta-information, wherein the model meta-information in this process at least includes adding the gender column, creating the database table and the like;

S05, importing the recorded macro model, generating the corresponding model meta-information, processing all of the personnel information, and outputting a processing result.
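The personnel information flow of S01 to S05 can be mimicked in a few lines of Python. The field names, the rule that derives gender from a title, the new field name `full_name` and the in-memory list used as a warehouse are all assumptions of this sketch; the application states only that gender is added according to conditions and does not give the rule.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def add_gender(rows: List[Row]) -> List[Row]:
    # Hypothetical condition: derive gender from a "title" field.
    mapping = {"Mr": "male", "Ms": "female"}
    return [{**r, "gender": mapping.get(r.get("title"), "unknown")} for r in rows]


def dedupe_by_name(rows: List[Row]) -> List[Row]:
    """Keep the first record for each name."""
    seen, out = set(), []
    for r in rows:
        if r["name"] not in seen:
            seen.add(r["name"])
            out.append(r)
    return out


def rename_name_field(rows: List[Row]) -> List[Row]:
    """Rename the "name" field to "full_name" (assumed service requirement)."""
    return [{("full_name" if k == "name" else k): v for k, v in r.items()} for r in rows]


def drop_address(rows: List[Row]) -> List[Row]:
    return [{k: v for k, v in r.items() if k != "address"} for r in rows]


PIPELINE: List[Callable[[List[Row]], List[Row]]] = [
    add_gender, dedupe_by_name, rename_name_field, drop_address,
]


def run(rows: List[Row], warehouse: List[Row]) -> List[Row]:
    for step in PIPELINE:
        rows = step(rows)
    warehouse.extend(rows)  # stand-in for the warehousing operation
    return rows


people = [
    {"name": "Li", "title": "Mr", "address": "X"},
    {"name": "Li", "title": "Mr", "address": "Y"},
    {"name": "Wen", "title": "Ms", "address": "Z"},
]
trial_store: List[Row] = []
trial = run(people[:2], trial_store)   # S03: small part of the data
final_store: List[Row] = []
final = run(people, final_store)       # S05: all of the data
```

The same `PIPELINE` object is used for the trial run and for the full run, which corresponds to saving the verified preprocessing model and importing it again for all of the personnel information.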

The invention relates to a big data management system that integrates, processes and intelligently loads batch data, and that models in a componentized, flow-based manner. In this scheme, the big data processing is pre-run in advance and data calculation is completely separated from modeling, which reduces the loss of machine performance during mass data calculation and thereby ensures efficient data processing. First, a data processing model is created and a small batch of data is preprocessed with it in advance; this avoids the repeated calculation caused by data processing errors that stem from mistakes in the model, and ensures the correctness of the model and the operability of the data. After the data preprocessing, the preprocessing model and the model meta-information are saved. In use, the preprocessing model is imported, the model meta-information is obtained, and mass data can be calculated once the data is imported. During data processing, the performance bottleneck of having to parse the model meta-information while calculating is eliminated; and because the model can be reused after being saved, data processing becomes more convenient.
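Saving the verified model together with its meta-information and importing it later could be done with any serialization format. The Python sketch below uses a JSON file purely as an assumption; the file layout, the key names and the functions `save_macro` and `load_macro` are hypothetical and are not described in the application.

```python
import json
import os
import tempfile
from typing import Dict, List


def save_macro(path: str, steps: List[str], model_meta: Dict[str, str]) -> None:
    """Persist the ordered operator names and the model meta-information."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"steps": steps, "model_meta": model_meta}, handle)


def load_macro(path: str) -> dict:
    """Import a previously saved recorded macro model."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def demo() -> dict:
    """Round trip through a temporary file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "macro.json")
        save_macro(
            path,
            ["add_gender", "deduplicate"],
            {"create_table": "CREATE TABLE person (full_name TEXT, gender TEXT)"},
        )
        return load_macro(path)


macro = demo()
```

Since the saved file carries both the flow and its meta-information, a later run only needs to load it and feed in data, and the same file can be reused for any number of runs.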

In one embodiment, the model meta-information is a processing action corresponding to each processing node of the recorded macro-model.

In some possible embodiments, the processing actions include at least creating a database table structure and data checking.

In the above, the component operator is a unit that encapsulates a processing function; different processing functions correspond to different processing nodes.

The embodiment of the application also comprises a storage medium, wherein the storage medium stores instructions, the instructions are executed by a processor to realize the big data preprocessing method shown in fig. 1, and the storage medium can be an optical disk, a ROM or a RAM containing the instructions.

The embodiment of the application also comprises a terminal which comprises a memory and a processor, wherein the memory stores instructions, and the processor loads the instructions to execute the big data preprocessing method shown in fig. 1. The terminal equipment includes, but is not limited to, mobile phones, computers, tablet computers and other terminal equipment.

The embodiments of the present application are not limited in this regard.

The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A big data preprocessing method, characterized by comprising:

2. The method of claim 1, wherein the model meta-information is a processing action corresponding to each processing node of the recorded macro model.

3. A big data preprocessing method according to claim 2, wherein said processing actions include at least creating a database table structure and data verification.

4. The big data preprocessing method according to claim 1, wherein the component operators are units packaged with different processing functions; different ones of the processing functions are responsive to corresponding ones of the processing nodes.

5. A big data preprocessing system, comprising:

6. The big data preprocessing system of claim 5, wherein said model meta-information is a processing action corresponding to each processing node of said recorded macro model.

7. The big data preprocessing system of claim 6, wherein said processing actions include at least creating a database table structure and data verification.

8. The big data preprocessing system of claim 5, wherein said component operators are units packaged with different processing functions; different ones of the processing functions are responsive to corresponding ones of the processing nodes.

9. A storage medium storing instructions which, when executed by a processor, implement the big data preprocessing method of any one of claims 1-4.

10. A terminal comprising a memory and a processor, the memory storing instructions, the processor loading the instructions to perform the big data preprocessing method of any of claims 1-4.