
CN114185998A - Data processing method, device, equipment and storage medium - Google Patents

Publication number
CN114185998A
Authority
CN
China
Prior art keywords
data
target
type
processed
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111289614.2A
Other languages
Chinese (zh)
Inventor
康全忠
林业宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202111289614.2A
Publication of CN114185998A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a data processing method, apparatus, device, and storage medium in the field of computer technology, and can improve data processing efficiency. The data processing method comprises: acquiring data to be processed, wherein the data to be processed comprises first-type data or second-type data, and the complexity of the second-type data is higher than that of the first-type data; if the data to be processed is first-type data, writing the first-type data into a clickstream data warehouse; and if the data to be processed is second-type data, performing at least one processing operation on the second-type data, and writing the detail data and result data obtained from each processing operation into the clickstream data warehouse.

Description

Data processing method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data processing method, apparatus, device, and storage medium.
Background
In the big data era, data is continuously and iteratively updated. Users typically query data stored in a system in table form according to their needs, and querying massive data helps them make better-informed business decisions.
Current data queries include offline data queries and real-time data queries. Offline data queries are generally based on a Hadoop data warehouse system (Hadoop is a distributed system infrastructure developed by the Apache Foundation), in which different business data are scheduled and queried through different Hive tasks (Hive is a Hadoop-based data warehouse tool). However, the Hadoop data warehouse system is mostly used to query offline data, and its query efficiency is low when the volume of data to be queried is large. Real-time data queries are generally based on Flink, the real-time computing engine of a real-time data warehouse. However, existing Flink-based real-time data warehouses have high development costs, and their data is difficult to verify.
As can be seen, in scenarios with large query volumes, or in scenarios requiring real-time data query, calculation, and analysis, existing data query methods cannot meet the requirements, which reduces data processing efficiency.
Disclosure of Invention
The present disclosure provides a data processing method, apparatus, device, and storage medium, which can improve data processing efficiency.
The technical solutions of the embodiments of the present disclosure are as follows:
According to a first aspect of embodiments of the present disclosure, there is provided a data processing method that may be applied to an electronic device. The data processing method may include:
acquiring data to be processed, wherein the data to be processed comprises first-type data or second-type data, and the complexity of the second-type data is higher than that of the first-type data;
if the data to be processed is first-type data, writing the first-type data into a clickstream data warehouse;
and if the data to be processed is second-type data, performing at least one processing operation on the second-type data, and writing the detail data and result data obtained from each processing operation into the clickstream data warehouse.
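The branching described in the first aspect can be sketched as follows. This is a minimal illustration only: the `Warehouse` class is a hypothetical in-memory stand-in for the clickstream data warehouse, and the table names (`raw`, `detail_step_N`, `result`) and the `process` function are assumptions introduced here, not names from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Rows = List[dict]


@dataclass
class Warehouse:
    """Hypothetical in-memory stand-in for the clickstream data warehouse."""
    tables: Dict[str, Rows] = field(default_factory=dict)

    def write(self, table: str, rows: Rows) -> None:
        # Append rows to the named table, creating it on first use.
        self.tables.setdefault(table, []).extend(rows)


def process(rows: Rows,
            is_second_type: bool,
            operations: List[Callable[[Rows], Rows]],
            warehouse: Warehouse) -> Rows:
    """Write first-type data directly; process second-type data step by step."""
    if not is_second_type:
        # First-type data: no processing, written as-is.
        warehouse.write("raw", rows)
        return rows
    current = rows
    for step, operation in enumerate(operations, start=1):
        current = operation(current)
        # Detail data: the output of each individual processing operation.
        warehouse.write(f"detail_step_{step}", current)
    # Result data: the output after all processing operations have run.
    warehouse.write("result", current)
    return current
```

Because every intermediate output is persisted alongside the final result, a later check can compare any step against its neighbours without re-running the pipeline.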
Optionally, the data processing method further includes:
receiving a data query instruction for acquiring target data, wherein the target data comprises first-type data or second-type data;
in response to the data query instruction, acquiring the target data from the clickstream data warehouse;
and outputting the target data.
Optionally, when the target data is second-type data, acquiring the target data from the clickstream data warehouse in response to the data query instruction includes: in response to the data query instruction, acquiring the detail data and result data of the target data from the clickstream data warehouse.
Optionally, the data processing method further includes:
receiving a data verification instruction for performing data verification on the target data;
in response to the data verification instruction, performing a data verification operation on the target data according to the detail data and the result data of the target data, wherein the data verification operation is used to verify the integrity of the target data.
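One way such a verification could look is sketched below. The disclosure does not specify how integrity is checked; this sketch assumes, purely for illustration, that the result data is an aggregate (a total and a row count) of the detail data, so the check recomputes both from the detail rows. The function name `verify_integrity` and the field names `amount`, `total`, and `count` are hypothetical.

```python
from typing import List


def verify_integrity(detail_rows: List[dict],
                     result_row: dict,
                     key: str = "amount") -> bool:
    """Return True when the stored result agrees with the stored detail rows.

    Assumption: the result row holds the sum of `key` over the detail rows
    and the number of detail rows. A real check would depend on the actual
    processing logic.
    """
    expected_total = sum(row[key] for row in detail_rows)
    expected_count = len(detail_rows)
    return (result_row["total"] == expected_total
            and result_row["count"] == expected_count)
```

A mismatch indicates that rows were lost, duplicated, or altered between the detail data and the result data.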
Optionally, when the target data includes more than a preset number of items of first-type data, acquiring the target data from the clickstream data warehouse in response to the data query instruction includes:
in response to the data query instruction, acquiring the plurality of items of first-type data from the clickstream data warehouse;
calling a pre-generated configuration file to convert the plurality of items of first-type data into a target data set, and determining the target data set as the target data, wherein the configuration file is used to convert a plurality of items of data into a data set.
Optionally, the data processing method further includes:
acquiring a data type identifier of data to be processed;
determining the data type of the data to be processed according to the data type identifier;
and determining the complexity of the data to be processed according to the data type.
According to a second aspect of the embodiments of the present disclosure, there is provided a data processing apparatus, which can be applied to an electronic device. The data processing apparatus may include: an acquisition unit and a processing unit;
the acquisition unit is configured to acquire data to be processed, wherein the data to be processed comprises first-type data or second-type data, and the complexity of the second-type data is higher than that of the first-type data;
the processing unit is configured to write the first-type data into the clickstream data warehouse if the data to be processed is first-type data;
and the processing unit is further configured to, if the data to be processed is second-type data, perform at least one processing operation on the second-type data, and write the detail data and the result data obtained from each processing operation into the clickstream data warehouse.
Optionally, the data processing apparatus further includes: a receiving unit and an output unit;
the receiving unit is used for receiving a data query instruction for acquiring target data; the target data comprises a first type of data or a second type of data;
the acquisition unit is also used for responding to the data query instruction and acquiring target data from the click stream data warehouse;
an output unit for outputting the target data.
Optionally, when the target data is second-type data, the acquisition unit is specifically configured to:
in response to the data query instruction, acquire the detail data and result data of the target data from the clickstream data warehouse.
Optionally, the receiving unit is further configured to receive a data verification instruction for performing data verification on the target data;
the processing unit is also used for responding to the data verification instruction and executing data verification operation on the target data according to the detail data and the result data of the target data; the data verification operation is used to verify the integrity of the target data.
Optionally, when the target data includes more than a preset number of items of first-type data, the acquisition unit is specifically configured to:
in response to the data query instruction, acquire the plurality of items of first-type data from the clickstream data warehouse;
call a pre-generated configuration file to convert the plurality of items of first-type data into a target data set, and determine the target data set as the target data, wherein the configuration file is used to convert a plurality of items of data into a data set.
Optionally, the acquisition unit is further configured to acquire a data type identifier of the data to be processed;
the processing unit is also used for determining the data type of the data to be processed according to the data type identifier;
and the processing unit is also used for determining the complexity of the data to be processed according to the data type.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device, which may include: a processor and a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement any of the above-described optional data processing methods of the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon instructions, which, when executed by a processor of an electronic device, enable the electronic device to perform any one of the above-mentioned optional data processing methods of the first aspect.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product, which includes computer instructions that, when run on an electronic device, cause the electronic device to perform the data processing method according to any one of the optional implementations of the first aspect.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
The technical solutions provided by the embodiments of the present disclosure bring at least the following beneficial effects:
Based on any one of the above aspects, in the present disclosure, after the electronic device acquires the data to be processed, it may perform different data processing operations according to the complexity of that data. If the data to be processed is first-type data, the first-type data is written into the clickstream data warehouse (ClickHouse); if the data to be processed is second-type data, whose complexity is higher than that of the first-type data, at least one processing operation is performed on the second-type data, and the detail data and result data obtained from each processing operation are written into ClickHouse. In this way, the present disclosure processes data of different complexities separately. Simple first-type data is written directly into ClickHouse: the processing flow is simple, the data link is short, the processing cost is low, and subsequent queries from ClickHouse are fast. For second-type data, the detail data and the calculation result of each calculation are stored in ClickHouse in real time, so that they can later be verified layer by layer in ClickHouse and queried quickly, which improves data processing efficiency.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a schematic flow chart illustrating a data processing method provided by an embodiment of the present disclosure;
Fig. 2 is a block diagram illustrating a data processing system according to an embodiment of the present disclosure;
Fig. 3 is a schematic diagram illustrating an application scenario of a data processing method provided by an embodiment of the present disclosure;
Fig. 4 is a schematic diagram illustrating an application scenario of another data processing method provided by an embodiment of the present disclosure;
Fig. 5 is a flow chart illustrating a further data processing method provided by an embodiment of the present disclosure;
Fig. 6 is a schematic diagram illustrating an application scenario of another data processing method provided by an embodiment of the present disclosure;
Fig. 7 is a schematic diagram illustrating an application scenario of another data processing method provided by an embodiment of the present disclosure;
Fig. 8 is a schematic diagram illustrating an application scenario of another data processing method provided by an embodiment of the present disclosure;
Fig. 9 is a schematic structural diagram of a data processing apparatus provided in an embodiment of the present disclosure;
Fig. 10 is a schematic structural diagram of a terminal provided in an embodiment of the present disclosure;
Fig. 11 is a schematic structural diagram of a server provided in an embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, and/or components.
The data to which the present disclosure relates may be data that is authorized by a user or sufficiently authorized by parties.
As described in the background, current data queries include offline data queries and real-time data queries. Offline data queries are generally based on a Hadoop data warehouse system (Hadoop is a distributed system infrastructure developed by the Apache Foundation), in which different business data are scheduled and queried through different Hive tasks (Hive is a Hadoop-based data warehouse tool). However, the Hadoop data warehouse system is mostly used to query offline data, and its query efficiency is low when the volume of data to be queried is large. Real-time data queries are generally based on Flink, the real-time computing engine of a real-time data warehouse. However, existing Flink-based real-time data warehouses have high development costs, and their data is difficult to verify.
As can be seen, in scenarios with large query volumes, or in scenarios requiring real-time data query, calculation, and analysis, existing data query methods cannot meet the requirements, which reduces data processing efficiency.
Based on this, the embodiments of the present disclosure provide a data processing method in which, after acquiring the data to be processed, an electronic device may perform different data processing operations according to the complexity of that data. If the data to be processed is first-type data, the first-type data is written into the clickstream data warehouse (ClickHouse); if the data to be processed is second-type data, whose complexity is higher than that of the first-type data, at least one processing operation is performed on the second-type data, and the detail data and result data obtained from each processing operation are written into ClickHouse. In this way, the present disclosure processes data of different complexities separately. Simple first-type data is written directly into ClickHouse: the processing flow is simple, the data link is short, the processing cost is low, and subsequent queries from ClickHouse are fast. For second-type data, the detail data and the calculation result of each calculation are stored in ClickHouse in real time, so that they can later be verified layer by layer in ClickHouse and queried quickly, which improves data processing efficiency.
The following is an exemplary description of the data processing method provided by the embodiments of the present disclosure:
The data processing method provided by the present disclosure can be applied to an electronic device.
In some embodiments, the electronic device may be a server, a terminal, or another electronic device for performing data processing, which is not limited in this disclosure.
The server may be a single server, or may be a server cluster including a plurality of servers. In some embodiments, the server cluster may also be a distributed cluster. The present disclosure does not limit the specific implementation of the server either.
The terminal may be a mobile phone, a tablet computer, a desktop computer, a laptop computer, a handheld computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a cellular phone, a personal digital assistant (PDA), an augmented reality (AR) or virtual reality (VR) device, or another device on which a content community application (e.g., Kuaishou) can be installed and used; the present disclosure does not particularly limit the specific form of the electronic device. The terminal can interact with a user through one or more of a keyboard, a touch pad, a touch screen, a remote controller, voice interaction, a handwriting device, and the like.
The data processing method provided by the embodiments of the present disclosure is described in detail below with reference to the accompanying drawings.
As shown in Fig. 1, when the data processing method is applied to an electronic device, the data processing method may include:
S101, the electronic device acquires data to be processed.
The data to be processed comprises first-type data or second-type data, and the complexity of the second-type data is higher than that of the first-type data.
Specifically, the electronic device may obtain the data to be processed from a database.
In one embodiment, the database may be a source data warehouse, or may be another database.
In one embodiment, as shown in Fig. 2, the electronic device belongs to a data processing system that includes the electronic device, a storage device, and a request end.
The electronic device is communicatively connected to the storage device and to the request end.
The storage device is provided with a database for providing target data to the request end. When the volume of target data requested by the request end is large, it is inefficient for the request end to obtain the target data directly from the storage device. In this case, a clickstream data warehouse (full name: Click Stream Data WareHouse, abbreviated ClickHouse) may be deployed on the electronic device, and the target data may be sent to the request end through ClickHouse.
Optionally, the electronic device may obtain the data to be processed from the source data warehouse in real time.
In particular, the data in the source data warehouse may change in real time. For example, if data A is garbage data, data A needs to be deleted. If data B is newly generated, data B needs to be stored. If data C has been modified, the stored copy of data C needs to be updated. In this case, data A, data B, and data C are all data to be processed.
In one embodiment, ClickHouse can synchronize data from the source data warehouse in real time, so that when the data in the source data warehouse changes, the electronic device can acquire the data to be processed in real time.
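The three kinds of change in the example (delete data A, store data B, update data C) can be sketched as change events applied to a synchronized copy. This is an illustrative sketch only: the event format (`kind`, `key`, `value`) and the function `apply_change` are assumptions, and a plain dictionary stands in for the synchronized copy.

```python
from typing import Any, Dict


def apply_change(replica: Dict[str, Any], event: Dict[str, Any]) -> None:
    """Apply one change event from the source to the synchronized copy."""
    kind = event["kind"]
    key = event["key"]
    if kind == "delete":
        # Garbage data is removed; deleting a missing key is a no-op.
        replica.pop(key, None)
    elif kind in ("insert", "update"):
        # New data is stored; changed data overwrites the old value.
        replica[key] = event["value"]
    else:
        raise ValueError(f"unknown change kind: {kind!r}")


# Usage: replay the three changes from the example in order.
replica = {"A": "garbage", "C": "old"}
for change in (
    {"kind": "delete", "key": "A"},
    {"kind": "insert", "key": "B", "value": "new"},
    {"kind": "update", "key": "C", "value": "updated"},
):
    apply_change(replica, change)
```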
Optionally, after the electronic device obtains the data to be processed, it may obtain the data type identifier of the data to be processed. It then determines the data type of the data to be processed according to the data type identifier, and determines the complexity of the data to be processed according to the data type.
In particular, data of different data types are stored in different databases or in different partitions of a database. The electronic device can determine the data type identifier of the data to be processed from the location identifier of the place where the data was obtained, and thereby determine the data type of the data to be processed.
Illustratively, database A stores video data to be processed, and database B stores text data to be processed. After the electronic device obtains the data to be processed from database A, it may take the location identifier of database A as the data type identifier of the data to be processed. Subsequently, the electronic device determines that the data to be processed is video data according to a pre-established correspondence between data type identifiers and data types.
In this way, the electronic device can quickly and accurately determine the complexity of the data to be processed from its data type, and then choose, according to that complexity, how the data to be processed is written into the clickstream data warehouse.
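The two lookups described above (location identifier to data type, then data type to complexity) can be sketched as two dictionaries. The identifiers `database_a` and `database_b` mirror the example; which data type counts as first-type or second-type is not stated in the text, so the mapping in `COMPLEXITY_BY_TYPE` is an assumption made only so that the sketch runs.

```python
from enum import Enum


class Complexity(Enum):
    FIRST_TYPE = 1   # simple data, written directly
    SECOND_TYPE = 2  # more complex data, processed before writing


# Pre-established correspondence between data type identifier
# (here: the location identifier of the source database) and data type.
TYPE_BY_LOCATION = {
    "database_a": "video",
    "database_b": "text",
}

# Assumption for illustration only: the disclosure does not say which
# data types are treated as complex.
COMPLEXITY_BY_TYPE = {
    "video": Complexity.SECOND_TYPE,
    "text": Complexity.FIRST_TYPE,
}


def classify(location_id: str) -> Complexity:
    """Resolve location identifier -> data type -> complexity."""
    data_type = TYPE_BY_LOCATION[location_id]
    return COMPLEXITY_BY_TYPE[data_type]
```

Keeping both correspondences as data rather than code means a new source database or data type only requires a new table entry.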
S102, if the data to be processed is first-type data, the electronic device writes the first-type data into the clickstream data warehouse.
Specifically, the first-type data is raw data in the source data warehouse that has not been processed. The first-type data can be sent directly to the request end, so that the request end can use it for business analysis or model training. Data of this kind has a simple data type and usually consists of one or a few simple Hive tables.
When the electronic device receives the data to be processed and determines that it is first-type data, a data acquisition module in the electronic device sends the acquired data to the synchronization engine of the clickstream data warehouse. The synchronization engine then writes the data into the clickstream data warehouse. As a result, when the data in the source data warehouse changes, the changed data is synchronized to the clickstream data warehouse in real time, which guarantees real-time data processing.
Optionally, the synchronization engine may be ClickHouse's MaterializedMySQL database engine, which synchronizes a MySQL database (MySQL is a relational database management system). Through MaterializedMySQL, the electronic device can synchronize the data of MySQL tables (i.e., the source data warehouse) to ClickHouse in real time.
Illustratively, as shown in Fig. 3, the electronic device may obtain the first-type data from MySQL deployed on the storage device through MaterializedMySQL and write it into ClickHouse. Subsequently, the data in ClickHouse can be analyzed and processed by a functional module with online analytical processing (OLAP) capability deployed on the electronic device.
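As a rough sketch of what setting up such synchronization might involve, the helper below only builds the ClickHouse statements as strings; it does not connect to any server. The `CREATE DATABASE ... ENGINE = MaterializedMySQL(...)` form and the `allow_experimental_database_materialized_mysql` setting follow ClickHouse's documented syntax, but the engine is experimental and its availability depends on the ClickHouse version. The function name and all host, database, and credential values are hypothetical, and the string formatting does no escaping.

```python
from typing import List


def materialized_mysql_statements(ch_db: str, host: str, port: int,
                                  mysql_db: str, user: str,
                                  password: str) -> List[str]:
    """Build the statements that create a MaterializedMySQL database.

    Illustration only: values are interpolated without escaping, so this
    must not be used with untrusted input.
    """
    # The engine is experimental and may need to be enabled explicitly.
    enable = "SET allow_experimental_database_materialized_mysql = 1"
    create = (
        f"CREATE DATABASE IF NOT EXISTS {ch_db} "
        f"ENGINE = MaterializedMySQL('{host}:{port}', '{mysql_db}', "
        f"'{user}', '{password}')"
    )
    return [enable, create]


# Hypothetical connection details, for illustration only.
statements = materialized_mysql_statements(
    "mirror", "mysql.example.internal", 3306, "source_db", "sync_user", "secret"
)
```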
The clickstream data warehouse is a column-oriented database management system, which offers high query speed. Compared with the Hadoop distributed system of the prior art, the electronic device can therefore acquire target data from the clickstream data warehouse quickly and in real time after receiving a data query instruction, which improves data processing efficiency.
S103, if the data to be processed is second-type data, the electronic device performs at least one processing operation on the second-type data, and writes the detail data and result data obtained from each processing operation into the clickstream data warehouse.
The detail data is the specific data obtained after the electronic device performs each processing operation on the second-type data, and the result data is the final data obtained after the electronic device has performed the processing operations on the second-type data.
Specifically, when the electronic device receives the data to be processed and determines that it is second-type data, the data type or format of the second-type data does not meet the requirements of the request end's subsequent business analysis or model training. In this case, the data acquisition module in the electronic device sends the acquired data to a real-time computing engine deployed in the electronic device. The real-time computing engine performs data processing operations on the second-type data acquired in real time, and obtains the detail data produced by each processing operation and the final result data produced once the processing operations are complete. The real-time computing engine can then write the detail data and the result data into the clickstream data warehouse, so that when the electronic device later performs data verification, it can obtain the detail data and result data from every data processing operation that the real-time computing engine performed on the original data, which improves the accuracy of data verification.
For example, as shown in Fig. 4, the data acquisition module in the electronic device may obtain 20 Hive tables (the data to be processed) from a Kafka message queue carrying the MySQL binlog deployed on the storage device. The 20 Hive tables are second-type data. The electronic device needs to perform two data processing operations on the 20 Hive tables to obtain data that meets the business analysis or model training needs of the request end.
In this case, the data acquisition module in the electronic device sends the 20 Hive tables to the real-time computing engine Flink deployed inside the electronic device. Flink performs the first data processing operation on the 20 Hive tables in the detail data layer (DWD layer for short), obtaining 10 Hive tables (i.e., detail data). After obtaining the 10 Hive tables, Flink may first write (sink) them into ClickHouse and also send them to the Kafka message queue of the DWD layer. The Kafka message queue of the DWD layer then sends the 10 Hive tables to the service data layer (Data Warehouse Service, DWS layer for short).
Next, Flink performs the second data processing operation on the 10 Hive tables at the DWS layer to obtain a target Hive table (i.e., result data). After obtaining the target Hive table, Flink may write (sink) it into ClickHouse.
Optionally, after the second data processing operation at the DWS layer produces the target Hive table, the target Hive table may also be sent to the Kafka message queue of the DWS layer. The Kafka message queue of the DWS layer then sends the target Hive table to the request end for direct use.
In this way, ClickHouse stores the detail data and the result data from every data processing operation that Flink performs on the original data. Later, when operation and maintenance personnel find abnormal data in the target Hive table, they can directly run SQL query statements against the detail data and result data from each processing operation, quickly locate the data operation node where the abnormality arose, and promptly modify the logic code of that node, which improves the accuracy of data verification.
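Locating the abnormal node from the stored per-layer outputs can be sketched as a scan over the layers in processing order. This is an illustrative sketch, not the disclosed implementation: a dictionary keyed by layer name stands in for the tables stored in ClickHouse, and the `is_valid` predicate, the `amount` field, and the function name `first_abnormal_layer` are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Rows = List[dict]


def first_abnormal_layer(stored: Dict[str, Rows],
                         layer_order: List[str],
                         is_valid: Callable[[dict], bool]) -> Optional[str]:
    """Return the earliest layer whose stored rows fail the check, if any."""
    for layer in layer_order:
        if any(not is_valid(row) for row in stored.get(layer, [])):
            return layer
    return None


# Usage: the DWD output is clean, so the fault was introduced at the DWS step.
stored_outputs = {
    "dwd": [{"amount": 5}, {"amount": 7}],
    "dws": [{"amount": -12}],
}
culprit = first_abnormal_layer(
    stored_outputs, ["dwd", "dws"], lambda row: row["amount"] >= 0
)
```

Scanning in processing order means the first failing layer is the one whose logic needs to be inspected, since every earlier layer has already passed the same check.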
The click stream data warehouse is a columnar database management system, which offers faster data queries. Therefore, compared with the Hadoop distributed system of the prior art, the electronic device can obtain target data from the click stream data warehouse quickly and in real time after receiving a data query instruction, which improves data processing efficiency.
The technical solution provided by this embodiment has at least the following beneficial effects. As can be seen from S101 to S103, after the electronic device obtains the data to be processed, it may execute different data processing operations according to the complexity of the data to be processed. If the data to be processed is first-class data, the first-class data is written into the click stream data warehouse; if the data to be processed is second-class data, whose complexity is higher than that of the first-class data, at least one processing operation is executed on the second-class data, and the detail data and result data obtained by each processing operation are written into the click stream data warehouse. As such, the present disclosure processes data of different complexities separately. Simple first-class data is written directly into ClickHouse, so the processing flow is simple, the data link is short, the processing cost is low, and subsequent quick queries from ClickHouse are facilitated. For second-class data, the detail data and calculation results of each calculation are stored in ClickHouse in real time, so that the detail data and calculation results in ClickHouse can later be checked layer by layer, which facilitates quick queries from ClickHouse and improves data processing efficiency.
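A minimal sketch of the branching in S101 to S103 is given below. It is not the disclosed implementation: the Warehouse class is an in-memory stand-in for the click stream data warehouse, and the layer labels ("raw", "detail", "result") and the rule that only the last operation yields result data are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Warehouse:
    """In-memory stand-in for the click stream data warehouse."""
    rows: list = field(default_factory=list)

    def write(self, layer, payload):
        self.rows.append((layer, payload))


def process(data, is_second_class, operations, warehouse):
    """Write first-class data directly; process second-class data step by step."""
    if not is_second_class:
        warehouse.write("raw", data)
        return data
    current = data
    for step, operation in enumerate(operations, start=1):
        current = operation(current)
        # Assumption: intermediate outputs are detail data, the final output is result data.
        layer = "result" if step == len(operations) else "detail"
        warehouse.write(f"{layer}:{step}", current)
    return current


warehouse = Warehouse()
process([3, 1, 2], False, [], warehouse)           # first-class: written as-is
process([3, 1, 2], True, [sorted, sum], warehouse)  # second-class: two operations
```

Every intermediate output is persisted, which is what later makes layer-by-layer checking possible.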
In an implementation manner, as shown in fig. 5, the data processing method further includes:
S401, the electronic device receives a data query instruction for acquiring target data.
The target data comprises first class data or second class data.
In one embodiment, the general approach to solving a problem using machine learning or deep learning can be broken down into the following steps:
obtaining the most original sample data;
then, performing characteristic engineering on the sample data to obtain characteristic data;
performing data processing (such as processing positive and negative sample proportion, invalid or cheating samples and the like) on the characteristic data to obtain a sample set for training, verification and testing;
and obtaining a training model according to the sample set.
As can be seen from the above, when a certain terminal or server needs to solve a problem by using machine learning or deep learning, corresponding raw data or feature data (i.e., target data) can be obtained by the electronic device. In this case, the operation and maintenance personnel can write a corresponding data query script to acquire the target data. Accordingly, the electronic device receives a data query instruction for obtaining the target data.
S402, the electronic equipment responds to the data query instruction and obtains target data from the click stream data warehouse.
Wherein the clickstream data repository is a columnar database management system (DBMS) for online analysis.
The click stream data warehouse stores first type data and second type data; the first type of data is original data which is not processed in the source data warehouse; the second type of data is detail data and result data obtained by performing data processing operation on the original data acquired in real time through a real-time computing engine.
Specifically, after receiving a data query instruction for acquiring target data, the electronic device queries the target data from the clickstream data warehouse in response to the data query instruction.
Optionally, the real-time computing engine and the source data repository may be two pieces of functional software integrated on the same device, or may be two pieces of functional software on two separate devices, which is not limited in this disclosure.
S403, the electronic device outputs the target data.
Specifically, after responding to a data query instruction and querying target data from a click stream data warehouse, the electronic device outputs the target data, so that a subsequent request end performs service analysis or model training by using the target data.
The flow of the data processing method described above is generally applied to a data mart portal. The technical solution provided by this embodiment has at least the following beneficial effects. As can be seen from S401 to S403, after receiving a data query instruction for acquiring target data, the electronic device may query the target data from the clickstream data warehouse in response to the data query instruction. The click stream data warehouse stores first-class data and second-class data; the first-class data is original data in the source data warehouse that has not been processed; the second-class data is detail data and result data obtained by a real-time computing engine performing data processing operations on original data acquired in real time. Subsequently, the electronic device outputs the target data. Because the click stream data warehouse stores data synchronized in real time and is a columnar database management system (whose data query speed is higher than that of the Hadoop distributed system of the prior art), the electronic device can obtain target data from the click stream data warehouse quickly and in real time after receiving a data query instruction, which improves data processing efficiency.
In an embodiment, referring to fig. 4 and as shown in fig. 6, when the target data is the second type of data, in the above S402, the method for the electronic device to obtain the target data from the clickstream data warehouse in response to the data query instruction specifically includes:
S601, the electronic device responds to the data query instruction and obtains the detail data and result data of the target data from the click stream data warehouse.
In this way, the electronic device can obtain the detail data and the result data of the target data from the click stream data warehouse, so that the detail data and the calculation results in the click stream data warehouse can later be checked layer by layer, which facilitates quick queries from ClickHouse and improves data processing efficiency.
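The query handling in S401 to S403, together with the S601 case for second-class data, might look roughly like the sketch below. It assumes sqlite3 as a stand-in for the click stream data warehouse; the tables raw_data, detail_data, and result_data and the function handle_query are hypothetical names introduced only for this example.

```python
import sqlite3


def handle_query(conn, target, data_class):
    """Return stored data for `target`; second-class data yields detail and result."""
    if data_class == "first":
        rows = conn.execute(
            "SELECT value FROM raw_data WHERE name = ? ORDER BY value", (target,)
        ).fetchall()
        return {"raw": [row[0] for row in rows]}
    detail = conn.execute(
        "SELECT value FROM detail_data WHERE name = ? ORDER BY value", (target,)
    ).fetchall()
    result = conn.execute(
        "SELECT value FROM result_data WHERE name = ?", (target,)
    ).fetchall()
    return {
        "detail": [row[0] for row in detail],
        "result": [row[0] for row in result],
    }


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE raw_data (name TEXT, value INTEGER);
    CREATE TABLE detail_data (name TEXT, value INTEGER);
    CREATE TABLE result_data (name TEXT, value INTEGER);
    INSERT INTO raw_data VALUES ('page_views', 3);
    INSERT INTO detail_data VALUES ('daily_orders', 10), ('daily_orders', 20);
    INSERT INTO result_data VALUES ('daily_orders', 30);
    """
)
first = handle_query(conn, "page_views", "first")
second = handle_query(conn, "daily_orders", "second")
```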
In an embodiment, as shown in fig. 7 in conjunction with fig. 6, after S601, the method further includes:
S701, the electronic device receives a data verification instruction for performing data verification on target data.
Specifically, when the request end performs business analysis or model training by using the target data, the accuracy of the target data needs to be verified. In this case, if the target data is abnormal, the operation and maintenance personnel need to call the data verification script. Accordingly, the electronic device receives a data verification instruction for performing data verification on the target data.
S702, the electronic equipment responds to the data verification instruction, and data verification operation is carried out on the target data according to the detail data and the result data of the target data.
The data verification operation is used for verifying the integrity of the target data.
Specifically, after receiving a data verification instruction for performing data verification on target data, when the target data is the second type of data, the electronic device may respond to the data verification instruction to obtain target detail data and result data corresponding to the target data.
After obtaining the target detail data corresponding to the target data, the electronic device may perform a data verification operation on the target data according to the target detail data and the result data.
A data verification operation is a verification operation performed to ensure the integrity of data. Typically, the electronic device uses a designated algorithm to calculate a check value for the data to be checked, and the storage device calculates a check value once with the same algorithm and sends it to the electronic device. If the electronic device determines that the check values obtained by the two calculations are the same, the data to be checked is complete.
In connection with the above example, ClickHouse stores: the detail data and result data obtained each time the real-time computing engine Flink performs a data processing operation on the original data at the DWD layer, and the calculated data obtained after the real-time computing engine Flink performs a data processing operation on the original data at the DWS layer.
When data verification is performed, the electronic device can respond to the data verification instruction, obtain the detail data and result data produced each time Flink performed a data processing operation on the original data at the DWD layer, as well as the calculated data produced after Flink performed a data processing operation on the original data at the DWS layer, and perform data verification according to the obtained data, which improves the accuracy of the data verification.
Illustratively, after acquiring the target detail data A corresponding to the target data, the electronic device performs a calculation on the target detail data according to the verification rule to obtain a check value M. Correspondingly, the electronic device can also send a verification instruction to the storage device; the storage device responds to the verification instruction, performs the same calculation on the target detail data to obtain a check value N, and sends the check value N to the electronic device. After receiving the check value N, the electronic device determines whether the check value N is consistent with the check value M. If they are consistent, the target detail data is complete.
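The comparison of check values M and N can be illustrated with the sketch below. The disclosure does not name the designated algorithm; SHA-256 over a canonical JSON encoding is an assumption chosen for this example, and the function names check_value and is_complete are hypothetical.

```python
import hashlib
import json


def check_value(rows):
    """Compute a check value over rows using a canonical JSON encoding.

    Assumption: SHA-256 plays the role of the "designated algorithm".
    """
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_complete(device_rows, storage_rows):
    """Compare check value M (electronic device) with check value N (storage device)."""
    m = check_value(device_rows)
    n = check_value(storage_rows)  # in the described flow, computed on the storage device
    return m == n


detail_rows = [{"order_id": 1, "amount": 10}, {"order_id": 2, "amount": 20}]
intact = is_complete(detail_rows, list(detail_rows))
truncated = is_complete(detail_rows, detail_rows[:1])
```

Both sides must serialize the data identically before hashing, which is why the encoding is fixed with sorted keys and explicit separators.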
The technical scheme provided by the embodiment at least has the following beneficial effects: as can be seen from S701-S702, when performing data verification on target data, the electronic device may receive a data verification instruction for performing data verification on the target data. When the target data is the second type of data, because the detail data and the result data obtained after the real-time computing engine performs the data processing operation on the original data each time and the calculated data obtained after the real-time computing engine performs the data processing operation on the original data are stored in the clickstream data warehouse, the electronic device can respond to the data verification instruction, obtain the target detail data and the result data corresponding to the target data, perform the data verification operation on the target data according to the target detail data and the result data, and improve the accuracy of data verification.
In an embodiment, referring to fig. 4, as shown in fig. 8, when the target data includes a plurality of first-type data greater than a preset number, in S402, the method for the electronic device to obtain the target data from the clickstream data store in response to the data query instruction specifically includes:
S801, the electronic device responds to a data query instruction and acquires a plurality of first-class data from a click stream data warehouse.
Specifically, after the plurality of first-class data are acquired, if the plurality of first-class data cannot meet the data requirements of subsequent business analysis or model training, feature engineering processing needs to be performed on the plurality of first-class data to obtain a data set.
Generally, for target data which is acquired from the clickstream data warehouse and can be directly used, the electronic device may add the target data to a message queue for a subsequent request end to perform business analysis or model training.
For target data (i.e., a plurality of first-class data larger than a preset number) which cannot be directly used and acquired from the clickstream data warehouse, the electronic device may perform feature engineering processing on the target data to obtain a data set.
The feature engineering processing here takes the form of ETL, that is, the process of extracting (Extract), transforming (Transform), and loading (Load) data from a source to a destination.
S802, the electronic equipment calls a pre-generated configuration file, converts the first type data into a target data set, and determines the target data set as target data.
Specifically, after the electronic device responds to the data query instruction and acquires a plurality of first-class data from the click stream data warehouse, a pre-generated configuration file can be called, the plurality of first-class data are converted into a target data set, and the target data set is determined as target data; the configuration file is used to convert the plurality of data into a data set.
The configuration file is a pre-written SQL script. After the electronic device obtains the plurality of first-class data, an executable program in the electronic device can be called to read the SQL script and convert the plurality of first-class data into a target data set.
For example, when the plurality of first-class data are 2 original hive data tables, the electronic device may obtain the plurality of first-class data, call an SQL script for data conversion on ClickHouse, and perform materialized view processing on the plurality of first-class data to obtain a wide table. The wide table is the data set.
Thus, processing the target data on ClickHouse is simpler than processing the target data on Flink. Simple statistical index data can be obtained merely by calling a pre-written SQL script in the click stream data warehouse, without developing a complex Flink script, which makes development more convenient.
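The conversion of two original tables into a wide table by a pre-written SQL script can be sketched as follows. This is an assumption-laden illustration: sqlite3 stands in for ClickHouse, CREATE TABLE ... AS SELECT stands in for materialized view processing (sqlite3 has no materialized views), and the tables users and clicks and the script WIDE_TABLE_SQL are hypothetical.

```python
import sqlite3

# Hypothetical "configuration file": a pre-written SQL script that builds the wide table.
WIDE_TABLE_SQL = """
CREATE TABLE wide_table AS
SELECT u.user_id, u.city, c.clicks
FROM users AS u
JOIN clicks AS c ON c.user_id = u.user_id;
"""

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (user_id INTEGER, city TEXT);
    CREATE TABLE clicks (user_id INTEGER, clicks INTEGER);
    INSERT INTO users VALUES (1, 'Beijing'), (2, 'Shanghai');
    INSERT INTO clicks VALUES (1, 5), (2, 7);
    """
)

# Read and execute the script, converting the two source tables into one data set.
conn.executescript(WIDE_TABLE_SQL)
wide_rows = conn.execute(
    "SELECT user_id, city, clicks FROM wide_table ORDER BY user_id"
).fetchall()
```

Because the conversion is expressed purely in SQL, changing the data set only requires editing the script rather than the program that runs it.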
As a further alternative, the target data may also be the second type of data. When the target data is the second type of data, the electronic device may obtain, from the clickstream data warehouse, the target data produced by a given data processing operation of the real-time computing engine. Then, the electronic device responds to the received data processing instruction and executes new feature engineering processing on the target data to obtain feature data.
Thus, complex statistical index data (which requires operations such as multi-table association and wide-table construction) can first be preliminarily calculated by Flink, written into ClickHouse, and then subjected to a secondary aggregation calculation. The electronic device can call both a Flink streaming computation engine and a ClickHouse feature engineering processing engine deployed in the electronic device to process the data to be processed, which improves data processing efficiency.
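The two-stage calculation can be sketched in plain Python as below. Neither function uses Flink or ClickHouse; preliminary_join only imitates the multi-table association that the text assigns to Flink, and secondary_aggregate imitates the later aggregation on ClickHouse. All names and sample records are hypothetical.

```python
from collections import defaultdict


def preliminary_join(events, users):
    """Stage 1 (stand-in for Flink): associate each event with user attributes."""
    city_of = {user["user_id"]: user["city"] for user in users}
    return [
        {"city": city_of[event["user_id"]], "amount": event["amount"]}
        for event in events
        if event["user_id"] in city_of
    ]


def secondary_aggregate(wide_rows):
    """Stage 2 (stand-in for ClickHouse): aggregate the stored wide rows per city."""
    totals = defaultdict(int)
    for row in wide_rows:
        totals[row["city"]] += row["amount"]
    return dict(totals)


users = [{"user_id": 1, "city": "Beijing"}, {"user_id": 2, "city": "Shanghai"}]
events = [
    {"user_id": 1, "amount": 10},
    {"user_id": 2, "amount": 5},
    {"user_id": 1, "amount": 7},
]
wide_rows = preliminary_join(events, users)   # would be written to the warehouse
totals = secondary_aggregate(wide_rows)       # would run as a query on the warehouse
```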
The technical scheme provided by the embodiment at least has the following beneficial effects: from S801 to S802, when the target data includes a plurality of first-type data greater than the preset number, the electronic device may call a pre-generated configuration file after acquiring the plurality of first-type data from the clickstream data warehouse in response to the data query instruction, convert the plurality of first-type data into a target data set, and determine the target data set as the target data; the configuration file is used for converting the plurality of data into a data set so as to be used for subsequent request ends to perform service analysis or model training, and the data processing efficiency is improved.
It is understood that, in practical implementation, the terminal/server according to the embodiments of the present disclosure may include one or more hardware structures and/or software modules for implementing the corresponding data processing methods, and these hardware structures and/or software modules may constitute an electronic device. Those of skill in the art will readily appreciate that the present disclosure can be implemented in hardware or a combination of hardware and computer software for implementing the exemplary algorithm steps described in connection with the embodiments disclosed herein. Whether a function is performed by hardware or by computer software driving hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
Based on such understanding, the embodiment of the present disclosure also provides a data processing apparatus, which can be applied to an electronic device. Fig. 9 shows a schematic structural diagram of a data processing apparatus provided in an embodiment of the present disclosure. As shown in fig. 9, the data processing apparatus may include: an obtaining unit 901 and a processing unit 902;
an obtaining unit 901, configured to obtain data to be processed, where the data to be processed includes first class data or second class data, and complexity of the second class data is higher than complexity of the first class data;
the processing unit 902 is configured to, if the data to be processed is first-class data, write the first-class data into the clickstream data warehouse;
the processing unit 902 is further configured to, if the data to be processed is second-class data, perform at least one processing operation on the second-class data, and write the detail data and result data obtained in each processing operation into the clickstream data warehouse.
Optionally, the data processing apparatus further includes: a receiving unit 903 and an output unit 904;
a receiving unit 903, configured to receive a data query instruction for acquiring target data; the target data comprises a first type of data or a second type of data;
the obtaining unit 901 is further configured to obtain target data from a click stream data warehouse in response to a data query instruction;
an output unit 904 for outputting the target data.
Optionally, when the target data is second-class data, the obtaining unit 901 is specifically configured to:
and responding to the data query instruction, and acquiring detail data and result data of the target data from the clickstream data warehouse.
Optionally, the receiving unit 903 is further configured to receive a data verification instruction for performing data verification on the target data;
the processing unit 902 is further configured to, in response to the data verification instruction, perform a data verification operation on the target data according to the detail data and the result data of the target data; the data verification operation is used to verify the integrity of the target data.
Optionally, when the target data includes a plurality of first-type data larger than the preset number, the obtaining unit 901 is specifically configured to:
responding to a data query instruction, and acquiring a plurality of first-class data from a click stream data warehouse;
calling a pre-generated configuration file, converting a plurality of first-class data into a target data set, and determining the target data set as target data; the configuration file is used to convert the plurality of data into a data set.
Optionally, the obtaining unit 901 is further configured to obtain a data type identifier of the data to be processed;
the processing unit 902 is further configured to determine a data type of the data to be processed according to the data type identifier;
the processing unit 902 is further configured to determine complexity of the data to be processed according to the data type.
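The chain from data type identifier to data type to complexity can be sketched as two lookups. The identifiers, type names, and mapping tables below are hypothetical; the disclosure does not specify concrete values in this passage.

```python
from enum import Enum


class Complexity(Enum):
    FIRST_CLASS = 1   # simple data, written directly to the warehouse
    SECOND_CLASS = 2  # more complex data, processed before being written


# Hypothetical mappings, for illustration only.
TYPE_BY_IDENTIFIER = {
    "raw_log": "original_data",
    "multi_join": "statistical_index_data",
}
COMPLEXITY_BY_TYPE = {
    "original_data": Complexity.FIRST_CLASS,
    "statistical_index_data": Complexity.SECOND_CLASS,
}


def complexity_of(type_identifier):
    """Resolve identifier -> data type -> complexity; raise on unknown identifiers."""
    try:
        data_type = TYPE_BY_IDENTIFIER[type_identifier]
    except KeyError:
        raise ValueError(f"unknown data type identifier: {type_identifier!r}") from None
    return COMPLEXITY_BY_TYPE[data_type]


simple = complexity_of("raw_log")
complex_ = complexity_of("multi_join")
```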
As described above, the embodiment of the present disclosure may perform division of functional modules on an electronic device according to the above method example. The integrated module can be realized in a hardware form, and can also be realized in a software functional module form. In addition, it should be further noted that the division of the modules in the embodiments of the present disclosure is schematic, and is only a logic function division, and there may be another division manner in actual implementation. For example, the functional blocks may be divided for the respective functions, or two or more functions may be integrated into one processing block.
With regard to the data processing apparatus in the foregoing embodiments, the specific manner in which each module performs operations and the beneficial effects thereof have been described in detail in the foregoing method embodiments, and are not described herein again.
The embodiment of the disclosure also provides a terminal, which can be a user terminal such as a mobile phone, a computer and the like. Fig. 10 shows a schematic structural diagram of a terminal provided in an embodiment of the present disclosure. The terminal, which may be a data processing device, may include at least one processor 61, a communication bus 62, a memory 63, and at least one communication interface 64.
The processor 61 may be a Central Processing Unit (CPU), a micro-processing unit, an ASIC, or one or more integrated circuits for controlling the execution of programs according to the present disclosure. As an example, in connection with fig. 9, the processing unit 902 in the electronic device implements the same functions as the processor 61 in fig. 10.
The communication bus 62 may include a path that carries information between the aforementioned components.
The communication interface 64 may be any device, such as a transceiver, for communicating with other devices or communication networks, such as a server, an ethernet, a Radio Access Network (RAN), a Wireless Local Area Network (WLAN), etc.
The memory 63 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that may store static information and instructions, a Random Access Memory (RAM) or other type of dynamic storage device that may store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disk storage, optical disk storage (including compact disc, laser disc, optical disc, digital versatile disc, blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory may be self-contained and connected to the processing unit by a bus. The memory may also be integrated with the processing unit.
The memory 63 is used for storing application program code for executing the disclosed solution, and execution of the code is controlled by the processor 61. The processor 61 is configured to execute the application program code stored in the memory 63 to implement the functions in the disclosed method.
In particular implementations, as one embodiment, the processor 61 may include one or more CPUs, such as CPU0 and CPU1 in fig. 10.
In one implementation, as an example, the terminal may include multiple processors, such as processor 61 and processor 65 in fig. 10. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).
In one implementation, as one example, the terminal may further include an input device 66 and an output device 67. The input device 66 is in communication with the processor 61 and may accept user input in a variety of ways. For example, the input device 66 may be a mouse, a keyboard, a touch screen device, a sensing device, or the like. The output device 67 is in communication with the processor 61 and may display information in a variety of ways. For example, the output device 67 may be a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display device, or the like.
Those skilled in the art will appreciate that the configuration shown in fig. 10 does not limit the terminal, which may include more or fewer components than those shown, combine some components, or use a different arrangement of components.
The embodiment of the disclosure also provides a server. Fig. 11 shows a schematic structural diagram of a server provided by an embodiment of the present disclosure. The server may be a data processing device. The server, which may vary widely in configuration or performance, may include one or more processors 71 and one or more memories 72. At least one instruction is stored in the memory 72, and the at least one instruction is loaded and executed by the processor 71 to implement the data processing method provided by the above-mentioned method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface, so as to perform input/output, and the server may also include other components for implementing the functions of the device, which are not described herein again.
The present disclosure also provides a computer-readable storage medium including instructions stored thereon, which, when executed by a processor of a computer device, enable a computer to perform the data processing method provided by the above-described illustrated embodiment. For example, the computer readable storage medium may be a memory 63 comprising instructions executable by the processor 61 of the terminal to perform the above described method. Also for example, the computer readable storage medium may be a memory 72 comprising instructions executable by a processor 71 of the server to perform the above-described method. Alternatively, the computer readable storage medium may be a non-transitory computer readable storage medium, for example, which may be a ROM, a RAM, a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
The present disclosure also provides a computer program product comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the data processing method illustrated in any of the above-mentioned fig. 1-8.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A data processing method, comprising:
acquiring data to be processed, wherein the data to be processed comprises first class data or second class data, and the complexity of the second class data is higher than that of the first class data;
if the data to be processed is the first type of data, writing the first type of data into a click stream data warehouse;
and if the data to be processed is the second type of data, executing at least one processing operation on the second type of data, and writing the detail data and result data obtained by each processing operation into the clickstream data warehouse.
2. The data processing method of claim 1, further comprising:
receiving a data query instruction for acquiring target data; the target data comprises the first type of data or the second type of data;
responding to the data query instruction, and acquiring the target data from the clickstream data warehouse;
and outputting the target data.
3. The data processing method of claim 2, wherein when the target data is the second type of data, the obtaining the target data from the clickstream data store in response to the data query instruction comprises:
and responding to the data query instruction, and acquiring detail data and result data of the target data from the clickstream data warehouse.
4. The data processing method of claim 3, further comprising:
receiving a data verification instruction for performing data verification on the target data;
responding to the data verification instruction, and performing data verification operation on the target data according to the detail data and result data of the target data; the data verification operation is used for verifying the integrity of the target data.
5. The data processing method of claim 2, wherein when the target data includes a plurality of first-class data greater than a preset number, the obtaining the target data from the clickstream data store in response to the data query instruction comprises:
in response to the data query instruction, acquiring the plurality of first-class data from the clickstream data warehouse;
calling a pre-generated configuration file, converting the plurality of first-class data into a target data set, and determining the target data set as the target data; the configuration file is used to convert the plurality of data into a data set.
6. The data processing method of claim 1, further comprising:
acquiring a data type identifier of the data to be processed;
determining the data type of the data to be processed according to the data type identifier;
and determining the complexity of the data to be processed according to the data type.
7. A data processing apparatus, comprising: an acquisition unit and a processing unit;
the acquiring unit is used for acquiring data to be processed, wherein the data to be processed comprises first class data or second class data, and the complexity of the second class data is higher than that of the first class data;
the processing unit is used for writing the first type of data into a click stream data warehouse if the data to be processed is the first type of data;
the processing unit is further configured to, if the data to be processed is the second type of data, perform at least one processing operation on the second type of data, and write the detail data and result data obtained in each processing operation into the clickstream data warehouse.
8. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the data processing method of any one of claims 1-6.
9. A computer-readable storage medium having instructions stored thereon, wherein the instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the data processing method of any one of claims 1-6.
10. A computer program product comprising instructions that, when run on an electronic device, cause the electronic device to perform the data processing method of any one of claims 1-6.
CN202111289614.2A 2021-11-02 2021-11-02 Data processing method, device, equipment and storage medium Pending CN114185998A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111289614.2A CN114185998A (en) 2021-11-02 2021-11-02 Data processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111289614.2A CN114185998A (en) 2021-11-02 2021-11-02 Data processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114185998A true CN114185998A (en) 2022-03-15

Family

ID=80540597

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111289614.2A Pending CN114185998A (en) 2021-11-02 2021-11-02 Data processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114185998A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190102415A1 (en) * 2017-09-29 2019-04-04 Oracle International Corporation Recreating an oltp table and reapplying database transactions for real-time analytics
CN110147398A (en) * 2019-04-25 2019-08-20 北京字节跳动网络技术有限公司 A data processing method, device, medium and electronic equipment
CN111768850A (en) * 2020-06-05 2020-10-13 上海森亿医疗科技有限公司 Hospital data analysis method, hospital data analysis platform, device and medium
CN113010565A (en) * 2021-03-25 2021-06-22 腾讯科技(深圳)有限公司 Server cluster-based server real-time data processing method and system
CN113297270A (en) * 2021-04-09 2021-08-24 西安交大捷普网络科技有限公司 Data query method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109656963B (en) Metadata acquisition method, apparatus, device and computer readable storage medium
US20150088807A1 (en) System and method for granular scalability in analytical data processing
US11941034B2 (en) Conversational database analysis
US8682876B2 (en) Techniques to perform in-database computational programming
CN104331477A (en) Method for testing concurrency property of cloud platform based on federated research
US10394805B2 (en) Database management for mobile devices
JP2018506775A (en) Identifying join relationships based on transaction access patterns
US11379499B2 (en) Method and apparatus for executing distributed computing task
CN108885641A (en) High Performance Data Query processing and data analysis
US11507555B2 (en) Multi-layered key-value storage
CN114416855A (en) Visualization platform and method based on electric power big data
CN111046059B (en) Low-efficiency SQL statement analysis method and system based on distributed database cluster
US20140379691A1 (en) Database query processing with reduce function configuration
CN115335821B (en) Offloading statistics collection
US20150120697A1 (en) System and method for analysis of a database proxy
CN117271481B (en) Automatic database optimization method and equipment
CN116057518A (en) Automatic query predicate selective prediction using machine learning model
CN116737753A (en) Service data processing method, device, computer equipment and storage medium
US11989196B2 (en) Object indexing
US20200089799A1 (en) Cube construction for an olap system
US11645283B2 (en) Predictive query processing
CN114185998A (en) Data processing method, device, equipment and storage medium
CN111143328A (en) Agile business intelligent data construction method, system, equipment and storage medium
US12099575B2 (en) Auto-triage failures in A/B testing
US12072890B2 (en) Visualization data reuse in a data analysis system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination