CN116010340A - Data table management method and device - Google Patents
- Publication number
- CN116010340A (application CN202211732116.5A)
- Authority
- CN
- China
- Prior art keywords
- target
- data
- object table
- record
- historical
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a data table management method and device. The method comprises: determining a multi-item target operation record to be managed from an operation data set; counting table parameter information corresponding to each target operation object table included in the multi-item target operation record, wherein the table parameter information comprises accumulated operation times and historical operation time, the accumulated operation times indicate the number of times historical data operations have been performed on the corresponding target operation object table, and the historical operation time indicates the time at which a historical data operation was last performed on the corresponding target operation object table; and determining a target operation object table whose table parameter information satisfies the screening condition as a redundant object table, and deleting the target operation records corresponding to the redundant object table from the operation data set. The invention thereby solves the technical problem of low data management efficiency in the prior art.
Description
Technical Field
The present invention relates to the field of data processing, and in particular, to a method and apparatus for managing a data table.
Background
A laboratory environment is a big data environment for data exploration, analysis, modeling, and similar scenarios; it is not a production system. Data generation in a laboratory environment is more random, a life cycle management strategy is difficult to decide manually, expired and invalid data are plentiful, storage resources are seriously wasted, and the data is hard to manage.
To address these problems, the conventional approach manages the data life cycle of a data table according to a simple rule, stopping management when the creation duration of the data table is less than or equal to a preset duration. However, data management based on such a simple rule has a high management cost, depends mainly on manual work, and is insufficiently automated. That is, the conventional data management method suffers from the technical problem of low management efficiency.
No effective solution to the above technical problem has yet been proposed.
Disclosure of Invention
The embodiment of the invention provides a data table management method and device, which at least solve the technical problem of low management efficiency in the existing data management mode.
According to an embodiment of the present invention, there is provided a method of managing a data table, including:
determining a multi-item target operation record to be managed from an operation data set, wherein the operation data set comprises a plurality of operation records, the operation records comprise historical data operations executed on an operation object table, and a database to which a target operation object table contained in the multi-item target operation record belongs is a target database;
counting table parameter information corresponding to a target operation object table included in the multi-item target operation record, wherein the table parameter information comprises accumulated operation times and historical operation time, the accumulated operation times indicate the operation times of the corresponding target operation object table for executing the historical data operation, and the historical operation time indicates the operation time of the corresponding target operation object table for executing the historical data operation last time;
and determining a target operation object table of which the table parameter information meets the screening condition as a redundant object table, and deleting the target operation record corresponding to the redundant object table from the operation data set.
Optionally, the determining the target operation object table that the table parameter information satisfies the filtering condition as the redundant object table includes:
sorting a plurality of target operation object tables included in the multi-item target operation record according to the sequence of the historical operation time, and determining a target table sequence;
and determining a target operation object table which is in the target order in the target table sequence and corresponds to the accumulated operation times less than or equal to a target threshold value as the redundant object table.
Optionally, the determining the target operation object table that the table parameter information satisfies the filtering condition as the redundant object table includes:
under the condition that a first type of operation object table exists in the plurality of target operation object tables, sequencing the first type of operation object tables according to the sequence of the historical operation time, and determining a first type of table sequence, wherein data in the first type of operation object tables are temporary data; determining a target operation object table which is in the first target order and has the accumulated operation times smaller than or equal to a first target threshold value in the first class table sequence as the redundant object table;
under the condition that a second type of operation object table exists in the plurality of target operation object tables, sorting the second type of operation object tables according to the sequence of the historical operation time, and determining a second type of table sequence, wherein data in the second type of operation object tables are common data; determining a target operation object table which is in the second class table sequence and is in a second target order and has the accumulated operation times smaller than or equal to a second target threshold value as a candidate redundant object table; and determining the redundant object table from the candidate redundant object tables according to the checking result of the candidate redundant object tables.
Optionally, the determining the multi-item target operation record to be managed from the operation data set includes:
acquiring first-type operation data and second-type operation data from the operation data set, wherein the first-type operation data comprises operation records of data operations performed manually by a user, and the second-type operation data comprises operation records of data operations performed automatically by a data query tool;
and preprocessing the first-type operation data and the second-type operation data to obtain the multi-item target operation record.
Optionally, the preprocessing the first-type operation data and the second-type operation data to obtain the multi-item target operation record includes:
traversing the first type operation data and the second type operation data, and screening out a reference operation record containing a reference operation object table from the first type operation data and the second type operation data, wherein a database to which the reference operation object table belongs is the target database;
searching for and deleting redundant characters in the reference operation record, wherein the redundant characters comprise line feed characters and comment symbols;
and determining the processed reference operation record as the target operation record.
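The preprocessing steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: it assumes each operation record is a raw SQL string, that tables are referenced as `database.table`, and that comments use the SQL `--` and `/* */` conventions. The function name `preprocess_records` and the database name `lab_db` in the usage note are hypothetical. The comment stripping is deliberately naive and would also remove comment markers that appear inside string literals.

```python
import re

# Assumed comment syntax: "/* ... */" block comments and "-- ..." line comments.
_BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
_LINE_COMMENT = re.compile(r"--[^\n]*")


def preprocess_records(records, target_db):
    """Keep records that reference a table in target_db, then strip comments and line feeds."""
    # Assumption: tables are written as "<database>.<table>".
    table_ref = re.compile(r"\b" + re.escape(target_db) + r"\.\w+", re.IGNORECASE)
    cleaned = []
    for raw in records:
        if not table_ref.search(raw):
            continue  # the record does not touch the target database
        text = _BLOCK_COMMENT.sub(" ", raw)
        text = _LINE_COMMENT.sub(" ", text)
        text = " ".join(text.split())  # removes line feeds and collapses whitespace
        cleaned.append(text)
    return cleaned
```

For instance, given a record `"SELECT *\nFROM lab_db.t1 -- note\nWHERE x = 1"` and the target database `lab_db`, the sketch yields the single-line statement `SELECT * FROM lab_db.t1 WHERE x = 1`, while records that reference only other databases are dropped.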
Optionally, the counting the accumulated operation times and the historical operation time of the plurality of target operation object tables included in the multi-item target operation record includes:
executing the following operations in a loop until the multi-item target operation record has been traversed:
acquiring one target operation record as a current operation record;
acquiring a current operation object table corresponding to the historical data operation in the current operation record under the condition that the operation type of the historical data operation included in the current operation record is a target operation type; under the condition that the current operation object table is found in the statistics list, updating the accumulated operation times and the historical operation time of the current operation object table in the statistics list; under the condition that the current operation object table is not found in the statistics list, adding the current operation object table, together with its accumulated operation times and historical operation time, to the statistics list;
and acquiring the next target operation record under the condition that the operation type of the historical data operation included in the current operation record is not the target operation type.
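The counting loop described above can be sketched as follows. This is an illustrative sketch only: it assumes each target operation record has already been parsed into an (operation type, table name, operation time) tuple, and it models the statistics list as a Python dictionary. The function name `count_table_stats` and the tuple layout are hypothetical.

```python
from datetime import datetime


def count_table_stats(records, target_types):
    """Build the statistics list: table -> [accumulated operation times, historical operation time]."""
    stats = {}
    for op_type, table, op_time in records:
        if op_type not in target_types:
            continue  # not a target operation type: move on to the next record
        if table in stats:
            # Table found in the statistics list: update count and last operation time.
            stats[table][0] += 1
            stats[table][1] = max(stats[table][1], op_time)
        else:
            # Table not found: add it with a count of 1 and this operation time.
            stats[table] = [1, op_time]
    return stats
```

Taking the maximum of the stored and incoming timestamps keeps the most recent operation time even if the records are not traversed in chronological order.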
According to another embodiment of the present invention, there is also provided a management apparatus for a data table, including:
a first determining unit, configured to determine a multi-item target operation record to be managed from an operation data set, where the operation data set includes a plurality of operation records, the operation records include a historical data operation performed on an operation object table, and a database to which a target operation object table included in the multi-item target operation record belongs is a target database;
a statistics unit, configured to count table parameter information corresponding to a target operation object table included in the multi-item target operation record, where the table parameter information includes accumulated operation times and a historical operation time, the accumulated operation times indicate the number of times the historical data operation has been performed on the corresponding target operation object table, and the historical operation time indicates the time at which the historical data operation was last performed on the corresponding target operation object table;
and a second determining unit configured to determine a target operation object table for which the table parameter information satisfies a filtering condition as a redundant object table, and delete the target operation record corresponding to the redundant object table from the operation data set.
Optionally, the second determining unit includes:
the sorting module is used for sorting a plurality of target operation object tables included in the multi-item target operation record according to the sequence of the historical operation time, and determining a target table sequence;
and the determining module is used for determining the target operation object table which is in the target order and the corresponding accumulated operation times of which are smaller than or equal to the target threshold value in the target table sequence as the redundant object table.
Optionally, the second determining unit includes:
the first sorting module is used for sorting the first type of operation object tables according to the sequence of the historical operation time under the condition that the first type of operation object tables exist in the plurality of target operation object tables, and determining a first type of table sequence, wherein the data in the first type of operation object tables are temporary data; determining a target operation object table which is in the first target order and has the accumulated operation times smaller than or equal to a first target threshold value in the first class table sequence as the redundant object table;
the second sorting module is used for sorting the second-type operation object tables according to the sequence of the historical operation time under the condition that the second-type operation object tables exist in the plurality of target operation object tables, and determining a second-type table sequence, wherein data in the second-type operation object tables are common data; determining a target operation object table which is in the second class table sequence and is in a second target order and has the accumulated operation times smaller than or equal to a second target threshold value as a candidate redundant object table; and determining the redundant object table from the candidate redundant object tables according to the checking result of the candidate redundant object tables.
Optionally, the first determining unit includes:
the acquisition module is used for acquiring first-type operation data and second-type operation data from the operation data set, wherein the first-type operation data comprises operation records of data operations performed manually by a user, and the second-type operation data comprises operation records of data operations performed automatically by a data query tool;
the preprocessing module is used for preprocessing the first-type operation data and the second-type operation data to obtain the multi-item target operation record.
Optionally, the preprocessing module includes:
the traversing sub-module is used for traversing the first-type operation data and the second-type operation data, and screening out a reference operation record containing a reference operation object table from the first-type operation data and the second-type operation data, wherein a database to which the reference operation object table belongs is the target database;
the searching sub-module is used for searching for and deleting redundant characters in the reference operation record, wherein the redundant characters comprise line feed characters and comment symbols;
and the determining submodule is used for determining the processed reference operation record as the target operation record.
Optionally, the second determining unit includes:
the loop module is used for executing the following operations in a loop until the multi-item target operation record has been traversed: acquiring one target operation record as a current operation record; acquiring a current operation object table corresponding to the historical data operation in the current operation record under the condition that the operation type of the historical data operation included in the current operation record is a target operation type; under the condition that the current operation object table is found in the statistics list, updating the accumulated operation times and the historical operation time of the current operation object table in the statistics list; under the condition that the current operation object table is not found in the statistics list, adding the current operation object table, together with its accumulated operation times and historical operation time, to the statistics list; and acquiring the next target operation record under the condition that the operation type of the historical data operation included in the current operation record is not the target operation type.
According to a further embodiment of the present invention, there is also provided a computer-readable storage medium having stored therein a computer program, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
According to a further embodiment of the invention, there is also provided an electronic device comprising a memory having stored therein a computer program, and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
According to the present invention, a multi-item target operation record to be managed is determined from an operation data set; table parameter information corresponding to each target operation object table included in the multi-item target operation record is counted, wherein the table parameter information comprises accumulated operation times and historical operation time, the accumulated operation times indicate the number of times the historical data operation has been performed on the corresponding target operation object table, and the historical operation time indicates the time at which the historical data operation was last performed on the corresponding target operation object table; and a target operation object table whose table parameter information satisfies the screening condition is determined as a redundant object table, and the target operation records corresponding to the redundant object table are deleted from the operation data set. In this way, target operation records that satisfy the conditions on accumulated operation times and historical operation time are identified as redundant records based on statistics over the target operation records in the operation data set, and the redundant records are then deleted automatically. This improves the efficiency of data management and solves the technical problem of low management efficiency in the existing data management approach.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
FIG. 1 is a block diagram of a hardware configuration of a mobile terminal of a data table management method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method of managing a data table according to an embodiment of the invention;
FIG. 3 is a schematic diagram of an implementation of a method for managing a data table according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of another implementation of a method for managing a data table according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an implementation of a method of managing a data table according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an implementation of a method of managing a data table according to yet another embodiment of the present invention;
FIG. 7 is a schematic diagram of a method of managing a data table according to an embodiment of the present invention;
FIG. 8 is a block diagram of a management apparatus of a data table according to an embodiment of the present invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the drawings in conjunction with embodiments. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
Example 1
The method embodiment provided in the first embodiment of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking execution on a mobile terminal as an example, FIG. 1 is a block diagram of the hardware structure of a mobile terminal for the data table management method according to an embodiment of the present invention. As shown in FIG. 1, the mobile terminal may include one or more processors 102 (only one is shown in FIG. 1) and a memory 104 for storing data; the processors 102 may include, but are not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA). Optionally, the mobile terminal may further include a transmission device 106 for communication functions and an input/output device 108. It will be appreciated by those skilled in the art that the structure shown in FIG. 1 is merely illustrative and does not limit the structure of the mobile terminal described above. For example, the mobile terminal may include more or fewer components than shown in FIG. 1, or have a different configuration from that shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to a method for managing a data table in an embodiment of the present invention, and the processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, to implement the above-mentioned method. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the mobile terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is configured to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, abbreviated NIC) that can connect to other network equipment via a base station so as to communicate with the Internet. In one example, the transmission device 106 may be a radio frequency (Radio Frequency, abbreviated RF) module, which is used to communicate with the Internet wirelessly.
First, terms appearing later in the present application will be described:
Hadoop cluster: a distributed system architecture developed by the Apache Foundation.
Hive: a Hadoop-based data warehouse tool used for extracting, transforming, and loading data. It provides a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop, can map structured data files to database tables, and provides SQL query functions.
CDH: the commercial Hadoop distribution from Cloudera, built specifically to meet business needs.
Data lifecycle management: an important component of a data management system. It is a policy-based method for managing the flow of data throughout its life cycle, from creation and initial storage until the data becomes obsolete and is deleted, that is, the process of data from creation or acquisition to destruction.
Laboratory environment: a big data environment for data exploration, analysis, modeling, and similar scenarios; it is not a production system.
In this embodiment, a method for managing a data table running on the mobile terminal or the network architecture is provided, and fig. 2 is a flowchart of a method for managing a data table according to an embodiment of the present invention, as shown in fig. 2, where the flowchart includes the following steps:
S202, determining a multi-item target operation record to be managed from an operation data set;
wherein the operation data set comprises a plurality of operation records, the operation records comprise historical data operations executed on an operation object table, and a database to which a target operation object table contained in the multi-item target operation record belongs is a target database;
S204, counting table parameter information corresponding to a target operation object table included in the multi-item target operation record, wherein the table parameter information comprises accumulated operation times and historical operation time, the accumulated operation times indicate the number of times the historical data operation has been performed on the corresponding target operation object table, and the historical operation time indicates the time at which the historical data operation was last performed on the corresponding target operation object table;
S206, determining the target operation object table with the table parameter information meeting the screening condition as a redundant object table, and deleting the target operation record corresponding to the redundant object table from the operation data set.
In step S202, the operation data set may specifically be operation data generated in a laboratory environment, and the operation data may include, but is not limited to, records of operations such as inserting into, deleting from, updating, and querying a data table. The target operation records may be the records relating to the database with which the currently executed data table management task is concerned. In other words, the target database is the database that the current data table management task concerns, and a target operation record is an operation record corresponding to an object table included in the target database.
Through the above embodiment of the present application, a multi-item target operation record to be managed is determined from an operation data set; table parameter information corresponding to each target operation object table included in the multi-item target operation record is counted, wherein the table parameter information comprises accumulated operation times and historical operation time, the accumulated operation times indicate the number of times the historical data operation has been performed on the corresponding target operation object table, and the historical operation time indicates the time at which the historical data operation was last performed on the corresponding target operation object table; and a target operation object table whose table parameter information satisfies the screening condition is determined as a redundant object table, and the target operation records corresponding to the redundant object table are deleted from the operation data set. In this way, target operation records that satisfy the conditions on accumulated operation times and historical operation time are identified as redundant records based on statistics over the target operation records in the operation data set, and the redundant records are then deleted automatically. This improves the efficiency of data management and solves the technical problem of low management efficiency in the existing data management approach.
As an optional embodiment, the determining the target operation object table in which the table parameter information satisfies the filtering condition as the redundant object table includes:
S1, sorting a plurality of target operation object tables included in a multi-item target operation record according to the sequence of historical operation time, and determining a target table sequence;
S2, determining a target operation object table which is in the target order in the target table sequence and corresponds to the accumulated operation times smaller than or equal to the target threshold value as a redundant object table.
It can be understood that, in this embodiment, after the statistics parameters corresponding to the target operation object table included in the target operation record are counted, the redundant object table may be determined from the target operation object table according to the statistics parameters, and the operation record corresponding to the redundant object table may be deleted from the operation data set.
In this embodiment, the data cleaning policy may be formulated using an LRU (least recently used) algorithm in combination with the business and technical rules of the laboratory environment.
For example, in this embodiment, the historical operation time of a target operation object table is the time at which a data operation was most recently performed on that table. Take data table 1: according to its historical operation records, update operations were performed on it at 00:00 on October 1, 2022, at 16:00 on October 15, 2022, and at 12:18 on November 1, 2022. Assuming that the current management time is December 1, 2022, the operation time closest to the current time among the three is 12:18 on November 1, 2022, so that time is determined as the historical operation time of data table 1. Because three update operations were recorded, the accumulated operation times of data table 1 is 3.
The example continues as follows. Assume that the historical operation time of data table 2 is 00:00 on October 1, 2022, and its accumulated operation times is 5, and that the historical operation time of data table 3 is 16:00 on October 15, 2022, and its accumulated operation times is 1. Sorted by historical operation time, the three data tables are arranged as: data table 2, data table 3, data table 1. Assume that the current screening conditions set the target order to 3 and the accumulated operation times threshold to 1. Data table 3, which ranks 2nd in the sequence and has accumulated operation times of 1, is then determined as the redundant data table.
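The worked example above can be reproduced with a short Python sketch. It rests on two assumptions that the text implies but does not state outright: tables are sorted from least recently to most recently operated on, and a target order of 3 means "within the first 3 positions of that sequence". The function name `find_redundant` and the dictionary layout are hypothetical.

```python
from datetime import datetime


def find_redundant(stats, target_order, target_threshold):
    """stats maps table name -> (accumulated operation times, historical operation time)."""
    # Sort from least recently to most recently operated on (LRU-style ordering).
    ordered = sorted(stats, key=lambda name: stats[name][1])
    # Within the first target_order positions, select tables at or below the threshold.
    return [name for name in ordered[:target_order] if stats[name][0] <= target_threshold]


stats = {
    "data_table_1": (3, datetime(2022, 11, 1, 12, 18)),
    "data_table_2": (5, datetime(2022, 10, 1, 0, 0)),
    "data_table_3": (1, datetime(2022, 10, 15, 16, 0)),
}
print(find_redundant(stats, target_order=3, target_threshold=1))  # prints ['data_table_3']
```

Data table 2 ranks first but has been operated on 5 times, and data table 1 has been operated on 3 times, so only data table 3 satisfies both conditions.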
As an alternative embodiment, a screening period condition may be added to the above screening conditions. For example, the screening conditions may be: among data tables whose historical operation time is more than 3 months before the current management time, a data table that ranks within the first 10 positions when sorted by historical operation time and whose accumulated operation times is less than or equal to 2 is determined as a redundant data table.
The above screening conditions are merely examples and do not limit the screening conditions actually employed.
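A screening period can be layered on top of the order and threshold conditions, as in the following sketch. The values are assumptions chosen for illustration: "3 months" is approximated as 90 days, and the table names, the function name `find_redundant_with_period`, and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def find_redundant_with_period(stats, now, min_idle, target_order, target_threshold):
    """stats maps table name -> (accumulated operation times, historical operation time)."""
    # Screening period: keep only tables last operated on more than min_idle ago.
    idle = {name: s for name, s in stats.items() if now - s[1] > min_idle}
    # Sort the remaining tables from least to most recently operated on.
    ordered = sorted(idle, key=lambda name: idle[name][1])
    # Within the first target_order positions, select tables at or below the threshold.
    return [name for name in ordered[:target_order] if idle[name][0] <= target_threshold]


stats = {
    "table_a": (2, datetime(2022, 10, 1)),
    "table_b": (1, datetime(2023, 2, 15)),  # operated on recently, excluded by the period
    "table_c": (4, datetime(2022, 9, 1)),   # idle long enough, but operated on too often
}
redundant = find_redundant_with_period(
    stats, now=datetime(2023, 3, 1), min_idle=timedelta(days=90),
    target_order=10, target_threshold=2,
)
print(redundant)  # prints ['table_a']
```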
According to this embodiment of the application, the plurality of target operation object tables included in the multi-item target operation record are sorted by historical operation time to determine a target table sequence, and a target operation object table that is within the target order in the target table sequence and whose accumulated operation times are less than or equal to the target threshold is determined as a redundant object table. A data table that was last operated on relatively long before the current management time and that has been operated on few times is thus determined as a redundant data table, which achieves the technical effects of effectively clearing redundant data and preventing mistaken data deletion to the greatest extent.
As an optional embodiment, the determining the target operation object table in which the table parameter information satisfies the filtering condition as the redundant object table includes:
S1, under the condition that a first type of operation object table exists in a plurality of target operation object tables, ordering the first type of operation object tables according to the sequence of historical operation time, and determining a first type of table sequence, wherein data in the first type of operation object tables are temporary data; determining a target operation object table which is in a first target order in the first class table sequence and has the accumulated operation times smaller than or equal to a first target threshold value as a redundant object table;
S2, under the condition that a second type of operation object table exists in the plurality of target operation object tables, sorting the second type of operation object tables according to the sequence of the historical operation time, and determining a second type of table sequence, wherein data in the second type of operation object tables are common data; determining a target operation object table which is in the second class table sequence and is in the second target order and has the accumulated operation times smaller than or equal to a second target threshold value as a candidate redundant object table; and determining the redundant object table from the candidate redundant object tables according to the checking result of the candidate redundant object tables.
In this embodiment, different cleaning strategies may be refined for the data tables stored in different databases. For example, the strategy of step S1 may be: if a table has been used fewer than 1 time in the last 3 months and its database is dfxb_tmp_db or dfxb_ceb_tmp_db (databases mainly used for storing temporary data in this embodiment), the table is deleted and destroyed, completing the life cycle of the table.
The strategy of step S2 may be: if a table has been used fewer than 1 time in the last year and its database is a database other than dfxb_tmp_db, dfxb_ceb_tmp_db, and dfxb_pro_db, the opinion of each project group is solicited, after which the table is deleted and destroyed, completing the life cycle of the table. It will be appreciated that the databases other than dfxb_tmp_db, dfxb_ceb_tmp_db, and dfxb_pro_db may be databases in which particular development project groups store common data during the execution of routine tasks.
According to the embodiment of the application, in the case that a first type of operation object table exists among the plurality of target operation object tables, the first type of operation object tables are sorted in order of historical operation time to determine a first class table sequence, wherein data in the first type of operation object tables are temporary data, and a target operation object table that is in the first target order in the first class table sequence and whose cumulative number of operations is less than or equal to a first target threshold is determined as a redundant object table. In the case that a second type of operation object table exists among the plurality of target operation object tables, the second type of operation object tables are sorted in order of historical operation time to determine a second class table sequence, wherein data in the second type of operation object tables are common data; a target operation object table that is in the second target order in the second class table sequence and whose cumulative number of operations is less than or equal to a second target threshold is determined as a candidate redundant object table, and the redundant object table is determined from the candidate redundant object tables according to the verification result of the candidate redundant object tables. Different cleaning strategies are thereby formulated for the data tables in different databases, so that the life cycle management strategy for data in the big data cluster is more automated and reasonable, redundant data can be effectively cleaned, and accidental data deletion can be prevented to the greatest extent.
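The two strategies can be illustrated with the following sketch. It is written under stated assumptions: "3 months" and "1 year" are approximated as 90 and 365 days, "fewer than 1 use in the window" is read as "no operation inside the window", the `approve` callback is a hypothetical stand-in for soliciting the project groups' opinions, and `proj_db` is an invented database name.

```python
from datetime import datetime, timedelta

TEMP_DBS = {"dfxb_tmp_db", "dfxb_ceb_tmp_db"}  # temporary-data databases named in the text
PROTECTED_DBS = {"dfxb_pro_db"}                # excluded from both strategies


def decide(table, last_op, now, approve):
    """Return 'delete' or 'keep' for a 'database.table' name."""
    db = table.split(".", 1)[0]
    if db in PROTECTED_DBS:
        return "keep"
    window = timedelta(days=90) if db in TEMP_DBS else timedelta(days=365)
    # Used inside the window: the table is still warm, so keep it.
    if last_op is not None and last_op >= now - window:
        return "keep"
    if db in TEMP_DBS:
        return "delete"  # strategy S1: temporary data, no review needed
    # Strategy S2: only a candidate; deletion needs a positive verification result.
    return "delete" if approve(table) else "keep"


now = datetime(2023, 1, 1)
approve = lambda t: t == "proj_db.t3"  # hypothetical review outcome
decisions = {
    "dfxb_tmp_db.t1": decide("dfxb_tmp_db.t1", datetime(2022, 8, 1), now, approve),
    "dfxb_tmp_db.t2": decide("dfxb_tmp_db.t2", datetime(2022, 12, 15), now, approve),
    "proj_db.t3": decide("proj_db.t3", datetime(2021, 6, 1), now, approve),
    "proj_db.t4": decide("proj_db.t4", datetime(2021, 6, 1), now, approve),
    "dfxb_pro_db.t5": decide("dfxb_pro_db.t5", datetime(2020, 1, 1), now, approve),
}
```

Keeping the verification step as a callback separates the automatic part of the policy from the manual review, so common data is never removed on statistics alone.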
As an optional implementation manner, determining the multi-item target operation record to be managed from the operation data set includes:
S1, acquiring first-type operation data and second-type operation data from the operation data set, wherein the first-type operation data comprises operation records of data operations manually executed by a user, and the second-type operation data comprises operation records of data operations automatically executed by a data query tool;
S2, preprocessing the first-type operation data and the second-type operation data to obtain the multi-item target operation record.
In this embodiment, the operation data set may be a Hadoop cluster, the first type of operation data may be HUE-side operation metadata, and the second type of operation data may be job metadata. Specifically, the HUE-side operation metadata refers to queries manually executed by users through HUE; the job metadata refers to the job scheduling status of the automated scheduling tool. Combined, the two form the complete global metadata of Hive operations. The job metadata is obtained through a self-developed design based on the YARN API: specifically, the YARN API provided by CDH is called at regular intervals, the information returned by the interface is parsed, and the running status of Hive jobs is obtained.
The logic for obtaining the job metadata through the YARN API may be as shown in fig. 3 and fig. 4: S1, periodically capturing job information; S2, parsing it to obtain the job status. As shown in fig. 3, the capturing may obtain the job metadata by periodically calling the API interface provided by CDH; as shown in fig. 4, the parsing may determine that the content of the job metadata is an operation record by obtaining a target field included in the job metadata; specifically, the target field may be "hive query string".
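The parsing step can be sketched as below. The payload layout (a `jobs` list whose entries carry `id` and a `properties` list of name/value pairs) and the field spelling `hive.query.string` are assumptions made for illustration; the actual response format returned by the CDH-provided interface is not specified in the text, and the periodic capture (for example via a scheduler) is omitted.

```python
import json

TARGET_FIELD = "hive.query.string"  # assumed spelling of the target field


def extract_queries(payload):
    """Parse one captured response and return (job_id, sql) pairs."""
    jobs = json.loads(payload).get("jobs", [])
    records = []
    for job in jobs:
        # Hypothetical layout: each job carries a flat list of name/value properties.
        props = {p["name"]: p["value"] for p in job.get("properties", [])}
        sql = props.get(TARGET_FIELD)
        if sql:  # only jobs carrying the target field are treated as operation records
            records.append((job["id"], sql))
    return records


sample = json.dumps({
    "jobs": [
        {"id": "job_1", "properties": [
            {"name": "hive.query.string", "value": "select * from db.t"},
        ]},
        {"id": "job_2", "properties": [
            {"name": "mapreduce.job.name", "value": "distcp"},
        ]},
    ]
})
records = extract_queries(sample)
```

Jobs without the target field, such as `job_2` in the sample, are simply skipped, which matches the idea of using the field's presence to decide whether the job metadata contains an operation record.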
As an optional implementation manner, preprocessing the first-type operation data and the second-type operation data to obtain the multi-item target operation record includes:
S1, traversing the first-type operation data and the second-type operation data, and screening out reference operation records containing a reference operation object table, wherein the database to which the reference operation object table belongs is the target database;
S2, searching for and deleting redundant characters in the reference operation record, wherein the redundant characters comprise line feed characters and comment characters;
S3, determining the processed reference operation record as a target operation record.
Specifically, in the present embodiment, after the job metadata and the HUE-side operation metadata are combined, the hot and cold status of the data is obtained automatically by the self-developed SQL parsing engine. The parsing engine logic is as follows:
S1, importing external dependencies and restricting the database names of the data tables included in the original data;
S2, removing line feed characters from the data obtained through the screening operation;
S3, removing comment characters and the operation records corresponding to drop operations; it can be understood that an operation record corresponding to a drop operation indicates that the corresponding operation object table has already been deleted from the operation data set, so operation records carrying this operation can be filtered out directly in this step;
S4, checking table names, where a valid table name must carry its database name;
in this embodiment, since the subsequent processing depends on the database name, it is necessary to screen out the operation records whose table names are valid, that is, carry the database name.
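Steps S1 to S4 can be sketched as a single cleaning function. This is an illustration under assumptions: the `TARGET_DBS` whitelist and the regular expressions are introduced here, and comments are stripped before line feeds (the reverse of the listed order) because a `--` comment ends at the line break and would otherwise swallow the rest of the statement. Comment markers inside string literals are not handled.

```python
import re

TARGET_DBS = {"dfxb_tmp_db", "dfxb_pro_db"}  # example whitelist (assumption)

_COMMENT = re.compile(r"--[^\n]*|/\*.*?\*/", re.DOTALL)
_DROP = re.compile(r"^\s*drop\b", re.IGNORECASE)
_QUALIFIED = re.compile(r"\b([A-Za-z_]\w*)\.([A-Za-z_]\w*)\b")


def preprocess(sql):
    """Return the cleaned statement, or None if the record is filtered out."""
    text = _COMMENT.sub(" ", sql)      # remove comments before joining lines
    text = " ".join(text.split())      # remove line feeds, collapse whitespace
    if not text or _DROP.match(text):  # drop records: the table is already gone
        return None
    # Keep the record only if it names a table qualified by a target database.
    dbs = {m.group(1).lower() for m in _QUALIFIED.finditer(text)}
    return text if dbs & TARGET_DBS else None


cleaned = preprocess("select a -- pick a\nfrom dfxb_tmp_db.t1\nwhere a > 1")
dropped = preprocess("/* cleanup */ DROP TABLE dfxb_tmp_db.t1")
unqualified = preprocess("select * from t1")
```

Records that survive are single-line statements with at least one database-qualified table name, which is the form the later counting step relies on.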
As an optional implementation manner, counting the cumulative number of operations and the historical operation time of the plurality of target operation object tables included in the multi-item target operation record includes:
The following operations are executed in a loop until the multi-item target operation record has been traversed:
S1, acquiring one target operation record as the current operation record;
S2, in the case that the operation type of the historical data operation included in the current operation record is the target operation type, acquiring the current operation object table corresponding to the historical data operation in the current operation record; in the case that the current operation object table is found in the statistics list, updating the cumulative number of operations and the historical operation time of the current operation object table in the statistics list; in the case that the current operation object table is not found in the statistics list, adding the current operation object table and its corresponding cumulative number of operations and historical operation time to the statistics list;
S3, acquiring the next target operation record in the case that the operation type of the historical data operation included in the current operation record is not the target operation type.
In this embodiment, the table names following create, insert, from, and join may be obtained iteratively by traversing each line of the screened target operation records; in addition, special characters and similar artifacts in the extracted table names are removed, and heat statistics are then carried out in the manner shown in fig. 5.
According to the embodiment of the application, in the case that the operation type of the historical data operation included in the current operation record is the target operation type, the current operation object table corresponding to the historical data operation in the current operation record is obtained; in the case that the current operation object table is found in the statistics list, the cumulative number of operations and the historical operation time of the current operation object table in the statistics list are updated; in the case that the current operation object table is not found in the statistics list, the current operation object table and its corresponding cumulative number of operations and historical operation time are added to the statistics list; in the case that the operation type of the historical data operation included in the current operation record is not the target operation type, the next target operation record is obtained. In this way, the jobs of all tenants in the Hadoop big data cluster in the laboratory environment are acquired and managed, and the Hive table names are parsed from the jobs, so that the hot and cold status of the Hive tables is obtained and the efficiency of the hot and cold statistics of the Hive tables is improved.
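The counting loop can be sketched as follows. The regular expression, the `update_stats` function, and the sample records are assumptions for illustration; the "target operation type" is approximated here by the presence of one of the keywords create, insert, from, or join, and each occurrence of a table name counts as one operation.

```python
import re
from datetime import datetime

# Table names are taken from the token that follows these keywords.
_TABLE_REF = re.compile(
    r"\b(?:create\s+table(?:\s+if\s+not\s+exists)?"
    r"|insert\s+(?:into|overwrite)(?:\s+table)?"
    r"|from|join)\s+([\w.`]+)",
    re.IGNORECASE,
)


def update_stats(stats, sql, op_time):
    """Update {table: [count, last_op_time]} from one operation record."""
    for raw in _TABLE_REF.findall(sql):
        table = raw.replace("`", "").lower()  # strip special characters
        if "." not in table:                  # keep only database-qualified names
            continue
        if table in stats:                    # found: bump count, refresh time
            stats[table][0] += 1
            stats[table][1] = max(stats[table][1], op_time)
        else:                                 # not found: add a new entry
            stats[table] = [1, op_time]
    return stats


records = [
    ("insert into db1.t_out select * from db1.t_in a join db2.t_dim b on a.k = b.k",
     datetime(2022, 5, 1)),
    ("select count(*) from db1.t_in", datetime(2022, 6, 1)),
    ("show tables", datetime(2022, 7, 1)),  # no target keyword: skipped
]
stats = {}
for sql, ts in records:
    update_stats(stats, sql, ts)
```

A dictionary keyed by table name plays the role of the statistics list, so the "found" and "not found" branches reduce to a membership test.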
A complete embodiment of the present application is described below in conjunction with fig. 7.
As shown in fig. 7, in the global metadata acquisition phase, the HUE-side operation metadata is obtained by reading the HUE metadata database, and the job metadata is obtained by parsing through the YARN API. Note: the HUE-side operation metadata refers to queries manually executed by users through HUE; the job metadata refers to the job scheduling status of the automated scheduling tool. Combined, the two form the complete global metadata of Hive operations. The acquisition through the YARN API is a self-developed design: specifically, the YARN API provided by CDH is called at regular intervals, the information returned by the interface is parsed, and the running status of Hive jobs is obtained. Acquiring the job metadata through the YARN API may include the following steps: S1, periodically capturing job information; S2, parsing it to obtain the job status.
Then, after the two are combined, the self-developed SQL parsing engine automatically obtains the hot and cold status of the data. The parsing engine logic is as follows: S1, importing external dependencies and restricting the database names; S2, removing line feed characters; S3, removing comment characters and drop operations; S4, checking table names, where a valid table name must carry its database name; S5, traversing each line and iteratively obtaining the table names following create, insert, from, and join; S6, removing special characters and similar artifacts from the extracted table names; S7, generating heat statistics; S8, sorting and printing. An example of the output result is shown in fig. 6.
Finally, a data cleaning strategy is formulated based on the LRU algorithm in combination with the business and technical rules of the laboratory environment. The specific rules are as follows:
Rule 1: if a table has been used fewer than 1 time in the last 3 months and its database is dfxb_tmp_db or dfxb_ceb_tmp_db, the table is deleted and destroyed, completing the life cycle of the table.
Rule 2: if a table has been used fewer than 1 time in the last year and its database is a database other than dfxb_tmp_db, dfxb_ceb_tmp_db, and dfxb_pro_db, the opinion of each project group is solicited, after which the table is deleted and destroyed, completing the life cycle of the table.
Through the embodiment of the application, the jobs of all tenants in the Hadoop big data cluster in a laboratory environment are first acquired and managed, and the Hive table names are parsed from the jobs by the above parsing method, so that the hot and cold status of the Hive tables is obtained; then, combining the heat analysis with the actual business of the cluster, a data archiving and cleaning strategy is proposed on the basis of the LRU algorithm, which reduces the pressure of enterprise storage costs while preventing data from being deleted by mistake.
Example 2
According to another embodiment of the present invention, there is also provided a management apparatus for a data table, fig. 8 is a block diagram of the management apparatus for a data table according to the present embodiment, as shown in fig. 8, including:
A first determining unit 82, configured to determine a multi-item target operation record to be managed from an operation data set, where the operation data set includes a plurality of operation records, the operation records include historical data operations performed on an operation object table, and the database to which a target operation object table included in the multi-item target operation record belongs is a target database;
a statistics unit 84, configured to count table parameter information corresponding to the target operation object table included in the multi-entry target operation record, where the table parameter information includes a cumulative operation number and a historical operation time, the cumulative operation number indicates an operation number of the corresponding target operation object table that is executed with the historical data operation, and the historical operation time indicates an operation time of the corresponding target operation object table that is executed with the historical data operation last time;
a second determining unit 86 configured to determine a target operation object table whose table parameter information satisfies a filtering condition as a redundant object table, and delete the target operation record corresponding to the redundant object table from the operation data set.
Optionally, the second determining unit includes:
The sorting module is used for sorting a plurality of target operation object tables included in the multi-item target operation record according to the sequence of the historical operation time, and determining a target table sequence;
and the determining module is used for determining the target operation object table which is in the target order and the corresponding accumulated operation times of which are smaller than or equal to the target threshold value in the target table sequence as the redundant object table.
Optionally, the second determining unit includes:
the first sorting module is used for sorting the first type of operation object tables according to the sequence of the historical operation time under the condition that the first type of operation object tables exist in the plurality of target operation object tables, and determining a first type of table sequence, wherein the data in the first type of operation object tables are temporary data; determining a target operation object table which is in the first target order and has the accumulated operation times smaller than or equal to a first target threshold value in the first class table sequence as the redundant object table;
the second sorting module is used for sorting the second-type operation object tables according to the sequence of the historical operation time under the condition that the second-type operation object tables exist in the plurality of target operation object tables, and determining a second-type table sequence, wherein data in the second-type operation object tables are common data; determining a target operation object table which is in the second class table sequence and is in a second target order and has the accumulated operation times smaller than or equal to a second target threshold value as a candidate redundant object table; and determining the redundant object table from the candidate redundant object tables according to the checking result of the candidate redundant object tables.
Optionally, the first determining unit includes:
the acquisition module is used for acquiring first-type operation data and second-type operation data from the operation data set, wherein the first-type operation data comprises operation records of data operations manually executed by a user, and the second-type operation data comprises operation records of data operations automatically executed by a data query tool;
the preprocessing module is used for preprocessing the first-type operation data and the second-type operation data to obtain the multi-item target operation record.
Optionally, the preprocessing module includes:
the traversing sub-module is used for traversing the first-type operation data and the second-type operation data, and screening out a reference operation record containing a reference operation object table from the first-type operation data and the second-type operation data, wherein a database to which the reference operation object table belongs is the target database;
the searching sub-module is used for searching and deleting redundant characters in the reference operation record, wherein the redundant characters comprise a line feed character and an annotation character;
and the determining submodule is used for determining the processed reference operation record as the target operation record.
Optionally, the second determining unit includes:
The loop module is used for executing the following operations in a loop until the multi-item target operation record has been traversed: acquiring one target operation record as the current operation record; acquiring the current operation object table corresponding to the historical data operation in the current operation record in the case that the operation type of the historical data operation included in the current operation record is the target operation type; in the case that the current operation object table is found in the statistics list, updating the cumulative number of operations and the historical operation time of the current operation object table in the statistics list; in the case that the current operation object table is not found in the statistics list, adding the current operation object table and its corresponding cumulative number of operations and historical operation time to the statistics list; and acquiring the next target operation record in the case that the operation type of the historical data operation included in the current operation record is not the target operation type.
It should be noted that each of the above modules may be implemented by software or hardware, and for the latter, it may be implemented by, but not limited to: the modules are all located in the same processor; alternatively, the above modules may be located in different processors in any combination.
Example 3
Embodiments of the present invention also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
Alternatively, in the present embodiment, the above-described storage medium may be configured to store a computer program for performing the steps of:
S11, determining a multi-item target operation record to be managed from an operation data set, wherein the operation data set comprises a plurality of operation records, the operation records comprise historical data operations executed on an operation object table, and the database to which a target operation object table included in the multi-item target operation record belongs is a target database;
s12, counting table parameter information corresponding to a target operation object table included in the multi-item target operation record, wherein the table parameter information comprises accumulated operation times and historical operation time, the accumulated operation times indicate the operation times of the corresponding target operation object table for executing the historical data operation, and the historical operation time indicates the operation time of the corresponding target operation object table for executing the historical data operation last time;
S13, determining a target operation object table with the table parameter information meeting the screening condition as a redundant object table, and deleting the target operation record corresponding to the redundant object table from the operation data set.
Optionally, in the present embodiment, the storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other various media capable of storing a computer program.
Example 4
An embodiment of the invention also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
S11, determining a multi-item target operation record to be managed from an operation data set, wherein the operation data set comprises a plurality of operation records, the operation records comprise historical data operations executed on an operation object table, and the database to which a target operation object table included in the multi-item target operation record belongs is a target database;
S12, counting table parameter information corresponding to a target operation object table included in the multi-item target operation record, wherein the table parameter information comprises accumulated operation times and historical operation time, the accumulated operation times indicate the operation times of the corresponding target operation object table for executing the historical data operation, and the historical operation time indicates the operation time of the corresponding target operation object table for executing the historical data operation last time;
s13, determining a target operation object table with the table parameter information meeting the screening condition as a redundant object table, and deleting the target operation record corresponding to the redundant object table from the operation data set.
Optionally, for specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments and optional implementations, which are not repeated here.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, they may alternatively be implemented in program code executable by computing devices, so that they may be stored in a memory device for execution by computing devices, and in some cases, the steps shown or described may be performed in a different order than that shown or described, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps within them may be fabricated into a single integrated circuit module for implementation. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A method of managing a data table, comprising:
determining a multi-item target operation record to be managed from an operation data set, wherein the operation data set comprises a plurality of operation records, the operation records comprise historical data operations executed on an operation object table, and a database to which a target operation object table contained in the multi-item target operation record belongs is a target database;
counting table parameter information corresponding to a target operation object table included in the multi-item target operation record, wherein the table parameter information comprises accumulated operation times and historical operation time, the accumulated operation times indicate the operation times of the corresponding target operation object table for executing the historical data operation, and the historical operation time indicates the operation time of the corresponding target operation object table for executing the historical data operation last time;
And determining a target operation object table of which the table parameter information meets the screening condition as a redundant object table, and deleting the target operation record corresponding to the redundant object table from the operation data set.
2. The method of claim 1, wherein determining the target operation object table for which the table parameter information satisfies the filtering condition as the redundant object table comprises:
sorting a plurality of target operation object tables included in the multi-item target operation record according to the sequence of the historical operation time, and determining a target table sequence;
and determining a target operation object table which is in the target order in the target table sequence and the corresponding accumulated operation times is smaller than or equal to a target threshold value as the redundant object table.
3. The method of claim 2, wherein determining the target operation object table for which the table parameter information satisfies the filtering condition as the redundant object table comprises:
under the condition that a first type of operation object table exists in the plurality of target operation object tables, sequencing the first type of operation object tables according to the sequence of the historical operation time, and determining a first type of table sequence, wherein data in the first type of operation object tables are temporary data; determining a target operation object table which is in the first target order and has the accumulated operation times smaller than or equal to a first target threshold value in the first class table sequence as the redundant object table;
Under the condition that a second type of operation object table exists in the plurality of target operation object tables, sorting the second type of operation object tables according to the sequence of the historical operation time, and determining a second type of table sequence, wherein data in the second type of operation object tables are common data; determining a target operation object table which is in the second class table sequence and is in a second target order and has the accumulated operation times smaller than or equal to a second target threshold value as a candidate redundant object table; and determining the redundant object table from the candidate redundant object table according to the checking result of the candidate redundant object table.
4. The method of claim 1, wherein determining the multi-item target operation record to be managed from the operation data set comprises:
acquiring first-type operation data and second-type operation data from the operation data set, wherein the first-type operation data comprises operation records of data operations manually executed by a user, and the second-type operation data comprises operation records of data operations automatically executed by a data query tool;
and preprocessing the first-type operation data and the second-type operation data to obtain the multi-item target operation record.
5. The method of claim 4, wherein preprocessing the first-type operation data and the second-type operation data to obtain the multi-item target operation record comprises:
traversing the first type of operation data and the second type of operation data, and screening out a reference operation record containing a reference operation object table from the first type of operation data and the second type of operation data, wherein a database to which the reference operation object table belongs is the target database;
searching and deleting redundant characters in the reference operation record, wherein the redundant characters comprise a line feed character and an annotation character;
and determining the processed reference operation record as the target operation record.
6. The method of claim 4, wherein counting the cumulative number of operations and the historical operation time of the plurality of target operation object tables included in the multi-item target operation record comprises:
the following operations are executed in a loop until the multi-item target operation record has been traversed:
acquiring one target operation record as the current operation record;
acquiring a current operation object table corresponding to the historical data operation in the current operation record under the condition that the operation type of the historical data operation included in the current operation record is a target operation type; under the condition that searching the current operation object table in the statistics list is successful, updating the accumulated operation times and the historical operation time of the current operation object table in the statistics list; under the condition that searching the current operation object table in the statistical list fails, newly adding the current operation object table, corresponding accumulated operation times and historical operation time in the statistical list;
and acquiring the next target operation record in the case that the operation type of the historical data operation included in the current operation record is not the target operation type.
7. A data table management apparatus comprising:
the first determining unit is used for determining a multi-item target operation record to be managed from an operation data set, wherein the operation data set comprises a plurality of operation records, the operation records comprise historical data operations executed on an operation object table, and the database to which a target operation object table included in the multi-item target operation record belongs is a target database;
a statistics unit, configured to count table parameter information corresponding to a target operation object table included in the multi-entry target operation record, where the table parameter information includes a cumulative operation number and a historical operation time, the cumulative operation number indicates an operation number of times the corresponding target operation object table is executed with the historical data operation, and the historical operation time indicates an operation time of the corresponding target operation object table executed with the historical data operation last time;
and the second determining unit is used for determining a target operation object table, of which the table parameter information meets the screening condition, as a redundant object table and deleting the target operation record corresponding to the redundant object table from the operation data set.
8. The apparatus according to claim 7, wherein the second determining unit includes:
a sorting module, configured to sort the plurality of target operation object tables included in the plurality of target operation records in order of historical operation time, so as to determine a target table sequence;
and a determining module, configured to determine, as the redundant object table, a target operation object table in the target table sequence that is located at a target order and whose corresponding cumulative number of operations is less than or equal to a target threshold.
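The sorting and screening performed by the two modules of claim 8 can be sketched as below. The claim does not say what the "target order" is, so this sketch assumes it means the first `oldest_n` positions of a sequence sorted from least to most recently operated on. The function name `find_redundant_tables`, its parameters, and the `(count, last_time)` tuple layout are hypothetical.

```python
from datetime import datetime


def find_redundant_tables(stats, oldest_n, max_count):
    """Return tables in the oldest `oldest_n` positions with count <= max_count.

    `stats` maps a table name to a (cumulative_count, last_operation_time) tuple.
    """
    # Sorting module: order tables by historical operation time, oldest first.
    ordered = sorted(stats.items(), key=lambda item: item[1][1])
    # Assumed meaning of "target order": the first `oldest_n` positions.
    candidates = ordered[:oldest_n]
    # Determining module: keep only tables at or below the target threshold.
    return [name for name, (count, _) in candidates if count <= max_count]


stats = {
    "a": (1, datetime(2020, 1, 1)),
    "b": (50, datetime(2020, 6, 1)),
    "c": (2, datetime(2022, 1, 1)),
}
print(find_redundant_tables(stats, oldest_n=2, max_count=5))  # ['a']
```

Combining both conditions means a table that has not been touched for a long time but was heavily used in the past (such as `b` above) is not flagged as redundant.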
9. A computer-readable storage medium, characterized in that the storage medium has stored therein a computer program, wherein the computer program is configured to perform, when run, the method of any one of claims 1 to 6.
10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, and the processor is configured to run the computer program to perform the method of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211732116.5A CN116010340A (en) | 2022-12-30 | 2022-12-30 | Data table management method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211732116.5A CN116010340A (en) | 2022-12-30 | 2022-12-30 | Data table management method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116010340A (en) | 2023-04-25 |
Family
ID=86035125
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211732116.5A Pending CN116010340A (en) | 2022-12-30 | 2022-12-30 | Data table management method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116010340A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117573357A (en) * | 2023-11-27 | 2024-02-20 | 北京宝联之星科技股份有限公司 | Cloud edge collaborative caching method, system and medium based on perceptual redundancy |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN111459985B (en) | Identification information processing method and device | |
US10977256B2 (en) | System for aggregation and prioritization of IT asset field values from real-time event logs and method thereof | |
CN102117303A (en) | Patent data analysis method and system | |
US10929370B2 (en) | Index maintenance management of a relational database management system | |
CN109271435A (en) | A kind of data pick-up method and system for supporting breakpoint transmission | |
CN111400288A (en) | Data quality inspection method and system | |
CN108846121A (en) | A kind of data search method and device | |
CN111506569A (en) | Data storage method and device and electronic device | |
CN112685370B (en) | Log collection method, device, equipment and medium | |
CN111913860A (en) | Operation behavior analysis method and device | |
CN116010340A (en) | Data table management method and device | |
CN109885642A (en) | Classification storage method and device towards full-text search | |
JP6642435B2 (en) | Data processing device, data processing method, and program | |
CN106919566A (en) | A kind of query statistic method and system based on mass data | |
CN105930504B (en) | A kind of network management timing file cocurrent processing system and concurrent processing method | |
CN110716938A (en) | Data aggregation method and device, storage medium and electronic device | |
CN112596851A (en) | Multi-source heterogeneous data batch extraction method and analysis method of simulation platform | |
CN113434492B (en) | Data detection method and device, storage medium and electronic device | |
CN110321358A (en) | User data editing method and device | |
CN114036104A (en) | Cloud filing method, device and system for re-deleted data based on distributed storage | |
EP3436988B1 (en) | Methods and systems for database optimisation | |
CN104951869A (en) | Workflow-based public opinion monitoring method and workflow-based public opinion monitoring device | |
CN118260366B (en) | Data life cycle analysis method based on audit log device and product | |
CN117725074A (en) | Database updating method and device for storage file, storage medium and electronic equipment | |
CN111931502B (en) | Word segmentation processing method and system and word segmentation searching method | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||