CN116049102A

CN116049102A - Data processing method, device, electronic equipment and storage medium

Info

Publication number: CN116049102A
Application number: CN202310157142.8A
Authority: CN
Inventors: 林鹏程
Original assignee: Hangzhou Dt Dream Technology Co Ltd
Current assignee: Hangzhou Dt Dream Technology Co Ltd
Priority date: 2023-02-20
Filing date: 2023-02-20
Publication date: 2023-05-02

Abstract

The application provides a data processing method, a data processing device, and electronic equipment, and relates to the technical field of data processing. The method comprises: acquiring a plurality of file name sets; acquiring a plurality of data sets; for any file name set, determining, from the plurality of data sets and according to each file name in that file name set, a target data set similar to that file name set; and, for any data field in the target data set, when a file name matching that data field exists in the file name set, generating associated blood-edge information according to the table name of the data table in which the target data set is located, the data field, and the file name. In this way, associated blood-edge information is obtained by calculating the similarity between the file names of clustered unstructured files in a data lake and the data fields in the same column or the same row of a structured data table, so that mapping association information between the structured data table and the unstructured files can be effectively established.

Description

Data processing method, device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of data processing technologies, and in particular, to a data processing method, a data processing device, an electronic device, and a storage medium.

Background

A data lake is a repository or system that stores data in its raw format, as is, without requiring the data to be structured in advance. In data lake business scenarios, structured data (e.g., a data table) often needs to be associated with unstructured data (e.g., files). For example, personal information is recorded in a data table and needs to be associated with unstructured data (e.g., identification card photographs, vehicle photographs, or other text files), while the unstructured data needs to be stored in object storage.

How to establish associated blood-edge information (i.e., mapping association information) between structured data (e.g., a data table) and unstructured data (e.g., files) is therefore an important problem.

Disclosure of Invention

The object of the present application is to solve at least to some extent one of the above technical problems.

To this end, the application provides a data processing method, a device, electronic equipment, and a storage medium, in which associated blood-edge information is obtained by calculating the similarity between the file names of clustered unstructured files in a data lake and the data fields in the same column or the same row of a structured data table. Mapping association information between the structured data table and the unstructured files can thus be effectively established, which improves the effectiveness of querying or accessing unstructured files.

An embodiment of a first aspect of the present application provides a data processing method, including:

acquiring a plurality of first file name sets, wherein the first file name sets comprise file names of files in the same cluster, and the cluster is obtained by clustering the files in at least one storage bucket in a data lake;

acquiring a plurality of first data sets, wherein the first data sets comprise data fields in the same column or the same row in the same data table in the data lake;

for any first file name set, determining a target data set similar to that first file name set from the plurality of first data sets according to each file name in that first file name set;

and, for any first data field in the target data set, when a first file name matching that first data field exists in that first file name set, generating first association blood-edge information according to the table name of the data table in which the target data set is located, the first data field, and the first file name.

An embodiment of a second aspect of the present application proposes a data processing apparatus, including:

a first acquisition module, configured to acquire a plurality of first file name sets, wherein each first file name set comprises the file names of files in the same cluster, and the cluster is obtained by clustering the files in at least one storage bucket in a data lake;

a second acquisition module, configured to acquire a plurality of first data sets, wherein each first data set comprises data fields in the same column or the same row of the same data table in the data lake;

a determining module, configured to determine, for any first file name set, a target data set similar to that first file name set from the plurality of first data sets according to each file name in that first file name set;

a generation module, configured to generate, for any first data field in the target data set, first association blood-edge information according to the table name of the data table in which the target data set is located, the first data field, and the first file name when a first file name matching that first data field exists in that first file name set.

An embodiment of a third aspect of the present application provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the data processing method according to the first aspect when executing the program.

An embodiment of a fourth aspect of the present application proposes a non-transitory computer readable storage medium, on which a computer program is stored, which program, when being executed by a processor, implements a data processing method according to the first aspect.

An embodiment of a fifth aspect of the present application proposes a computer program product comprising a computer program which, when executed by a processor, implements a data processing method according to the first aspect of the present application.

The technical scheme provided by the embodiment of the application at least brings the following beneficial effects:

A plurality of first file name sets and a plurality of first data sets are acquired; for any first file name set, a target data set similar to that first file name set is determined from the plurality of first data sets according to each file name in that set; and, for any first data field in the target data set, when a first file name matching that first data field exists in that first file name set, first association blood-edge information is generated according to the table name of the data table in which the target data set is located, the first data field, and the first file name. In this way, associated blood-edge information is obtained by calculating the similarity between the file names of clustered unstructured files in a data lake and the data fields in the same column or the same row of a structured data table, so that mapping association information between the structured data table and the unstructured files can be effectively established and the effectiveness of querying or accessing unstructured files is improved.

Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.

Drawings

The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a schematic flow chart of a data processing method according to an embodiment of the present application;

FIG. 2 is a flowchart of another data processing method according to an embodiment of the present disclosure;

FIG. 3 is a flowchart illustrating another data processing method according to an embodiment of the present disclosure;

FIG. 4 is a flowchart illustrating another data processing method according to an embodiment of the present disclosure;

FIG. 5 is a flowchart illustrating another data processing method according to an embodiment of the present disclosure;

FIG. 6 is a schematic diagram of a data processing apparatus according to one embodiment of the present application;

FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary and intended for the purpose of explaining the present application and are not to be construed as limiting the present application.

When the names of files stored in object storage are recorded in a data table, a file in object storage can be accessed and retrieved by looking up its file name in the data table.

Because object storage cannot be operated on with SQL (Structured Query Language) statements, and because file names are recorded in the data table by an application program, these records are not visible at the data layer. As a result, existing data blood-edge techniques cannot effectively establish mapping association information (i.e., associated blood-edge information) between a data table and unstructured data (such as files).

In view of the above problems, embodiments of the present application provide a data processing method, apparatus, and electronic device. Before the embodiments of the present application are described in detail, the relevant technical terms are first introduced for ease of understanding:

Object storage, also known as object-based storage, is a generic term for a method of addressing and processing discrete units called objects. Like a file, an object contains data; unlike a file, an object is not organized in a hierarchical structure. Every object sits at the same level of a flat address space called a storage pool, and no object is nested under another object.

Both files and objects have metadata associated with the data they contain, but objects are characterized by extended metadata. Each object is assigned a unique identifier that allows a server or end user to retrieve the object without having to know the physical address of the data.

A folder in object storage is only a logical concept. When an object is created through an API (Application Programming Interface) or SDK (Software Development Kit), a key (such as abc/1.jpg) can be specified for the object, which logically produces the effect of a folder. For example, defining the key of an object as abc/1.jpg makes a folder abc appear under the bucket, with a file 1.jpg under that folder.

A folder in the object store is in fact an empty object of size 0 KB, so creating an object whose key is 1/ defines folder 1. Conversely, if the user creates the file abc/1.jpg, the system does not create an object abc/ for the folder, so once abc/1.jpg is deleted the folder abc no longer exists.

It should be noted that, because object storage is distributed, objects are not physically stored by folder; that is, the files under one folder are not necessarily stored together. At the storage back end, files under different folders differ only in their key prefix, so under this architecture summary information about a folder, such as its size or access frequency, cannot be computed conveniently. To traverse all files under a folder, the keys of all files under that folder must first be obtained through the ListObject interface (with the folder specified by a prefix) before any operation is performed. To access a file in the object store, the object is manipulated through a REST interface, which is expressed with HTTP (Hyper Text Transfer Protocol) verbs (GET, POST, PUT, DELETE, etc.).
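The flat key space described above can be illustrated with a minimal sketch. It assumes an in-memory dictionary as a stand-in for one bucket; the names `bucket`, `list_objects`, and `folder_exists` are hypothetical and are not part of any real object storage SDK.

```python
from typing import Dict, List

# Hypothetical in-memory stand-in for one bucket: a flat mapping of key -> content.
bucket: Dict[str, bytes] = {
    "abc/1.jpg": b"image-1",
    "abc/2.jpg": b"image-2",
    "abd/3.jpg": b"image-3",
    "readme.txt": b"text",
}


def list_objects(store: Dict[str, bytes], prefix: str) -> List[str]:
    """Return, in sorted order, every key that starts with the given prefix."""
    return sorted(key for key in store if key.startswith(prefix))


def folder_exists(store: Dict[str, bytes], folder: str) -> bool:
    """A 'folder' exists only while at least one key carries its prefix."""
    prefix = folder.rstrip("/") + "/"
    return any(key.startswith(prefix) for key in store)


# Traversing a "folder" is a prefix scan over the flat key space.
print(list_objects(bucket, "abc/"))
```

In this sketch, removing every key that begins with `abc/` makes `folder_exists(store, "abc")` return `False`, which mirrors the behavior described for deleting abc/1.jpg.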

The data processing method provided in the present application is described in detail below with reference to fig. 1.

Fig. 1 is a flow chart of a data processing method according to an embodiment of the present application.

The data processing method of the embodiment of the application may be executed by the data processing apparatus provided by the embodiment of the application. The data processing device can be applied to the electronic equipment to execute the data processing function. Alternatively, the data processing apparatus may be configured in an application of the electronic device so that the application may perform the data processing function.

The electronic device may be any device with computing capabilities, which device or an application in the device is capable of performing data processing functions. The device with computing capability may be, for example, a personal computer (Personal Computer, abbreviated as PC), a mobile terminal, a server, etc., and the mobile terminal may be, for example, a vehicle-mounted device, a mobile phone, a tablet computer, a personal digital assistant, a wearable device, etc., and may be a hardware device with various operating systems, a touch screen, and/or a display screen.

As shown in fig. 1, the data processing method includes the steps of:

Step S101, a plurality of first file name sets are acquired, wherein each first file name set comprises the file names of files in the same cluster, and the cluster is obtained by clustering the files in at least one storage bucket in a data lake.

In embodiments of the present application, files may include, but are not limited to: picture files, document files, PDF (Portable Document Format) files, audio files, video files, etc.

In the embodiment of the application, the files in at least one storage bucket in the data lake can be clustered to obtain a plurality of clusters. As an example, for each bucket in a data lake, the files in that bucket may be clustered individually to obtain at least one cluster. Thus, in the present application, a first set of file names may be generated according to the file names of the files in the same cluster.

In step S102, a plurality of first data sets are acquired, where the first data sets include data fields in the same column or row in the same data table in the data lake.

In this embodiment of the present application, if attribute values corresponding to attribute fields (name, age, data type, file name, etc.) are stored in rows in the data table, the data fields in at least one data table in the data lake may be divided into rows to obtain a plurality of first data sets, where each first data set includes each data field in a same row in one data table.

In this embodiment of the present application, if the attribute values corresponding to the attribute fields are stored in columns in the data table, the data fields in at least one data table in the data lake may be divided in columns to obtain a plurality of first data sets, where each first data set includes each data field in the same column in one data table.
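The row-wise and column-wise division described above can be sketched as follows. This is only an illustration under the assumption that a data table is already available in memory as a header plus a list of rows; the function names and the sample table are hypothetical.

```python
from typing import Dict, List, Sequence


def split_by_column(header: Sequence[str],
                    rows: Sequence[Sequence[str]]) -> Dict[str, List[str]]:
    """Build one data set per column, keyed by the column name."""
    return {name: [row[i] for row in rows] for i, name in enumerate(header)}


def split_by_row(rows: Sequence[Sequence[str]]) -> List[List[str]]:
    """Build one data set per row."""
    return [list(row) for row in rows]


# Hypothetical sample table in which attribute values are stored in columns.
header = ["name", "age", "photo"]
rows = [
    ["person_a", "31", "id_001.jpg"],
    ["person_b", "45", "id_002.jpg"],
]

column_sets = split_by_column(header, rows)
row_sets = split_by_row(rows)
```

With this layout, the `photo` column becomes a single first data set whose fields can later be compared with file names.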

Step S103, for any first file name set, determining a target data set similar to that first file name set from the plurality of first data sets according to each file name in that first file name set.

In the embodiment of the present application, for any one of the plurality of first file name sets, a first data set similar to that first file name set may be determined from the plurality of first data sets according to each file name in that first file name set, and this first data set is used as the target data set.

Step S104, for any first data field in the target data set, when a first file name matching that first data field exists in that first file name set, generating first association blood-edge information according to the table name of the data table in which the target data set is located, the first data field, and the first file name.

In this embodiment of the present application, for any data field in the target data set (denoted herein as a first data field), it may be determined whether that first file name set contains a first file name that matches, i.e., is identical to, the first data field. If such a first file name exists, associated blood-edge information (i.e., mapping association information) may be generated according to the first file name, the table name of the data table in which the target data set is located, and the first data field.

As an example, for a certain first data field in the target data set, the first data field may be used as a keyword, and any first filename set may be searched to determine whether a first filename matched or consistent with the first data field exists in any first filename set, and if a first filename matched or consistent with the first data field exists in any first filename set, a table name of a data table in which the target data set is located, the first data field, and a first filename matched with the first data field are recorded, and associated blood-edge information is formed.
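The keyword lookup described above can be sketched as follows, assuming exact string equality as the matching rule. The `LineageRecord` type and the `build_lineage` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class LineageRecord:
    """One piece of associated blood-edge information (hypothetical layout)."""
    table_name: str
    data_field: str
    file_name: str


def build_lineage(table_name: str,
                  target_data_set: Iterable[str],
                  file_names: Iterable[str]) -> List[LineageRecord]:
    """Record every data field that exactly equals a file name in the set."""
    lookup = set(file_names)  # constant-time membership test per data field
    return [
        LineageRecord(table_name, field, field)
        for field in target_data_set
        if field in lookup
    ]


records = build_lineage(
    "person",
    ["id_001.jpg", "id_009.jpg"],
    ["id_001.jpg", "id_002.jpg"],
)
```

Because the sketch uses exact matching, the data field and the file name in each record are the same string; a looser matching rule would need its own definition.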

As a possible implementation, the associated blood-edge information may also be stored in a blood-edge information database.

As another possible implementation manner, the associated blood-edge information may also be updated into the file metadata of the first file, that is, the associated blood-edge information may be added to the file metadata of the first file.

Therefore, an association graph of the file can be formed from the associated blood-edge information, so that associated files can be found based on the association graph; that is, the data table associated with a file is recorded in the file metadata, and similar files associated with that data table can be found through this record.

According to the data processing method, a plurality of first file name sets and a plurality of first data sets are acquired; for any first file name set, a target data set similar to that first file name set is determined from the plurality of first data sets according to each file name in that set; and, for any first data field in the target data set, when a first file name matching that first data field exists in that first file name set, first association blood-edge information is generated according to the table name of the data table in which the target data set is located, the first data field, and the first file name. In this way, associated blood-edge information is obtained by calculating the similarity between the file names of clustered unstructured files in a data lake and the data fields in the same column or the same row of a structured data table, so that mapping association information between the structured data table and the unstructured files can be effectively established and the effectiveness of querying or accessing unstructured files is improved.

In order to clearly illustrate how the target data set similar to the first file name set is determined from the plurality of first data sets in the above embodiment of the present application, a data processing method is also proposed.

Fig. 2 is a flow chart of another data processing method according to an embodiment of the present application.

As shown in fig. 2, the data processing method may include the steps of:

Step S201, a plurality of first file name sets are obtained, wherein the first file name sets comprise file names of files in the same cluster, and the cluster is obtained by clustering the files in at least one storage bucket in a data lake.

In step S202, a plurality of first data sets are acquired, where the first data sets include data fields in the same column or row in the same data table in the data lake.

The explanation of steps S201 to S202 may be referred to the related description in any embodiment of the present application, and will not be repeated here.

Step S203, for any first filename set, determining a first similarity between the any first filename set and each first data set according to each filename in the any first filename set.

In this embodiment of the present application, for any one of a plurality of first filename sets, the similarity between the any one of the first filename sets and each of the first data sets may be calculated according to each filename in the any one of the first filename sets based on a similarity calculation algorithm, which is herein denoted as first similarity.

In one possible implementation manner of the embodiment of the present application, in order to improve accuracy of the similarity calculation result, a feature vector of the arbitrary first filename set may be obtained based on a deep learning technology, and feature vectors of each first data set may be obtained, so that the first similarity between the arbitrary first filename set and each first data set may be determined based on the similarity between the feature vectors.

As an example, feature extraction may be performed on each file name in the first file name set to obtain a first text feature of that first file name set, and on each data field in each first data set to obtain a second text feature of each first data set. The similarity between the first text feature and each second text feature, referred to in this application as a second similarity, is then calculated, so that the first similarity between the first file name set and each first data set may be determined according to the second similarity of each second text feature.

For example, for any second text feature, the second similarity of the second text feature may be used as the first similarity between the first data set corresponding to the second text feature and any of the first filename sets.
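The feature-based similarity described above can be sketched as follows. The text mentions deep learning features; this sketch swaps in a much simpler stand-in, namely character trigram counts compared with cosine similarity, purely to show the data flow. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable


def text_feature(strings: Iterable[str], n: int = 3) -> Counter:
    """Aggregate character n-gram counts over all strings in a set."""
    feature: Counter = Counter()
    for s in strings:
        padded = f" {s.lower()} "  # pad so word boundaries form n-grams too
        for i in range(len(padded) - n + 1):
            feature[padded[i:i + n]] += 1
    return feature


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# First text feature: one file name set. Second text features: one per data set.
first_feature = text_feature(["id_001.jpg", "id_002.jpg"])
photo_similarity = cosine_similarity(first_feature,
                                     text_feature(["id_001.jpg", "id_003.jpg"]))
age_similarity = cosine_similarity(first_feature, text_feature(["31", "45"]))
```

In this sketch each second similarity is used directly as the first similarity of the corresponding data set, so a column of photo file names scores higher than a column of ages.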

It should be noted that the number of data fields in the same row or the same column of a data table, that is, the number of data fields in a first data set, may be large. If all data fields in the first data set take part in the similarity calculation, more computing resources are consumed and computing efficiency is lower. In that case, in a possible implementation of the embodiment of the present application, the data fields in the first data set may be sampled, and the similarity between the first file name set and the first data set is calculated only from the sampled first data set.

As an example, for any first data set, the data fields in that first data set may be sampled to obtain a sampled first data set that includes a set number (e.g., 10,000 or 5,000) of data fields, so that the first similarity between that first data set and any first file name set may be calculated according to the data fields in the sampled first data set and each file name in the first file name set.

For example, feature extraction may be performed on each data field in the sampled first data set to obtain the second text feature of that first data set, so that the first similarity between any first file name set and each first data set may be determined according to the second similarity between the first text feature and each second text feature.

Therefore, each first data set is sampled, and similarity calculation is carried out according to the sampled first data sets, so that the calculation efficiency can be improved, and the calculation resources can be saved.
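The sampling step can be sketched as follows. Uniform random sampling without replacement and a fixed seed are assumptions made for this illustration; the text does not specify the sampling strategy, and `sample_data_set` is a hypothetical name.

```python
import random
from typing import List, Sequence


def sample_data_set(data_set: Sequence[str], limit: int, seed: int = 0) -> List[str]:
    """Keep at most `limit` data fields, chosen uniformly without replacement."""
    if len(data_set) <= limit:
        return list(data_set)  # small sets are used in full
    return random.Random(seed).sample(list(data_set), limit)


# A hypothetical column with 50,000 fields reduced to a set number of 5,000.
large_column = [f"id_{i:06d}.jpg" for i in range(50_000)]
sampled = sample_data_set(large_column, limit=5_000)
```

The sampled list can then be passed to the similarity calculation in place of the full data set.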

Step S204, determining candidate data sets from the first data sets according to the first similarity of the first data sets, wherein the first similarity of the candidate data sets is higher than a set similarity threshold.

In this embodiment of the present application, for any one first data set, it may be determined whether the first similarity of the first data set is greater than a set similarity threshold, if the first similarity of the first data set is greater than the set similarity threshold, the first data set may be used as a candidate data set, and if the first similarity of the first data set is less than or equal to the similarity threshold, no processing may be required.
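The threshold check of step S204 can be sketched as follows. The data set identifiers, the scores, and the threshold value of 0.5 are hypothetical values chosen for illustration.

```python
from typing import Dict, List


def select_candidates(similarities: Dict[str, float], threshold: float) -> List[str]:
    """Keep data sets whose first similarity is strictly above the threshold."""
    return [name for name, score in similarities.items() if score > threshold]


# Hypothetical first similarities between one file name set and three data sets.
first_similarities = {
    "person.photo": 0.82,
    "person.name": 0.10,
    "vehicle.image": 0.50,
}
candidates = select_candidates(first_similarities, threshold=0.5)
```

A data set whose similarity equals the threshold is not selected, which matches the "less than or equal to" branch described above.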

In step S205, a target data set is determined from the candidate data sets.

In the embodiment of the application, a target data set similar to any of the first file name sets can be determined from each candidate data set.

In one possible implementation manner of the embodiment of the present application, the determining manner of the target data set may be, for example: for any candidate data set, determining the hit number of each file name in any first file name set in any candidate data set, so that in the application, a target data set similar to any first file name set can be determined from each candidate data set according to the hit number of each candidate data set. For example, the candidate data set with the largest hit number may be used as the target data set similar to any of the first filename sets.

As an example, for any candidate data set, the first file name set and the candidate data set may be stored in an analytical database (such as ES (Elasticsearch, a non-relational distributed full-text retrieval framework)). Using each file name in the first file name set as a keyword, the number of file names in the first file name set that hit the candidate data set is determined by means of an inverted index, and the candidate data set with the largest hit number is used as the target data set.

For example, if the first file name set includes 50 file names, 30 data fields in candidate data set 1 hit 30 of those file names, and 20 data fields in candidate data set 2 hit 20 of those file names, candidate data set 1 may be used as the target data set.
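The hit counting in the example above can be sketched with a small in-memory inverted index. This is a stand-in for an analytical database and does not use any database API; the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple


def build_inverted_index(candidates: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Map each data field value to the candidate data sets that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for set_id, fields in candidates.items():
        for field in fields:
            index[field].add(set_id)
    return index


def pick_target(file_names: Iterable[str],
                candidates: Dict[str, Iterable[str]]
                ) -> Tuple[Optional[str], Dict[str, int]]:
    """Count file-name hits per candidate and return the one with the most hits."""
    index = build_inverted_index(candidates)
    hits = {set_id: 0 for set_id in candidates}
    for name in set(file_names):  # each file name is used once as a keyword
        for set_id in index.get(name, ()):
            hits[set_id] += 1
    if not hits:
        return None, hits
    return max(hits, key=hits.get), hits


# 50 file names; candidate 1 contains 30 of them and candidate 2 contains 20.
names = [f"f{i}.jpg" for i in range(50)]
candidate_sets = {
    "candidate_1": names[:30] + ["unrelated"],
    "candidate_2": names[30:],
}
target, hit_counts = pick_target(names, candidate_sets)
```

If two candidates tie, `max` returns the one that appears first in the dictionary; a real system would need an explicit tie-breaking rule.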

In another possible implementation manner of the embodiment of the present application, the determining manner of the target data set may be, for example: and taking the candidate data set with the maximum first similarity as a target data set similar to any one of the first file name sets.

Therefore, the target data set similar to any one of the first file name sets can be determined in different modes, and the flexibility and applicability of the method can be improved.

Step S206, for any first data field in the target data set, when a first file name matching that first data field exists in the first file name set, generating first association blood-edge information according to the table name of the data table in which the target data set is located, the first data field, and the first file name.

The explanation of step S206 may be referred to the related description in any embodiment of the present application, and will not be repeated here.

According to the data processing method, according to the similarity between the first file name set and each first data set, the target data set similar to the first file name set is determined from each first data set, and accuracy of a target data set determining result can be improved.

In order to clearly illustrate how to obtain the plurality of first filename sets in any embodiment of the present application, a data processing method is further provided.

Fig. 3 is a flow chart of another data processing method according to an embodiment of the present application.

As shown in fig. 3, the data processing method may include the steps of:

Step S301, for any storage bucket in the data lake, acquiring file information of each file in that storage bucket.

The file information includes at least one of: file name, file type, and file metadata. The file metadata may include, but is not limited to: creation time, creator, file size, number of hard links to the file, access time, modification time of the file metadata, modification time of the file, etc.

In this embodiment of the present application, for any bucket in a data lake, file information of each file in the any bucket may be obtained.

Step S302, clustering the files in any storage bucket according to the file information of the files in any storage bucket to obtain at least one cluster.

In this embodiment of the present application, based on a clustering algorithm, each file in any one of the buckets may be clustered according to the file information of each file in the any one of the buckets, to obtain at least one cluster.

The clustering algorithms include, but are not limited to, the K-means clustering algorithm, the KNN (K-nearest neighbor) classification algorithm, and the like.

As an example, files in any bucket may be clustered according to their file names such that files with similar file names are clustered into the same cluster.

It should be noted that, the file names of the files in the data lake may be named according to a specified naming rule, or may not be named according to a specified naming rule, for example, may be named according to actual application requirements or service requirements, which is not limited in this application.

As another example, files in any bucket may be clustered according to file types of files in the bucket such that files of the same file type are clustered into the same cluster.

As yet another example, files in any bucket may be clustered according to file metadata of the files in the bucket such that files with similar file metadata are clustered into the same cluster.

Step S303, a first file name set is generated according to the file names of the files in the same cluster.

In the embodiment of the present application, the file name of each file in the same cluster may be used as a first file name set.
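Steps S302 and S303 can be sketched as follows. The text names K-means and similar algorithms; this sketch instead uses a simple deterministic grouping by file type and by the "shape" of the file name (digit runs replaced with a placeholder), which is an assumption made only to keep the example short. The function names are hypothetical.

```python
import os
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def name_shape(file_name: str) -> Tuple[str, str]:
    """Return (name pattern with digit runs replaced by '#', lowercased extension)."""
    stem, ext = os.path.splitext(file_name)
    return re.sub(r"\d+", "#", stem.lower()), ext.lower()


def cluster_file_names(file_names: Iterable[str]) -> List[List[str]]:
    """Group file names that share the same pattern and file type."""
    clusters: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for name in file_names:
        clusters[name_shape(name)].append(name)
    return list(clusters.values())


# Each resulting group plays the role of one first file name set.
first_file_name_sets = cluster_file_names(
    ["id_001.jpg", "id_002.jpg", "car_17.png", "notes.txt"]
)
```

Grouping on an exact key is far coarser than a learned clustering, but it shows how file names in the same cluster end up in the same first file name set.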

In step S304, a plurality of first data sets are acquired, where the first data sets include data fields in the same column or row in the same data table in the data lake.

Step S305, for any first file name set, determining a target data set similar to that first file name set from the plurality of first data sets according to each file name in that first file name set.

Step S306, for any first data field in the target data set, when a first file name matching that first data field exists in that first file name set, generating first association blood-edge information according to the table name of the data table in which the target data set is located, the first data field, and the first file name.

The explanation of steps S304 to S306 may be referred to the related description in any embodiment of the present application, and will not be repeated here.

According to the data processing method, the files in the storage barrel can be clustered according to the file names, the file types or the file metadata of the files in the storage barrel, so that a plurality of first file name sets are obtained, namely, the files in the storage barrel can be clustered according to different file information, and the flexibility and the applicability of the method can be improved.

In order to clearly illustrate how the plurality of first data sets are acquired in any embodiment of the present application, the present application further provides a data processing method.

Fig. 4 is a flow chart of another data processing method according to an embodiment of the present application.

As shown in fig. 4, the data processing method may include the steps of:

Step S401, a plurality of first file name sets are obtained, wherein each first file name set comprises the file names of the files in the same cluster, and the cluster is obtained by clustering the files in at least one storage bucket in a data lake.

For an explanation of step S401, refer to the related description in any embodiment of the present application; details are not repeated here.

In step S402, a plurality of initial data sets are acquired, where each initial data set includes all data fields in a same column or a same row in a same data table in a data lake.

In this embodiment of the present application, if attribute values corresponding to attribute fields (name, age, data type, file name, etc.) are stored in rows in the data table, the data fields in at least one data table in the data lake may be divided into rows to obtain a plurality of initial data sets, where each initial data set includes all the data fields in the same row in the same data table.

In this embodiment of the present application, if the attribute values corresponding to the attribute fields are stored in columns in the data table, the data fields in at least one data table in the data lake may be divided in columns to obtain a plurality of initial data sets, where each initial data set includes all the data fields in the same column in the same data table.
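The row-wise and column-wise division described above can be sketched as follows. The function name `split_into_initial_data_sets`, the in-memory list-of-rows representation of a data table, and the sample `person` table are assumptions for illustration; a real data lake table would be read through its own query interface.

```python
def split_into_initial_data_sets(table_name, rows, by="column"):
    """Return {(table_name, index): [data fields]} with one entry per
    column (by="column") or per row (by="row") of the table."""
    if by == "column":
        groups = zip(*rows)  # transpose: one tuple per column
    else:
        groups = rows
    return {(table_name, i): list(group) for i, group in enumerate(groups)}


# Hypothetical data table whose attribute values are stored in columns.
person = [
    ["Alice", 30, "alice_id.jpg"],
    ["Bob", 41, "bob_id.jpg"],
]

by_column = split_into_initial_data_sets("person", person, by="column")
by_row = split_into_initial_data_sets("person", person, by="row")
```

Keeping the table name in each key means the table name is still available later, when associated blood-edge information is generated for a matching data set.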

Step S403, for any initial data set, determining a retention score of the any initial data set according to the repetition rate of each data field in the any initial data set.

The retention score is negatively correlated with the repetition rate: the lower the repetition rate, the higher the retention score, and conversely, the higher the repetition rate, the lower the retention score.

In this embodiment of the present application, for any one of the plurality of initial data sets, the repetition rate of the data fields in that initial data set may be determined. For example, if the initial data set includes 50 data fields and 40 of them are repeated, the repetition rate of the initial data set may be determined to be 0.8. The retention score of the initial data set may then be determined according to this repetition rate.

Step S404, determining each first data set from each initial data set according to the retention score of each initial data set.

In embodiments of the present application, each first data set may be determined from the initial data sets according to the retention score of each initial data set.

As an example, an initial data set with a retention score above a set score threshold may be used as a first data set.

As another example, the initial data sets may be sorted in descending order of retention score, and a set number (e.g., 2, 3, 4, etc.) of top-ranked initial data sets may be selected as the first data sets.

For example, highly repetitive data fields in a data table, such as enumerated values, Boolean values, and dates, cannot serve as data fields that match file names. Initial data sets containing such data fields may therefore be filtered out, so that only initial data sets whose data fields could be file names are retained.
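Steps S403 and S404 can be illustrated with the short sketch below. The application only requires that the retention score decrease as the repetition rate increases; the concrete formula `1 - repetition_rate`, the threshold 0.5, and the sample columns are assumptions chosen for illustration. The repetition rate is computed here as the fraction of data fields whose value occurs more than once, which reproduces the 40-of-50 example above.

```python
from collections import Counter


def repetition_rate(fields):
    """Fraction of data fields whose value occurs more than once in the set."""
    if not fields:
        return 0.0
    counts = Counter(fields)
    repeated = sum(1 for value in fields if counts[value] > 1)
    return repeated / len(fields)


def retention_score(fields):
    # Assumption: any decreasing function of the repetition rate would do.
    return 1.0 - repetition_rate(fields)


def select_first_data_sets(initial_sets, score_threshold=0.5):
    """Keep the initial data sets whose retention score exceeds the threshold."""
    return {key: fields for key, fields in initial_sets.items()
            if retention_score(fields) > score_threshold}


# Hypothetical columns: 40 of the 50 "gender" fields are repeated values,
# while every "photo" field is distinct.
initial_sets = {
    ("person", "gender"): ["M", "F"] * 20 + [f"other{i}" for i in range(10)],
    ("person", "photo"): [f"id_{i:04d}.jpg" for i in range(50)],
}
first_sets = select_first_data_sets(initial_sets)
```

Only the `photo` column survives the filter, since an enumerated column such as `gender` is unlikely to hold file names.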

Step S405, for any first set of file names, determining a target data set similar to any first set of file names from the plurality of first data sets according to each file name in the any first set of file names.

Step S406, for any first data field in the target data set, when there is a match between the first file name in the any first file name set and the any first data field, generating first associated blood-edge information according to the table name of the data table in which the target data set is located, the any first data field and the first file name.

For an explanation of steps S405 to S406, refer to the related description in any embodiment of the present application; details are not repeated here.

According to the data processing method of the embodiment, the initial data sets can be screened so that initial data sets containing data fields that cannot be file names are filtered out. This reduces the amount of calculation on the one hand and improves the accuracy of the established associated blood-edge information on the other.

In order to clearly illustrate any embodiment of the present application, the present application further provides a data processing method.

Fig. 5 is a flow chart of another data processing method according to an embodiment of the present application.

As shown in fig. 5, on the basis of the embodiment shown in any one of fig. 1 to 4, the data processing method may further include the steps of:

Step S501, the any first data field in the data table where the target data set is located is marked.

In this embodiment of the present application, after each first file name set has been compared with the target data set similar to it, and the matching file names have been associated with data fields and table names, the file names and data fields that remain unassociated may be extracted. That is, the file names and data fields that have already been associated are excluded, leaving the file names that match no data field and the data fields that match no file name.

As an example, a first data field that has an associated file name in the data table where the target data set is located may be marked. The mark indicates that the first data field has an associated file name.

Step S502, marking the file corresponding to the first file name in at least one storage bucket.

Likewise, the file corresponding to the first file name in at least one storage bucket in the data lake may be marked. The mark indicates that the file corresponding to the first file name has an associated data field and data table.

Step S503, executing at least one round of iterative process according to the unlabeled data fields in each data table and the unlabeled files in each storage bucket until the set iteration stopping condition is met.

In an embodiment of the present application, the stop iteration condition may include at least one of:

First item: no unlabeled files are present in any storage bucket, i.e., the file names of all files in the storage buckets are associated with data fields in the data tables, or the number of unlabeled files is less than a set first number threshold.

Second item: there are no unlabeled data fields in any first data set, i.e., all data fields in the first data sets have associated file names, or the number of unlabeled data fields is less than a set second number threshold.

Third item: the file name of each unlabeled file neither matches nor is consistent with any unlabeled data field in the first data sets.

Fourth item: in any round of the iterative process, before matching or similarity calculation is performed between the second file name sets and the second data sets, it is determined that the number of file names contained in each second file name set is smaller than a set third number threshold.

In an embodiment of the present application, for a first round of iterative process, the following steps may be included:

1. Clustering the unlabeled files in each storage bucket to obtain at least one cluster, and generating at least one second file name set corresponding to the first round of the iterative process according to the file names of the files in the at least one cluster; that is, the file names of the files in the same cluster may be used as one second file name set.

2. Dividing the unlabeled data fields in each data table by columns or by rows to obtain at least one second data set corresponding to the first round of the iterative process.

3. For any second file name set, determining, according to each file name in the any second file name set, a second data set similar to the any second file name set from the second data sets, and taking that second data set as a similar data set.

4. For any second data field in the similar data set, when a second file name in the any second file name set matches the any second data field, generating second associated blood-edge information according to the table name of the data table where the similar data set is located, the any second data field, and the second file name.

5. Marking each second data field that has an associated file name in the data table where the similar data set is located, so as to obtain the unlabeled data fields in each data table as updated by this round of the iterative process.

6. Marking, in each storage bucket, the files corresponding to the second file names that have associated second data fields, so as to obtain the unlabeled files in each storage bucket as updated by this round of the iterative process.

For non-first round iterative processes, the following steps may be included:

1. Determining, according to the unlabeled data fields in each data table and the unlabeled files in each storage bucket as updated by the previous round of the iterative process, whether the set stop iteration condition is met; stopping the iteration if it is met, and executing the next step if it is not.

2. Clustering the unlabeled files in each storage bucket as updated by the previous round of the iterative process to obtain at least one cluster, and generating at least one second file name set corresponding to the current round of the iterative process according to the file names of the files in the at least one cluster; that is, the file names of the files in the same cluster may be used as one second file name set.

3. Dividing the unlabeled data fields in each data table as updated by the previous round of the iterative process by columns or by rows to obtain at least one second data set corresponding to the current round of the iterative process.

4. For any second file name set, determining, according to each file name in the any second file name set, a second data set similar to the any second file name set from the second data sets, and taking that second data set as a similar data set.

5. For any second data field in the similar data set, when a second file name in the any second file name set matches the any second data field, generating second associated blood-edge information according to the table name of the data table where the similar data set is located, the any second data field, and the second file name.

6. Marking each second data field that has an associated file name in the data table where the similar data set is located, so as to obtain the unlabeled data fields in each data table as updated by this round of the iterative process.

7. Marking, in each storage bucket, the files corresponding to the second file names that have associated second data fields, so as to obtain the unlabeled files in each storage bucket as updated by this round of the iterative process.

It should be noted that, in practical applications, the stop iteration condition may also include other conditions. For example, the stop iteration condition may also include: any second file name set in any round of the iterative process is dissimilar to every second data set, for example, the similarity between the any second file name set and each second data set is smaller than a set similarity threshold.

In conclusion, iterative calculation can be performed on the unlabeled files based on the unlabeled data fields, which improves the richness and completeness of the generated associated blood-edge information.
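The control flow of the iterative process above can be sketched as follows. This is a simplified illustration, not the procedure of the application: the clustering, similarity, and matching steps of one round are collapsed into a caller-supplied function `match_round`, and the names `run_rounds`, `exact_stem_match`, and `max_rounds` are hypothetical. Marking is modelled by removing associated files and data fields from the unlabeled pools.

```python
def run_rounds(unlabeled_files, unlabeled_fields, match_round, max_rounds=10):
    """unlabeled_files: set of file names.
    unlabeled_fields: set of (table_name, data_field) pairs.
    match_round(files, fields) returns (table_name, data_field, file_name) triples."""
    lineage = []
    for _ in range(max_rounds):
        # First and second items: nothing is left unlabeled on one side.
        if not unlabeled_files or not unlabeled_fields:
            break
        found = match_round(unlabeled_files, unlabeled_fields)
        # Third item: no remaining file name matches any remaining data field.
        if not found:
            break
        lineage.extend(found)
        # Marking: drop the associated files and data fields from the pools.
        unlabeled_files = unlabeled_files - {f for _, _, f in found}
        unlabeled_fields = unlabeled_fields - {(t, d) for t, d, _ in found}
    return lineage, unlabeled_files, unlabeled_fields


def exact_stem_match(files, fields):
    """Hypothetical matcher: a data field matches the file whose name,
    with the extension removed, is equal to the data field."""
    stems = {name.rsplit(".", 1)[0]: name for name in files}
    return [(t, d, stems[d]) for t, d in sorted(fields) if d in stems]


files = {"a001.jpg", "b002.png", "orphan.txt"}
fields = {("person", "a001"), ("car", "b002"), ("person", "zzz")}
lineage, left_files, left_fields = run_rounds(files, fields, exact_stem_match)
```

In this toy run the second round finds no further match, so the loop stops with one file and one data field still unlabeled.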

In any embodiment of the application, the unstructured data stored in a data lake can be classified per storage bucket (Bucket) by an intelligent algorithm and compared for similarity with the data fields in the data tables, so as to find the columns that may record file names. The corresponding data table and column are then accurately located based on a keyword inverted index, thereby establishing the associated blood-edge relationship (that is, the data lineage) between the data table and the unstructured data. The method specifically comprises the following steps:

Step 1, for the unstructured data stored in each storage bucket (Bucket) in the data lake, the files are clustered into a plurality of file sets (referred to as clusters in this application) by a machine-learning clustering algorithm, such as the k-means clustering algorithm, according to file names, file types, or other metadata information (including creation time, creator, and the like), and the file names of all files in each file set are extracted to obtain a file name set.

Step 2, the file name sets are compared for similarity with the data tables stored in the data lake. First, the data fields in each data table are filtered to remove data fields that cannot be file names, for example highly repetitive data fields such as enumerated values, Boolean values, and dates. Next, each column that can participate in the calculation is taken as one data set, which yields a plurality of data sets. Because a very large data volume slows the calculation, a data set that contains too many data fields may be sampled, for example by extracting 10,000 data fields as the data set sample. Then, the similarity between each file name set and each data set is calculated, for example by an embedding (Embedding) algorithm in deep learning, so as to obtain the data sets similar to each file name set and the table names of the data tables in which those data sets are located.

Step 3, when a file name set is similar to a plurality of data sets, those data sets can be analyzed further to find the most accurate one. For example, the file name set and the data sets to be compared further may be stored in an ES analysis database; with the file names as keywords, an inverted index over the data sets is used to obtain the hits of each file name in the file name set within the similar data sets, and the data set with the highest hit rate or hit count is taken as the data set to be finally associated.

Step 4, after the file name set, the data set similar to the file name set, and the table name of the data table where the data set is located are obtained, the file name set is searched with each data field in the data set as a keyword to determine whether the file name set contains a file name that matches or is consistent with the data field. If such a file name exists, the data table, the data field, and the file name are recorded to form associated blood-edge information, which is entered into the blood-edge information database. The associated blood-edge information can also be synchronously updated into the file metadata. If no file name matches the data field, the association is left empty for the time being.

Step 5, after each file name set has been compared and associated with its data table and data fields, associated blood-edge information is established for the associated file names, data tables, and data fields. The file names and data fields that have not been associated are then extracted, that is, those already associated are removed, and steps 1 to 4 are repeated on the remaining data in an attempt to associate it further. The iterative calculation stops when the remaining file name sets and data sets are no longer similar.

Step 6, through multiple iterations, most unstructured data can eventually establish associated blood-edge relationships with the corresponding data tables.

Step 7, the above is a self-learning method for establishing associated blood-edge information without manual intervention. If the files in the data lake are named according to a specified naming rule, steps 2 and 3 above can be simplified and the establishment of associated blood-edge information accelerated, but this places higher quality requirements on the data tables entering the data lake.

In summary, in the data lake, associated blood-edge information can be obtained by analyzing the unstructured data and the data tables with intelligent algorithms.
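The keyword search of step 4 can be sketched as below. The record type `LineageRecord`, the function `build_lineage`, and the matching rule are assumptions made for illustration: a data field is treated as matching a file name when it equals either the full file name or the file name without its extension, which is only one possible reading of "matches or is consistent with". Writing the records to a blood-edge information database or into file metadata is outside the scope of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageRecord:
    table_name: str
    data_field: str
    file_name: str


def build_lineage(table_name, data_fields, file_name_set):
    """Use each data field as a keyword and look it up among the file names."""
    by_key = {}
    for name in file_name_set:
        by_key[name] = name                               # full file name
        by_key.setdefault(name.rsplit(".", 1)[0], name)   # name without extension
    records, unmatched = [], []
    for field in data_fields:
        hit = by_key.get(str(field))
        if hit is not None:
            records.append(LineageRecord(table_name, str(field), hit))
        else:
            unmatched.append(field)  # left unassociated for a later round
    return records, unmatched


records, unmatched = build_lineage(
    "person",
    ["id_0001.jpg", "id_0002", "id_0404"],
    ["id_0001.jpg", "id_0002.jpg"],
)
```

The unmatched data fields returned here are the ones that step 5 would carry into the next iteration.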

Corresponding to the data processing methods provided in the above embodiments, an embodiment of the present application further provides a data processing apparatus. Since the data processing apparatus provided in the embodiment of the present application corresponds to the data processing method provided in the above-described several embodiments, implementation of the data processing method is also applicable to the data processing apparatus provided in the embodiment, and will not be described in detail in the embodiment.

Fig. 6 is a schematic structural view of a data processing apparatus according to an embodiment of the present application.

As shown in fig. 6, the data processing apparatus 600 may include: a first acquisition module 601, a second acquisition module 602, a determination module 603, and a generation module 604.

The first obtaining module 601 is configured to obtain a plurality of first filename sets, where the first filename sets include filenames of files in a same cluster, and the cluster is obtained by clustering each file in at least one storage bucket in the data lake.

A second obtaining module 602, configured to obtain a plurality of first data sets, where the first data sets include data fields in a same column or a same row in a same data table in a data lake.

A determining module 603, configured to determine, for any first set of file names, a target data set similar to any first set of file names from the plurality of first data sets according to each file name in any first set of file names.

The generating module 604 is configured to, for any first data field in the target data set, generate first associated blood-edge information according to the table name of the data table in which the target data set is located, the any first data field, and a first file name, when the first file name in the any first file name set matches the any first data field.

As a possible implementation manner of the embodiment of the present application, the determining module 603 is specifically configured to: determining a first similarity between any first file name set and each first data set according to each file name in any first file name set; determining candidate data sets from the first data sets according to the first similarity of the first data sets, wherein the first similarity of the candidate data sets is higher than a set similarity threshold; a target data set is determined from the candidate data sets.

As a possible implementation manner of the embodiment of the present application, the determining module 603 is specifically configured to: determining the hit number of each file name in any first file name set in any candidate data set; determining a target data set from the candidate data sets according to the hit number of the candidate data sets; or the candidate data set with the largest first similarity is taken as the target data set.
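The hit-count selection among candidate data sets can be sketched with a small in-memory inverted index. This stands in for a search engine such as the ES analysis database mentioned in the method description; the function names `build_inverted_index` and `pick_target`, exact string equality as the hit criterion, and the sample data are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(candidate_sets):
    """Map each data field value to the ids of the candidate sets containing it."""
    index = defaultdict(set)
    for set_id, fields in candidate_sets.items():
        for value in fields:
            index[str(value)].add(set_id)
    return index


def pick_target(file_names, candidate_sets):
    """Count, per candidate data set, how many file names hit it, and
    return the candidate with the most hits together with all counts."""
    index = build_inverted_index(candidate_sets)
    hits = {set_id: 0 for set_id in candidate_sets}
    for name in file_names:
        for set_id in index.get(name, ()):
            hits[set_id] += 1
    target = max(hits, key=hits.get)  # ties resolve to the first candidate
    return target, hits


candidates = {
    ("person", "photo"): ["a.jpg", "b.jpg", "c.jpg"],
    ("archive", "doc"): ["a.jpg", "x.pdf"],
}
target, hits = pick_target(["a.jpg", "b.jpg", "d.jpg"], candidates)
```

The index is built once per group of candidates, so each file name costs a single dictionary lookup rather than a scan of every data set.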

As a possible implementation manner of the embodiment of the present application, the determining module 603 is specifically configured to: for any first file name set, extract features of each file name in the any first file name set to obtain a first text feature of the any first file name set; extract features of each data field in each first data set to obtain a second text feature of each first data set; and determine the first similarity between the any first file name set and each first data set according to a second similarity between the first text feature and each second text feature.

As a possible implementation manner of the embodiment of the present application, the determining module 603 is specifically configured to: for any first data set, sample the data fields in the any first data set to obtain a sampled first data set, wherein the sampled first data set contains a set number of data fields; and extract features of each data field contained in the sampled first data set to obtain the second text feature of the any first data set.
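The feature extraction, sampling, and similarity calculation described in the two preceding paragraphs can be illustrated as follows. The application mentions a deep-learning embedding algorithm; this sketch substitutes a bag of character trigrams compared by cosine similarity so that it runs with the standard library alone. The function names, the fixed random seed, and the sample data are assumptions, and the default limit of 10,000 follows the sampling example given in the method description.

```python
import math
import random
from collections import Counter


def text_feature(strings, n=3):
    """Bag of character n-grams over all strings (a crude stand-in for an embedding)."""
    feature = Counter()
    for s in strings:
        s = str(s)
        for i in range(max(len(s) - n + 1, 1)):
            feature[s[i:i + n]] += 1
    return feature


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def sample_fields(fields, limit=10_000, seed=0):
    """Return at most `limit` data fields, sampled without replacement."""
    fields = list(fields)
    if len(fields) <= limit:
        return fields
    return random.Random(seed).sample(fields, limit)


def first_similarity(file_names, data_set, limit=10_000):
    """Similarity between a file name set and a (possibly sampled) data set."""
    return cosine(text_feature(file_names),
                  text_feature(sample_fields(data_set, limit)))


names = ["id_0001.jpg", "id_0002.jpg"]
sim_photo = first_similarity(names, ["id_0001.jpg", "id_0003.jpg"])
sim_age = first_similarity(names, ["30", "41"])
```

A column of photograph names scores far higher against the file name set than a column of ages, which shares no trigram with it.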

As a possible implementation manner of the embodiment of the present application, the second obtaining module 602 is specifically configured to: acquire a plurality of initial data sets, wherein each initial data set comprises all data fields in the same column or the same row in the same data table in a data lake; determine a retention score of any initial data set according to the repetition rate of the data fields in the any initial data set, wherein the retention score is negatively correlated with the repetition rate; and determine each first data set from the initial data sets according to the retention score of each initial data set.

As a possible implementation manner of the embodiment of the present application, the first obtaining module 601 is specifically configured to: for any storage bucket in a data lake, acquire file information of each file in the any storage bucket; cluster the files in the any storage bucket according to the file information of each file in the any storage bucket to obtain at least one cluster; and generate a first file name set according to the file names of the files in the same cluster; wherein the file information includes at least one of: file name, file type, and file metadata.

As a possible implementation manner of the embodiment of the present application, the data processing apparatus 600 may further include:

An updating module, configured to update the file metadata of the file corresponding to the first file name according to the first associated blood-edge information.

A first marking module, configured to mark the any first data field in the data table where the target data set is located.

A second marking module, configured to mark the file corresponding to the first file name in at least one storage bucket.

An execution module, configured to execute at least one round of an iterative process according to the unlabeled data fields in each data table and the unlabeled files in each storage bucket until a set stop iteration condition is met.

Wherein the stop iteration condition includes at least one of: no unlabeled files exist in any storage bucket; no unlabeled data fields exist in any first data set; the unlabeled data fields in the first data sets do not match the file names of the unlabeled files.

As one possible implementation manner of the embodiment of the present application, any round of iterative process includes:

acquiring at least one second file name set, wherein the second file name set comprises the file names of the files in the same file set, and the file set is obtained by clustering the unlabeled files in each storage bucket as updated by the previous round of the iterative process;

acquiring at least one second data set, wherein the second data set comprises the unlabeled data fields in the same column or the same row in the same data table as updated by the previous round of the iterative process;

for any second file name set, determining, according to each file name in the any second file name set, a similar data set similar to the any second file name set from the second data sets;

for any second data field in the similar data set, when a second file name in the any second file name set matches the any second data field, generating second associated blood-edge information according to the table name of the data table where the similar data set is located, the any second data field, and the second file name;

marking the any second data field in the data table where the similar data set is located to obtain the unlabeled data fields in each data table as updated by this round of the iterative process;

marking the file corresponding to the second file name in at least one storage bucket to obtain the unlabeled files in each storage bucket as updated by this round of the iterative process.

The data processing device in the embodiment of the application acquires a plurality of first file name sets and a plurality of first data sets; for any first file name set, determines a target data set similar to the any first file name set from the plurality of first data sets according to each file name in the any first file name set; and, for any first data field in the target data set, when a first file name in the any first file name set matches the any first data field, generates first associated blood-edge information according to the table name of the data table where the target data set is located, the any first data field, and the first file name. In this way, associated blood-edge information is obtained by calculating the similarity between the file names of clustered unstructured files in a data lake and the data fields in the same column or the same row of a structured data table, so that mapping association information between structured data tables and unstructured files can be effectively established, and the effectiveness of querying or accessing unstructured files is improved.

In order to implement the above embodiment, the present application further provides an electronic device, and fig. 7 is a schematic structural diagram of the electronic device provided in the embodiment of the present application. The electronic device includes:

memory 701, processor 702, and computer programs stored on memory 701 and executable on processor 702.

The processor 702, when executing the programs, implements the data processing methods provided in any of the embodiments described above.

Further, the electronic device further includes:

a communication interface 703 for communication between the memory 701 and the processor 702.

The memory 701 is configured to store a computer program executable on the processor 702.

The memory 701 may include a high-speed RAM memory, and may further include a non-volatile memory, such as at least one magnetic disk memory.

The processor 702 is configured to implement the data processing method according to any one of the foregoing embodiments when executing the program.

If the memory 701, the processor 702, and the communication interface 703 are implemented independently, the communication interface 703, the memory 701, and the processor 702 may be connected to one another through a bus and communicate with one another. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be classified as an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 7, but this does not mean that there is only one bus or only one type of bus.

Alternatively, in a specific implementation, if the memory 701, the processor 702, and the communication interface 703 are integrated on a chip, the memory 701, the processor 702, and the communication interface 703 may communicate with each other through internal interfaces.

The processor 702 may be a central processing unit (Central Processing Unit, abbreviated as CPU) or an application specific integrated circuit (Application Specific Integrated Circuit, abbreviated as ASIC) or one or more integrated circuits configured to implement embodiments of the present application.

In order to implement the above embodiments, the embodiments of the present application further propose a non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the data processing method provided in any of the above embodiments.

In order to implement the above embodiments, the embodiments of the present application further propose a computer program product, wherein the data processing method provided in any of the above embodiments is implemented when instructions in the computer program product are executed by a processor.

In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of the different embodiments or examples, provided that they do not contradict one another.

Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" is at least two, such as two, three, etc., unless explicitly defined otherwise.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and additional implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present application.

Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., a ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.

It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, the steps or methods may be implemented using any one or a combination of the following techniques, which are well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.

Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the methods of the above-described embodiments may be implemented by a program instructing related hardware, where the program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.

In addition, each functional unit in each embodiment of the present application may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.

The above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like. Although embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the application, and that changes, modifications, substitutions, and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the application.

Claims

1. A method of data processing, the method comprising:

2. The method of claim 1, wherein the determining, for any first file name set, a target data set similar to that first file name set from the plurality of first data sets according to each file name in that first file name set, comprises:

for any first file name set, determining a first similarity between that first file name set and each first data set according to each file name in that first file name set;

determining a candidate data set from each first data set according to the first similarity of each first data set, wherein the first similarity of the candidate data set is higher than a set similarity threshold;

and determining the target data set from the candidate data sets.

3. The method of claim 2, wherein said determining said target data set from each of said candidate data sets comprises:

determining the hit number of each file name in any candidate data set in any first file name set;

determining the target data set from each candidate data set according to the hit number of each candidate data set;

or,

taking the candidate data set with the maximum first similarity as the target data set.
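By way of non-limiting illustration only, the candidate screening and target selection recited in claims 2 and 3 may be sketched in Python as follows. The function names `hit_count` and `select_target`, the rule that a file name "hits" a data set when its name without extension equals a data field, and the choice of the candidate with the largest hit number are assumptions made for the sketch and are not recited in the claims.

```python
import os
from typing import Dict, Iterable, Optional, Set


def hit_count(file_names: Iterable[str], fields: Set[str]) -> int:
    """Count file names whose stem (name without extension) equals a data field.

    Exact stem equality is an assumed matching rule, used only for illustration.
    """
    return sum(1 for name in file_names if os.path.splitext(name)[0] in fields)


def select_target(similarity: Dict[str, float],
                  threshold: float,
                  hits: Optional[Dict[str, int]] = None) -> Optional[str]:
    """Pick a target data set among candidates above the similarity threshold.

    similarity maps a data set name to its first similarity.
    If hits is given, the candidate with the largest hit number wins (assumption);
    otherwise the candidate with the maximum first similarity is taken.
    """
    candidates = [name for name, s in similarity.items() if s > threshold]
    if not candidates:
        return None
    if hits is not None:
        return max(candidates, key=lambda name: hits.get(name, 0))
    return max(candidates, key=lambda name: similarity[name])
```

In this sketch, a data set whose first similarity does not exceed the threshold is never selected, even if its hit number is large, which mirrors the order of the steps in claim 2.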

4. The method of claim 2, wherein the determining, for any first set of filenames, a first similarity between the any first set of filenames and each of the first data sets according to each of the filenames in the any first set of filenames comprises:

for any first file name set, extracting features of each file name in that first file name set to obtain first text features of that first file name set;

extracting features of each data field in each first data set to obtain second text features of each first data set;

and determining the first similarity between any one of the first file name sets and each of the first data sets according to the second similarity between the first text features and each of the second text features.
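As a hedged illustration of claim 4, the following Python sketch uses character trigram counts as the text features and cosine similarity as the second similarity, and takes the first similarity to be equal to that cosine value. These choices, together with the names `text_feature`, `cosine` and `first_similarity`, are assumptions for the sketch; the claim does not prescribe a particular feature extractor or similarity measure.

```python
import math
import os
from collections import Counter
from typing import Dict, Iterable, List


def text_feature(strings: Iterable[str], n: int = 3) -> Counter:
    """Bag of character n-grams over all strings (assumed feature extractor)."""
    feats: Counter = Counter()
    for s in strings:
        s = s.lower()
        if not s:
            continue
        # Strings shorter than n contribute themselves as a single gram.
        for i in range(max(len(s) - n + 1, 1)):
            feats[s[i:i + n]] += 1
    return feats


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def first_similarity(file_names: List[str],
                     data_sets: Dict[str, List[str]]) -> Dict[str, float]:
    """Similarity between one file name set and each data set."""
    # File extensions are dropped before feature extraction (assumption).
    first = text_feature(os.path.splitext(name)[0] for name in file_names)
    return {name: cosine(first, text_feature(fields))
            for name, fields in data_sets.items()}
```

For example, calling `first_similarity(["1001.jpg", "1002.jpg"], {"person.id": ["1001", "1002"], "person.name": ["zhang", "li"]})` scores the hypothetical `person.id` column far above `person.name`, because only the former shares trigrams with the file names.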

5. The method of claim 4, wherein the feature extraction of each data field in each of the first data sets to obtain the second text feature of each of the first data sets comprises:

for any first data set, sampling the data fields in that first data set to obtain a sampled first data set, wherein the sampled first data set comprises a set number of data fields;

and extracting features of each data field contained in the sampled first data set to obtain the second text features of that first data set.
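The sampling step of claim 5 may, for example, be sketched as below. Uniform random sampling without replacement, the fixed seed, and the name `sample_fields` are assumptions of the sketch; the claim only requires that the sampled data set contain a set number of data fields.

```python
import random
from typing import List, Sequence


def sample_fields(fields: Sequence[str], k: int, seed: int = 0) -> List[str]:
    """Return at most k data fields drawn uniformly without replacement.

    A seeded generator keeps the sketch reproducible (assumption).
    Data sets that already hold k or fewer fields are returned unchanged.
    """
    fields = list(fields)
    if len(fields) <= k:
        return fields
    return random.Random(seed).sample(fields, k)
```

Sampling bounds the cost of feature extraction for very long columns, since the second text features are then computed over at most `k` data fields.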

6. The method of claim 1, wherein the acquiring a plurality of first data sets comprises:

acquiring a plurality of initial data sets, wherein each initial data set comprises all data fields in the same column or the same row in the same data table in the data lake;

determining a reserve score of any initial data set according to the repetition rate of the data fields in that initial data set, wherein the reserve score is negatively correlated with the repetition rate;

and determining each of the first data sets from the initial data sets based on the reserve score of each initial data set.
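One possible reading of claim 6 is sketched below in Python. The repetition rate is taken here to be one minus the share of distinct values, the reserve score to be one minus the repetition rate, and a data set to be kept when its score reaches a minimum value. All three definitions, and the names `repetition_rate`, `reserve_score` and `keep_data_sets`, are assumptions; the claim only requires a negative correlation between the score and the repetition rate.

```python
from typing import Dict, List


def repetition_rate(fields: List[str]) -> float:
    """Share of data fields that repeat an earlier value (assumed definition)."""
    if not fields:
        return 1.0
    return 1.0 - len(set(fields)) / len(fields)


def reserve_score(fields: List[str]) -> float:
    """Score that decreases as the repetition rate increases."""
    return 1.0 - repetition_rate(fields)


def keep_data_sets(initial: Dict[str, List[str]],
                   min_score: float) -> Dict[str, List[str]]:
    """Keep the initial data sets whose reserve score reaches min_score."""
    return {name: fields for name, fields in initial.items()
            if reserve_score(fields) >= min_score}
```

Under these assumptions, an identifier-like column with all distinct values scores 1.0 and is kept, whereas a column such as gender, whose values repeat heavily and are unlikely to correspond to individual file names, scores low and is filtered out.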

7. The method of claim 1, wherein the obtaining a plurality of first filename sets comprises:

for any storage bucket in the data lake, acquiring file information of each file in that storage bucket;

clustering each file in any storage bucket according to the file information of each file in any storage bucket to obtain at least one cluster;

generating a first file name set according to the file names of the files in the same cluster;

wherein the file information includes at least one of: file name, file type, and file metadata.
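The clustering of claim 7 could be approximated, for illustration, by grouping files on a simple key. The sketch below substitutes a grouping by file type and file name "shape" (runs of digits and runs of letters collapsed to single placeholder characters) for whatever clustering algorithm an implementation actually uses; it ignores file metadata, and the names `name_shape` and `cluster_file_names` are hypothetical.

```python
import os
import re
from collections import defaultdict
from typing import Dict, List, Tuple


def name_shape(stem: str) -> str:
    """Collapse digit runs to '9' and letter runs to 'a', e.g. 'li_si' -> 'a_a'."""
    shape = re.sub(r"[0-9]+", "9", stem)
    return re.sub(r"[A-Za-z]+", "a", shape)


def cluster_file_names(file_names: List[str]) -> List[List[str]]:
    """Group the file names of one bucket by (file type, name shape).

    This key-based grouping is a stand-in for a real clustering step.
    Each returned list plays the role of one first file name set.
    """
    clusters: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for name in file_names:
        stem, ext = os.path.splitext(name)
        clusters[(ext.lower(), name_shape(stem))].append(name)
    return list(clusters.values())
```

With this grouping, purely numeric `.jpg` names end up in one file name set and underscore-separated `.txt` names in another, so each set can later be compared with a single column of a data table.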

8. The method according to any one of claims 1-7, further comprising:

updating the file metadata of the file corresponding to the first file name according to the first associated blood-edge information.

9. The method according to any one of claims 1-7, further comprising:

marking any first data field in a data table where the target data set is located;

marking a file corresponding to the first file name in the at least one storage bucket;

executing at least one round of iteration process according to the unlabeled data fields in each data table and the unlabeled files in each storage bucket until the set iteration stopping condition is met;

wherein the stop iteration condition includes at least one of:

the number of unlabeled files in each storage bucket is smaller than a set first number threshold;

the number of unlabeled data fields in each of the first data sets is less than a set second number threshold;

the file name of each unlabeled file does not match any unlabeled data field in the first data set.

10. The method of claim 9, wherein the iterative process of any one round comprises:

acquiring at least one second file name set, wherein the second file name set comprises the file names of all files in the same file set, and the file set is obtained by clustering the unlabeled files in the storage buckets as updated by the previous round of the iteration process;

acquiring at least one second data set, wherein the second data set comprises the unlabeled data fields in the same column or the same row of the same data table as updated by the previous round of the iteration process;

for any second data field in the similar data set, when a second file name in any second file name set matches that second data field, generating second associated blood-edge information according to the table name of the data table where the similar data set is located, that second data field, and the second file name;

marking that second data field in the data table where the similar data set is located, to obtain the unlabeled data fields in each data table as updated by the current round of the iteration process;

and marking the file corresponding to the second file name in the at least one storage bucket, to obtain the unlabeled files in each storage bucket as updated by the current round of the iteration process.
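The iteration of claims 9 and 10 may be illustrated, under simplifying assumptions, by the following Python sketch. It treats each column as one data set, uses a single storage bucket, matches a file to a data field when the file name without extension equals the field, picks in each round the column with the most matches as the "similar data set", and omits the per-round clustering. The name `run_rounds`, the tuple layout of the lineage records, and the `max_rounds` guard are hypothetical.

```python
import os
from typing import Dict, List, Set, Tuple

# (table.column, data field, file name): an assumed layout for one lineage record.
Lineage = Tuple[str, str, str]


def run_rounds(columns: Dict[str, List[str]],
               file_names: List[str],
               min_files: int = 1,
               min_fields: int = 1,
               max_rounds: int = 100) -> Tuple[List[Lineage], Set[str]]:
    """Match unlabeled files to unlabeled data fields round by round."""
    unlabeled_files = set(file_names)
    unlabeled = {col: set(vals) for col, vals in columns.items()}
    lineage: List[Lineage] = []
    for _ in range(max_rounds):
        # Stop conditions 1 and 2: too few unlabeled files or data fields remain.
        if len(unlabeled_files) < min_files:
            break
        if all(len(vals) < min_fields for vals in unlabeled.values()):
            break
        by_stem = {os.path.splitext(f)[0]: f for f in unlabeled_files}
        stems = set(by_stem)
        # The column with the most matching fields stands in for the similar data set.
        best = max(sorted(unlabeled), key=lambda col: len(unlabeled[col] & stems))
        hits = sorted(unlabeled[best] & stems)
        # Stop condition 3: no unlabeled file name matches any unlabeled field.
        if not hits:
            break
        for field in hits:
            lineage.append((best, field, by_stem[field]))
            unlabeled_files.discard(by_stem[field])  # mark the file
        unlabeled[best] -= set(hits)                 # mark the data fields
    return lineage, unlabeled_files
```

Each round removes the matched files and data fields from the unlabeled pools, so later rounds only consider what earlier rounds left unmatched, and the loop ends once one of the three stop conditions of claim 9 holds.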

11. A data processing apparatus, the apparatus comprising:

12. An electronic device, comprising:

a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the data processing method according to any one of claims 1-10 when executing the program.

13. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the data processing method according to any one of claims 1-10.