CN113220688A - Method and device for splicing data records - Google Patents
Method and device for splicing data records
- Publication number
- CN113220688A (application CN202110564742.7A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2282—Tablespace storage structures; Management thereof
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A method and apparatus for splicing data records are provided. The method comprises: a data table specifying step of specifying, according to a data table specifying operation of a user, at least two data tables whose data records are to be spliced; an association field specifying step of specifying corresponding association fields among the fields of each data table according to an association field specifying operation of the user; an output field configuration step of configuring a source field of an output field and a processing mode for the source field according to an output field configuration operation of the user; and an output field generating step of, for the data records to be spliced whose corresponding association fields have the same field value in each data table, processing the field value of the configured source field according to the configured processing mode to generate the field value of the output field. The method and apparatus improve the flexibility and diversity of data record splicing.
Description
This application is a divisional application of the patent application with a filing date of July 4, 2017, application number 201710538681.0, entitled "Method and device for splicing data records".
Technical Field
The present invention relates generally to the field of information technology, and more particularly, to a method and apparatus for splicing data records.
Background
With the emergence of massive data in various industries, data must be processed in an increasing number of scenarios, for example by using machine learning techniques to mine its value. Machine learning is an inevitable product of artificial intelligence research reaching a certain stage, and aims to improve the performance of a system through computation and the use of experience. In a computer system, "experience" usually takes the form of "data", from which a machine learning algorithm can generate a "model": given empirical data, the algorithm produces a model that provides a corresponding judgment, i.e., a prediction, when faced with a new sample. Data, as the raw material of machine learning, therefore affects its final effect. Data consequently needs to be continuously accumulated, updated, or expanded, which creates a strong demand for an efficient and flexible method of splicing data records.
The data record splicing approaches commonly used at present are mainly: writing a program with SQL (Structured Query Language) statements; or using the visual splicing functions provided by products such as the Alibaba Cloud big data platform "Data Plus" and the Microsoft cloud computing system "Azure".
However, splicing data records with SQL statements requires the user to master SQL syntax, which carries a high learning cost. "Data Plus" and Azure provide visual interactive interfaces that lower the barrier for users, but the splicing scenarios they can handle are too limited and inflexible.
Disclosure of Invention
An exemplary embodiment of the present invention is to provide a method and an apparatus for splicing data records, so as to solve the above problems in the prior art.
According to an exemplary embodiment of the invention, a method of splicing data records is provided, comprising: a data table specifying step of specifying, according to a data table specifying operation of a user, at least two data tables whose data records are to be spliced, wherein one row of a data table corresponds to one data record and one column of a data table corresponds to one field; an association field specifying step of specifying corresponding association fields among the fields of each data table according to an association field specifying operation of the user; an output field configuration step of configuring a source field of an output field and a processing mode for the source field according to an output field configuration operation of the user, wherein the output field is a field of an output data record that is the result of data record splicing, and the source field is the field in a data table on which the output field is based; and an output field generating step of, for the data records to be spliced whose corresponding association fields have the same field value in each data table, processing the field value of the configured source field according to the configured processing mode to generate the field value of the output field.
Optionally, the method further comprises: an output data record generation step of generating output data records in an output data table based on the generated field values of the respective output fields.
Optionally, the arrangement order of the output fields in the output data table is set according to the output field configuration operation of the user; or the arrangement sequence of each output field in the output data table is set according to the arrangement sequence of the at least two data tables and the arrangement sequence of the source field of each output field in each data table.
Optionally, the at least two data tables comprise a main table and at least one splicing table, wherein the output field configuring step is performed only for the at least one splicing table, and wherein in the output data record generating step, the output data records in the output data table are generated by appending field values of the generated respective output fields to the data records to be spliced in the main table.
Optionally, the source fields further include, by default, at least one corresponding association field, wherein the position in the output data table of the output field whose source field is the corresponding association field is set according to the output field configuration operation of the user or according to a preset position.
Optionally, in the output field configuration step, the name of the output field is also configured according to the output field configuration operation of the user.
Optionally, the processing mode includes a direct extraction mode and/or an aggregation processing mode, where in the direct extraction mode, a field value of a source field of a single data record to be spliced in the data table is directly used as a field value of an output field; and under the aggregation processing mode, performing aggregation operation on the field value of the source field of at least one of the data records to be spliced in the data table to be used as the field value of the output field.
Optionally, the aggregation processing mode includes a direct aggregation processing mode, where in the direct aggregation processing mode, an aggregation operation is performed on field values of source fields of the multiple data records to be spliced in the data table to serve as field values of the output fields.
Optionally, the at least two data tables include a main table and at least one splicing table, and the aggregation processing mode includes a time-series aggregation processing mode. When the time-series aggregation processing mode is configured, a base cursor field, a splicing cursor field, an aggregation range, and an aggregation operation are configured according to the output field configuration operation of the user. In the time-series aggregation processing mode, an aggregation operation is performed on the field values of the source field of those data records to be spliced in the splicing table that fall within the time-series range, and the result is used as the field value of the output field. A data record to be spliced falls within the time-series range when the field value of its splicing cursor field lies within the range obtained by extending the aggregation range forward and/or backward from the field value of the base cursor field of the data record to be spliced in the main table.
Optionally, the aggregation operation comprises at least one of: summing, averaging, taking the maximum value, taking the minimum value and calculating the number.
According to another exemplary embodiment of the invention, a computer-readable storage medium is provided, in which a computer program is stored, wherein the computer program is configured to cause a processor of a computer to perform the following steps: a data table specifying step of specifying, according to a data table specifying operation of a user, at least two data tables whose data records are to be spliced, wherein one row of a data table corresponds to one data record and one column of a data table corresponds to one field; an association field specifying step of specifying corresponding association fields among the fields of each data table according to an association field specifying operation of the user; an output field configuration step of configuring a source field of an output field and a processing mode for the source field according to an output field configuration operation of the user, wherein the output field is a field of an output data record that is the result of data record splicing, and the source field is the field in a data table on which the output field is based; and an output field generating step of, for the data records to be spliced whose corresponding association fields have the same field value in each data table, processing the field value of the configured source field according to the configured processing mode to generate the field value of the output field.
According to another exemplary embodiment of the present invention, an apparatus for splicing data records is provided, including: a data table specifying unit configured to specify, according to a data table specifying operation of a user, at least two data tables whose data records are to be spliced, wherein one row of a data table corresponds to one data record and one column of a data table corresponds to one field; an association field specifying unit configured to specify corresponding association fields among the fields of each data table according to an association field specifying operation of the user; an output field configuration unit configured to configure a source field of an output field and a processing mode for the source field according to an output field configuration operation of the user, wherein the output field is a field of an output data record that is the result of data record splicing, and the source field is the field in a data table on which the output field is based; and an output field generating unit configured to, for the data records to be spliced whose corresponding association fields have the same field value in each data table, process the field value of the configured source field according to the configured processing mode to generate the field value of the output field.
Optionally, the apparatus further comprises: an output data record generating unit configured to generate output data records in an output data table based on the generated field values of the respective output fields.
Optionally, the arrangement order of the output fields in the output data table is set according to the output field configuration operation of the user; or the arrangement sequence of each output field in the output data table is set according to the arrangement sequence of the at least two data tables and the arrangement sequence of the source field of each output field in each data table.
Optionally, the at least two data tables include a main table and at least one splicing table, wherein the output field configuration unit performs an output field configuration operation only for the at least one splicing table, and the output data record generation unit generates the output data records in the output data table by appending the generated field values of the respective output fields to the data records to be spliced in the main table.
Optionally, the source fields further include, by default, at least one corresponding association field, wherein the position in the output data table of the output field whose source field is the corresponding association field is set according to the output field configuration operation of the user or according to a preset position.
Optionally, the output field configuration unit further configures the name of the output field according to an output field configuration operation of the user.
Optionally, the processing mode includes a direct extraction mode and/or an aggregation processing mode, where the output field generating unit directly uses a field value of a source field of a single data record to be spliced in the data table as a field value of the output field in the direct extraction mode; the output field generating unit performs aggregation operation on the field value of the source field of at least one of the plurality of data records to be spliced in the data table in an aggregation processing mode to serve as the field value of the output field.
Optionally, the aggregation processing mode includes a direct aggregation processing mode, where the output field generating unit performs aggregation operation on field values of source fields of the multiple data records to be spliced in the data table in the direct aggregation processing mode to serve as the field values of the output fields.
Optionally, the at least two data tables include a main table and at least one splicing table, and the aggregation processing mode includes a time-series aggregation processing mode. When the time-series aggregation processing mode is configured, the output field configuration unit configures a base cursor field, a splicing cursor field, an aggregation range, and an aggregation operation according to the output field configuration operation of the user. In the time-series aggregation processing mode, the output field generating unit performs an aggregation operation on the field values of the source field of those data records to be spliced in the splicing table that fall within the time-series range, and uses the result as the field value of the output field. A data record to be spliced falls within the time-series range when the field value of its splicing cursor field lies within the range obtained by extending the aggregation range forward and/or backward from the field value of the base cursor field of the data record to be spliced in the main table.
Optionally, the aggregation operation comprises at least one of: summing, averaging, taking the maximum value, taking the minimum value and calculating the number.
The method and apparatus for splicing data records according to exemplary embodiments of the invention provide a data record splicing process that is more efficient, more flexible, and applicable to more diverse scenarios: the user only needs to specify the data tables, set the splicing association conditions, and configure the output as needed. Furthermore, data records in different data tables can be spliced through indirect operations according to the user's needs, and in particular, splicing that involves time series can be performed.
Additional aspects and/or advantages of the present general inventive concept will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the general inventive concept.
Drawings
The above and other objects and features of exemplary embodiments of the present invention will become more apparent from the following description taken in conjunction with the accompanying drawings which illustrate exemplary embodiments, wherein:
FIG. 1 shows a flow diagram of a method of splicing data records according to an exemplary embodiment of the present invention;
FIG. 2 shows a flow diagram of a method of splicing data records according to another exemplary embodiment of the present invention;
FIG. 3 illustrates an example of a user specifying data tables and corresponding associated fields through a graphical user interface according to an illustrative embodiment of the present invention;
FIG. 4 illustrates an example of a user configuring an output field through a graphical user interface according to an exemplary embodiment of the present invention;
FIG. 5 illustrates another example of a user specifying data tables and corresponding associated fields through a graphical user interface according to an exemplary embodiment of the present invention;
FIG. 6 illustrates another example of a user configuring an output field through a graphical user interface according to an exemplary embodiment of the present invention;
FIG. 7 illustrates a block diagram of an apparatus for splicing data records according to an exemplary embodiment of the present invention;
FIG. 8 shows a block diagram of an apparatus for splicing data records according to another exemplary embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the like elements throughout. The embodiments are described below in order to explain the present invention by referring to the figures.
Fig. 1 shows a flowchart of a method of splicing data records according to an exemplary embodiment of the present invention. The method may be performed by a computer program or by a dedicated device for splicing data records.
In step S10, at least two data tables whose data records are to be spliced are specified according to a data table specifying operation of a user. Here, one row of a data table corresponds to one data record, and one column of a data table corresponds to one field. In other words, each data record in a data table has fields and corresponding field values. By way of example, a field in a data table may describe one aspect of an object (e.g., name, age, or occupation), and at least one data record in the data table may describe at least one aspect of an object; for example, multiple data records in the data table may describe the same object.
By way of example, a main table and at least one splicing table whose data records are to be spliced may be specified according to the data table specifying operation of the user.
In the prior art, a user who needs to splice multiple data tables can only do so by repeatedly splicing two data tables at a time. The method for splicing data records according to the exemplary embodiment of the invention allows multiple data tables to be specified at once for data record splicing, which improves the efficiency of data record splicing.
In step S20, corresponding association fields are respectively specified among the fields of each data table according to an association field specifying operation of the user. Here, the corresponding association fields are used to match up the data records in the data tables, so as to determine the corresponding data records to be spliced in each data table; an output data record can then be spliced from those corresponding data records. Specifically, the corresponding data records to be spliced are the data records whose corresponding association fields have the same field value in each data table.
It should be understood that the corresponding association fields specified in the different data tables should describe substantially the same information, so that the data records in the different data tables can be associated on that basis. However, the names of the corresponding association fields specified in the different data tables may be the same or different. For example, data table A may specify the field ID as its corresponding association field and data table B may specify the field UserID; the two names differ, but both describe essentially the same information, namely the user's ID number.
By way of example, one corresponding association field may be specified in each data table, or multiple corresponding association fields may be specified in each data table. If multiple corresponding association fields are specified in each data table, the data records for which every one of those corresponding association fields has the same field value across the data tables are taken as the corresponding data records to be spliced. For example, if corresponding association fields a and b are specified in data table A, and corresponding association fields a' and b' are specified in data table B, the data records to be spliced in data table A and data table B must satisfy: the field value of a equals the field value of a', and the field value of b equals the field value of b'.
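The matching rule for multiple association fields can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name `group_by_association`, the sample tables, and the field names are hypothetical, and records are assumed to be plain dictionaries.

```python
from collections import defaultdict

def group_by_association(records, assoc_fields):
    """Index a table's records by the tuple of their association-field values."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[f] for f in assoc_fields)
        groups[key].append(rec)
    return groups

# Hypothetical tables: table A associates on (a, b), table B on (a2, b2).
table_a = [{"a": 1, "b": "x", "v": 10}, {"a": 1, "b": "y", "v": 20}]
table_b = [{"a2": 1, "b2": "x", "w": 7}, {"a2": 2, "b2": "x", "w": 9}]

groups_a = group_by_association(table_a, ["a", "b"])
groups_b = group_by_association(table_b, ["a2", "b2"])

# Records are spliced only where every pair of association fields matches.
matched_keys = sorted(set(groups_a) & set(groups_b))
```

Here only the key `(1, "x")` appears in both tables, so only the records carrying it form a group of data records to be spliced.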
In step S30, the source field of an output field and the processing mode for the source field are configured according to an output field configuration operation of the user, where the output field is a field of the output data record that is the result of data record splicing, and the source field is the field in a data table on which the output field is based.
Specifically, the source field in each data table and the processing method thereof are specified according to the output field configuration operation of the user, and each field (i.e., the output field) of the output data record is a field obtained by processing the source field according to the corresponding processing method.
As an example, the name of the source field may be directly used as the name of the output field. As another example, the name of the output field may be configured according to a user's output field configuration operation, thereby enhancing ease of use.
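One way to picture what an output field configuration holds is the following Python sketch. The class `OutputFieldConfig` and its attribute names are assumptions made for illustration; the patent does not prescribe any data structure.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class OutputFieldConfig:
    """Hypothetical container for one configured output field."""
    table: str                  # data table that holds the source field
    source_field: str           # field on which the output field is based
    mode: str = "direct"        # assumed labels: "direct" or "aggregate"
    name: Optional[str] = None  # user-configured output name, if any

    def output_name(self) -> str:
        # Fall back to the source field name when no name was configured.
        return self.name if self.name is not None else self.source_field

default_named = OutputFieldConfig("orders", "amount")
renamed = OutputFieldConfig("orders", "amount", "aggregate", "total_amount")
```

With these values, `default_named.output_name()` yields the source field name, while `renamed.output_name()` yields the user-configured name.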
In step S40, for the data records to be spliced in each data table, which have the same field value in the corresponding associated field, the configured field value of the source field is processed according to the configured processing manner to generate the field value of the output field.
As an example, for each group of data records to be spliced in each data table, which have the same field value in the corresponding associated field (that is, the corresponding data records to be spliced in each data table together form a group of data records to be spliced), the field value of the configured source field may be processed in the configured processing manner to generate the field value of the output field forming each output data record.
As an example, the processing manner may include a direct extraction manner and/or an aggregation processing manner. Specifically, in a Direct extraction (Direct) manner, the field value of the source field of a single data record to be spliced in the data table may be directly used as the field value of the output field.
In the aggregation processing mode, the field value of the source field of at least one of the data records to be spliced in the data table may be subjected to aggregation operation to serve as the field value of the output field. Here, the plurality of data records to be spliced are data records to be spliced in the data table, and corresponding associated fields have the same field value.
The prior art can only splice one row of data records in one data table with one row of data records in another data table, so that the problem that the splicing scene is too single and not flexible enough exists. According to the exemplary embodiment of the invention, splicing of multiple rows of data records in one data table with one row of data records or multiple rows of data records in other data tables can be realized, so that various splicing scenes can be supported, and diversified requirements of users can be met.
Further, as an example, the aggregation processing manner may include a direct aggregation processing manner and/or a time-series aggregation processing manner.
As an example, in the direct aggregation processing manner, an aggregation operation may be performed on field values of source fields of a plurality of data records to be spliced in the data table to serve as field values of output fields.
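The difference between direct extraction and direct aggregation can be sketched as follows. This is an illustrative Python example with hypothetical function names and sample data; raising an error when direct extraction meets more than one matching record is an assumption of the sketch, since the text does not say how that case is handled.

```python
def direct_extract(records, source_field):
    """Direct extraction: use the source field value of a single record."""
    if len(records) != 1:
        # Assumption of this sketch: exactly one matching record is expected.
        raise ValueError("direct extraction expects exactly one matching record")
    return records[0][source_field]

def direct_aggregate(records, source_field, op):
    """Direct aggregation: apply an aggregation operation to all matching values."""
    return op([rec[source_field] for rec in records])

# Hypothetical records sharing the same association field value (uid == 1).
orders = [{"uid": 1, "amount": 30}, {"uid": 1, "amount": 50}]
profile = [{"uid": 1, "city": "Beijing"}]

total = direct_aggregate(orders, "amount", sum)
city = direct_extract(profile, "city")
```

The two order records collapse into a single output value, while the single profile record passes its value through unchanged.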
Regarding the time-series aggregation processing mode, as an example, the at least two data tables include a main table and at least one splicing table. When the time-series aggregation processing mode is configured, a base cursor field, a splicing cursor field, an aggregation range, and an aggregation operation may be configured according to the output field configuration operation of the user. In the time-series aggregation processing mode, an aggregation operation may be performed on the field values of the source field of at least one data record that falls within the time-series range, among the plurality of data records to be spliced in the splicing table, and the result is used as the field value of the output field. A data record to be spliced falls within the time-series range when the field value of its splicing cursor field lies within the range obtained by extending the aggregation range forward and/or backward from the field value of the base cursor field of the data record to be spliced in the main table.
Here, the base cursor field is a time field (e.g., a "Date" field) in the main table, the splicing cursor field is the time field (e.g., a "Date" field) in the splicing table that corresponds to the base cursor field, and the aggregation range may be a time range specified relative to the field value of the base cursor field. For example, the aggregation range may be a time range extending forward or backward from the field value of the base cursor field; alternatively, it may be a time range extending both forward and backward with the field value of the base cursor field as the midpoint.
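The time-series range test can be sketched in Python as below. The function name, the parameter names `back` and `forward`, the inclusive bounds, and the `None` result for an empty window are all assumptions of this sketch; the patent does not specify them. The records passed in are assumed to already share the same association field value.

```python
from datetime import date, timedelta

def timeseries_aggregate(main_rec, splice_recs, base_cursor, splice_cursor,
                         source_field, op,
                         back=timedelta(0), forward=timedelta(0)):
    """Aggregate source values of splicing-table records inside the time window."""
    anchor = main_rec[base_cursor]
    lo, hi = anchor - back, anchor + forward  # inclusive bounds (assumption)
    values = [rec[source_field] for rec in splice_recs
              if lo <= rec[splice_cursor] <= hi]
    return op(values) if values else None     # empty window -> None (assumption)

# Hypothetical main-table record and matching splicing-table records.
main_rec = {"uid": 1, "Date": date(2017, 7, 4)}
splice_recs = [
    {"uid": 1, "Date": date(2017, 7, 1), "amount": 10},
    {"uid": 1, "Date": date(2017, 7, 3), "amount": 20},
    {"uid": 1, "Date": date(2017, 7, 6), "amount": 40},
]

# Sum of amounts in the three days up to and including the base cursor date.
recent_total = timeseries_aggregate(main_rec, splice_recs, "Date", "Date",
                                    "amount", sum, back=timedelta(days=3))
# Window centred on the base cursor date, two days in each direction.
centred_total = timeseries_aggregate(main_rec, splice_recs, "Date", "Date",
                                     "amount", sum, back=timedelta(days=2),
                                     forward=timedelta(days=2))
```

In the first call the window covers July 1 to July 4, so the July 6 record is excluded; in the second it covers July 2 to July 6, so the July 1 record is excluded.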
As an example, the aggregation operation may include at least one of: summation (SUM), Averaging (AVG), Maximum (MAX), Minimum (MIN), Count (Count).
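The five listed operations map naturally onto standard-library functions. The following Python sketch shows one possible dispatch table; the dictionary `AGGREGATIONS` and the helper `aggregate` are hypothetical names, not part of the patent.

```python
from statistics import mean

# Operation labels as listed in the text, mapped to standard-library callables.
AGGREGATIONS = {
    "SUM": sum,
    "AVG": mean,
    "MAX": max,
    "MIN": min,
    "COUNT": len,
}

def aggregate(values, op_name):
    """Apply the named aggregation operation to a list of field values."""
    return AGGREGATIONS[op_name.upper()](values)

amounts = [10, 20, 30]
summary = {name: aggregate(amounts, name) for name in AGGREGATIONS}
```

Looking operations up by label keeps the configured aggregation operation as plain data, which suits a setting where the user chooses it through an interface.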
Fig. 2 shows a flow chart of a method of splicing data records according to another exemplary embodiment of the present invention. As shown in fig. 2, the method of splicing data records according to another exemplary embodiment of the present invention may further include step S50 in addition to step S10, step S20, step S30, and step S40 shown in fig. 1. Step S10, step S20, step S30 and step S40 can be implemented with reference to the specific embodiment described with reference to fig. 1, and will not be described herein again.
In step S50, output data records in the output data table are generated based on the field values of the respective generated output fields. It should be understood that the field values of the corresponding associated fields corresponding to the field values of the respective output fields in a row of output data records are the same.
As an example, the arrangement order of the respective output fields in the output data table may be set according to the output field configuration operation of the user; alternatively, the arrangement order of the output fields in the output data table may be set according to the arrangement order of the at least two data tables and the arrangement order of the source fields of the output fields in the data tables. For example, the arrangement order of the at least two data tables may be a sequential order of the at least two data tables specified by a data table specifying operation of a user.
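The second ordering rule (table order first, then source field position within each table) can be sketched as a sort key. The function name and the sample table layouts below are hypothetical.

```python
def default_output_order(table_order, field_order, outputs):
    """Sort (table, source_field) pairs by table order, then field position."""
    def sort_key(item):
        table, field = item
        return (table_order.index(table), field_order[table].index(field))
    return sorted(outputs, key=sort_key)

# Hypothetical specification order of the tables and their column layouts.
table_order = ["users", "orders"]
field_order = {
    "users": ["ID", "name", "age"],
    "orders": ["UserID", "Date", "amount"],
}
configured = [("orders", "amount"), ("users", "name"),
              ("orders", "Date"), ("users", "ID")]

ordered = default_output_order(table_order, field_order, configured)
```

Fields from the first specified table come first, each group keeping the column order of its own table.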
As an example, the at least two data tables include a main table and at least one splicing table, the step S40 may be performed only for the at least one splicing table, and, in the step S50, the output data records in the output data table may be generated by appending field values of the generated respective output fields to the data records to be spliced in the main table. In other words, the field values of all fields of the data records to be spliced in the main table may be directly taken as the field values of the output fields in the output data table, and the field values of the respective output fields generated for the at least one splicing table may be attached (e.g., attached on the right side).
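Appending generated output fields to the main table's records can be sketched as follows. This Python example is illustrative only: the function `splice`, the sample tables, and the output names are hypothetical, and keeping main-table records that have no match (with `None` values) is an assumption the text does not confirm.

```python
def splice(main_table, splice_table, main_key, splice_key, outputs):
    """Append aggregated splicing-table fields to each main-table record.

    outputs: list of (output_name, source_field, aggregation_callable).
    """
    index = {}
    for rec in splice_table:
        index.setdefault(rec[splice_key], []).append(rec)

    result = []
    for main_rec in main_table:
        out = dict(main_rec)  # all main-table fields are carried over as-is
        matches = index.get(main_rec[main_key], [])
        for out_name, source_field, op in outputs:
            values = [m[source_field] for m in matches]
            # Unmatched main records get None (assumption of this sketch).
            out[out_name] = op(values) if values else None
        result.append(out)
    return result

users = [{"ID": 1, "name": "Ann"}, {"ID": 2, "name": "Bo"}]
orders = [{"UserID": 1, "amount": 30}, {"UserID": 1, "amount": 50}]

rows = splice(users, orders, "ID", "UserID",
              [("total_amount", "amount", sum), ("order_count", "amount", len)])
```

Because Python dictionaries preserve insertion order, the appended output fields follow the main table's own fields, which corresponds to attaching them on the right side.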
As an example, the source fields may by default further include at least one of the corresponding associated fields, and the position in the output data table of the output field whose source field is that associated field may be set according to the output field configuration operation of the user or to a preset position. For example, it may be preset that the output field corresponding to the associated field is located at the leftmost side of the output data table.
As an example, where the at least two data tables include a main table and at least one splicing table, the default source fields may include the corresponding associated field in the main table and exclude the corresponding associated field in the splicing table.
As another example, the default source fields may include the corresponding associated fields that have different names in the at least two data tables, i.e., from the corresponding associated fields of the data tables, only those with distinct names are selected as default source fields.
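The selection of default associated source fields by distinct name can be sketched as below. The function name and the field name `user_id` are hypothetical. The sketch keeps the first occurrence of each name in table order, which is an assumption that happens to coincide with keeping only the main table's field when both tables use the same name.

```python
def default_association_sources(assoc_fields):
    """Keep one associated field per distinct field name, in table order.

    assoc_fields: list of (table_name, field_name), main table first.
    """
    seen = set()
    chosen = []
    for table, field in assoc_fields:
        if field not in seen:
            seen.add(field)
            chosen.append((table, field))
    return chosen


# Same name in both tables: only the main table's field becomes a default source.
same = default_association_sources([("main_table", "ID"), ("splicing_table", "ID")])

# Different names: both associated fields become default source fields.
different = default_association_sources(
    [("main_table", "ID"), ("splicing_table", "user_id")]
)
print(same, different)
```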
As an example, the output data records in the output data table may be applied as a set of training samples to a machine learning algorithm or other algorithm for data mining. Therefore, the method for splicing the data records according to the exemplary embodiment of the invention can facilitate the user to splice the data records in different data tables according to the requirements before the machine learning, so as to obtain the data records with more complex and comprehensive information for the machine learning.
Further, as an example, the method of splicing data records according to the exemplary embodiments of the present invention shown in figs. 1 and 2 may further include: displaying to the user an interface for splicing data records, so that the user performs the data table specifying operation, the associated field specifying operation, and the output field configuration operation through the interface. As an example, the interface for splicing data records may be a graphical user interface, which may include a text editing interface for manual editing by the user and/or a selection input interface that displays candidates for manual selection by the user. As an example, the text editing interface and the selection input interface may be switched in response to an interface switching operation input by the user, and the settings made in the interface before switching may be displayed synchronously in the interface after switching. The method of splicing data records according to an exemplary embodiment of the present invention thus lowers the barrier to use by replacing programming with an interactive interface that is easy for the user to understand and operate.
An example of a user performing a data table specifying operation, an associated field specifying operation, and an output field configuring operation through a graphical user interface according to an embodiment of the present invention is described below with reference to fig. 3 to 6. It should be noted that the graphical user interface is presented here as an example only, and any other form of input interface may be employed with the present invention.
An exemplary embodiment of the present invention is described below with reference to figs. 3-4 and tables 1-3. Fig. 3 shows an example of a graphical user interface for specifying data tables and associated fields. Through this graphical user interface, the user can input the number of data tables to be subjected to data record splicing and specify data table 1 and data table 2 as the tables to be spliced. The user can then designate the "ID" field in data table 1 and the "ID" field in data table 2 as the corresponding associated fields. After these settings are completed, the user may enter the graphical user interface for configuring output fields shown in fig. 4 for subsequent settings.
Table 1: data table 1
ID | Name | Age | Job |
---|---|---|---|
1 | Zhang | 30 | blue-collar |
2 | Wang | 27 | technician |
3 | Li | 40 | management |
4 | Zhao | 24 | services |
Table 2: data table 2
As shown in fig. 4, the "candidate field name" area on the left of the graphical user interface may display all candidate fields of the data tables to be spliced (i.e., all fields in data tables 1 and 2) for the user to select source fields from; the "processing mode" area in the middle may display the processing modes available for a source field; and the "output field configuration" area on the right may display the configuration of each output field. For example, the fields selected by the user in turn from the "candidate field name" area may be displayed as source fields in the configuration area; alternatively, all candidate fields may be displayed in the configuration area, and the user may then delete those that are not source fields. The user may specify a processing mode for each source field displayed in the configuration area (for example, specify the direct extraction mode for the source fields "ID", "Name", "Age", and "Job" in data table 1, and the aggregation processing mode "sum" for the source field "Income" in data table 2), and may also specify the name of the output field corresponding to each source field. In addition, the user can adjust the order of the rows in the configuration area, and the corresponding output fields are arranged in the output data table in that order.
After the configuration is completed according to the above user operations, the output field generation step and the output data record generation step can be executed. For example, consider the data records to be spliced whose corresponding associated field "ID" has the same field value "1" in both data table 1 and data table 2, i.e., the first data record in data table 1 and the first and second data records in data table 2. The field values of the configured source fields are processed according to the configured processing modes: the field values of the source fields "ID", "Name", "Age", and "Job" are taken directly as the field values of the corresponding output fields, and the field values of the source field "Income" are summed (3000 + 4000) to obtain the field value "7000" of the corresponding output field. This yields the first output data record in output table 1. It can be seen that, according to an exemplary embodiment of the present invention, one data record in data table 1 is spliced with a plurality of data records in data table 2.
Table 3: output table 1
ID | Name | Age | Job | Income |
---|---|---|---|---|
1 | Zhang | 30 | blue-collar | 7000 |
2 | Wang | 27 | technician | 11000 |
3 | Li | 40 | management | 6000 |
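The direct extraction and "sum" aggregation that produce the first record of output table 1 can be sketched in plain Python. This is an illustrative sketch only: the function name `splice_with_aggregation` is hypothetical, skipping records that have no match in data table 2 is an assumption, and data table 2 is represented solely by the two ID 1 records quoted in the text, because its remaining rows are not reproduced here.

```python
from collections import defaultdict


def splice_with_aggregation(table_1, table_2, key, direct_fields, agg_field, agg):
    """Splice table_2 into table_1 on `key`, aggregating `agg_field` with `agg`."""
    # Group the source-field values of table_2 by the associated field value.
    groups = defaultdict(list)
    for record in table_2:
        groups[record[key]].append(record[agg_field])

    output = []
    for record in table_1:
        values = groups.get(record[key])
        if not values:
            continue  # assumption: records without a match are not output
        row = {field: record[field] for field in direct_fields}  # direct extraction
        row[agg_field] = agg(values)                              # aggregation
        output.append(row)
    return output


data_table_1 = [
    {"ID": 1, "Name": "Zhang", "Age": 30, "Job": "blue-collar"},
    {"ID": 2, "Name": "Wang", "Age": 27, "Job": "technician"},
    {"ID": 3, "Name": "Li", "Age": 40, "Job": "management"},
    {"ID": 4, "Name": "Zhao", "Age": 24, "Job": "services"},
]
# Only the two records with ID 1 that the description quotes.
data_table_2 = [
    {"ID": 1, "Income": 3000},
    {"ID": 1, "Income": 4000},
]

output_table = splice_with_aggregation(
    data_table_1, data_table_2, "ID", ["ID", "Name", "Age", "Job"], "Income", sum
)
print(output_table)
```

With only these two splicing records, the sketch yields the single row for ID 1 with an Income of 7000.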
Another exemplary embodiment of the present invention is described below with reference to figs. 5-6 and tables 4-6. As shown in fig. 5, the user may input the number of splicing tables through a graphical user interface and specify the main table and the splicing table to be subjected to data record splicing. The user can then designate the "ID" field in the main table and the "ID" field in the splicing table as the corresponding associated fields. After these settings are completed, the user may enter the graphical user interface for configuring output fields shown in fig. 6 for subsequent settings.
Table 4: main table
ID | Name | Age | Job | Date |
---|---|---|---|---|
1 | Zhang | 30 | blue-collar | 2016.04.25 |
2 | Wang | 27 | technician | 2016.03.15 |
3 | Li | 40 | management | 2016.05.17 |
4 | Zhao | 24 | services | 2016.05.09 |
Table 5: splicing table
ID | Income | Date |
---|---|---|
1 | 3000 | 2016.02.20 |
1 | 4000 | 2016.03.15 |
1 | 5000 | 2016.05.17 |
1 | 6000 | 2016.05.20 |
2 | 4000 | 2016.03.15 |
3 | 5000 | 2016.05.17 |
As shown in fig. 6, the user configures the processing mode of the source field "Income" in the splicing table as the time-series aggregation processing mode, the base cursor field as the "Date" field in the main table, the splicing cursor field as the "Date" field in the splicing table, the aggregation range as 30 days backward, i.e., the 30 days after the field value of the base cursor field (+30D), and the aggregation operation as "AVE" (averaging).
After the configuration is completed according to the above user operations, the output field generation step and the output data record generation step may be performed. For example, the data records to be spliced whose corresponding associated field "ID" has the same field value "1" in the main table and the splicing table are the first data record in the main table and the first to fourth data records in the splicing table. From the first to fourth data records in the splicing table, the data records to be spliced that fall within the time-series range are then determined, i.e., those whose splicing cursor field value lies within 30 days after the field value "2016.04.25" of the base cursor field of the first data record in the main table (2016.04.25 to 2016.05.25). These are the third and fourth data records in the splicing table. The field values of the configured source field "Income" of the third and fourth data records are then processed according to the configured aggregation operation (AVE): the field values "5000" and "6000" are averaged to obtain the field value "5500" of the corresponding output field. The generated field value of the output field is appended to the first data record in the main table to generate the first output data record in output table 2.
Table 6: output table 2
ID | Name | Age | Job | Income | Date |
---|---|---|---|---|---|
1 | Zhang | 30 | blue-collar | 5500 | 2016.04.25 |
2 | Wang | 27 | technician | 4000 | 2016.03.15 |
3 | Li | 40 | management | 5000 | 2016.05.17 |
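The time-series aggregation that yields output table 2 can be reproduced with a short Python sketch over the main table and splicing table above. The function and parameter names are hypothetical, the window is treated as inclusive at both ends (an assumption consistent with the ID 2 record, whose splicing date equals its base date), and main-table records with no splicing record in range are skipped, as in output table 2. The sketch appends "Income" as the rightmost column, so the column order differs from the table as printed.

```python
from datetime import date, timedelta


def parse_date(text):
    """Parse a 'YYYY.MM.DD' string as used in the tables."""
    year, month, day = map(int, text.split("."))
    return date(year, month, day)


def average(values):
    return sum(values) / len(values)


def time_series_splice(main, splicing, key, base_cursor, splice_cursor,
                       source_field, window_days, agg):
    """For each main record, aggregate splicing records inside the time window."""
    output = []
    for main_record in main:
        start = parse_date(main_record[base_cursor])
        end = start + timedelta(days=window_days)  # +30D: window after base date
        values = [
            rec[source_field]
            for rec in splicing
            if rec[key] == main_record[key]
            and start <= parse_date(rec[splice_cursor]) <= end
        ]
        if not values:
            continue  # no splicing record in range, so no output record
        row = dict(main_record)
        row[source_field] = agg(values)  # appended after the main-table fields
        output.append(row)
    return output


main_table = [
    {"ID": 1, "Name": "Zhang", "Age": 30, "Job": "blue-collar", "Date": "2016.04.25"},
    {"ID": 2, "Name": "Wang", "Age": 27, "Job": "technician", "Date": "2016.03.15"},
    {"ID": 3, "Name": "Li", "Age": 40, "Job": "management", "Date": "2016.05.17"},
    {"ID": 4, "Name": "Zhao", "Age": 24, "Job": "services", "Date": "2016.05.09"},
]
splicing_table = [
    {"ID": 1, "Income": 3000, "Date": "2016.02.20"},
    {"ID": 1, "Income": 4000, "Date": "2016.03.15"},
    {"ID": 1, "Income": 5000, "Date": "2016.05.17"},
    {"ID": 1, "Income": 6000, "Date": "2016.05.20"},
    {"ID": 2, "Income": 4000, "Date": "2016.03.15"},
    {"ID": 3, "Income": 5000, "Date": "2016.05.17"},
]

result = time_series_splice(
    main_table, splicing_table, "ID", "Date", "Date", "Income", 30, average
)
for row in result:
    print(row["ID"], row["Income"])  # 1 5500.0 / 2 4000.0 / 3 5000.0
```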
A computer-readable storage medium according to an exemplary embodiment of the invention stores a computer program, wherein the computer program is configurable to cause a processor of a computer to perform the method of splicing data records of any of the above-described exemplary embodiments.
Fig. 7 and 8 illustrate block diagrams of an apparatus for splicing data records according to an exemplary embodiment of the present invention.
As shown in fig. 7, an apparatus for splicing data records according to an exemplary embodiment of the present invention includes: a data table specifying unit 10, an associated field specifying unit 20, an output field configuring unit 30, and an output field generating unit 40.
Specifically, the data table specifying unit 10 is configured to specify at least two data tables to be subjected to data record splicing according to a data table specifying operation of a user, wherein one row of the data table corresponds to one data record, and one column of the data table corresponds to one field.
The associated field specifying unit 20 is configured to specify corresponding associated fields among fields of the respective data tables, respectively, according to an associated field specifying operation of a user.
The output field configuration unit 30 is configured to configure a source field of an output field and a processing manner for the source field according to an output field configuration operation of a user, where the output field is a field of an output data record obtained as the result of data record splicing, and the source field is the field in a data table on which the output field is based.
As an example, the output field configuration unit 30 may also configure the name of the output field according to the output field configuration operation of the user.
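One way to picture what the output field configuration unit records for each output field is a small data structure. The sketch below is purely illustrative: the class names, attribute names, and mode strings are hypothetical and are not defined in the description. The example instance mirrors the time-series configuration of fig. 6.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimeSeriesOptions:
    base_cursor: str    # cursor field in the main table
    splice_cursor: str  # cursor field in the splicing table
    window: str         # aggregation range, e.g. "+30D"


@dataclass
class OutputFieldConfig:
    source_table: str
    source_field: str
    mode: str                          # "direct", "aggregate" or "time_series"
    output_name: Optional[str] = None  # user-chosen name, if any
    operation: Optional[str] = None    # e.g. "sum" or "AVE"
    time_series: Optional[TimeSeriesOptions] = None

    def resolved_name(self):
        """Assumption: fall back to the source field name when no name is set."""
        return self.output_name or self.source_field


income = OutputFieldConfig(
    source_table="splicing_table",
    source_field="Income",
    mode="time_series",
    operation="AVE",
    time_series=TimeSeriesOptions(base_cursor="Date", splice_cursor="Date", window="+30D"),
)
print(income.resolved_name(), income.time_series.window)
```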
The output field generating unit 40 is configured to, for data records to be spliced in each data table, which have the same field value in the corresponding associated field, process the field value of the configured source field in the configured processing manner to generate the field value of the output field.
As an example, the processing manner may include a direct extraction manner and/or an aggregation processing manner, where the output field generating unit 40 may directly use a field value of a source field of a single data record to be spliced in the data table as a field value of an output field in the direct extraction manner; the output field generating unit 40 may perform an aggregation operation on a field value of a source field of at least one of the plurality of data records to be spliced in the data table in an aggregation processing manner to obtain a field value of an output field.
As an example, the aggregation processing manner may include a direct aggregation processing manner, where the output field generating unit 40 may perform an aggregation operation on field values of source fields of a plurality of data records to be spliced in the data table in the direct aggregation processing manner to serve as field values of the output fields.
As an example, the at least two data tables include a main table and at least one splicing table, and the aggregation processing manner may include a time-series aggregation processing manner. When configuring the time-series aggregation processing manner, the output field configuration unit 30 may configure the base cursor field, the splicing cursor field, the aggregation range, and the aggregation operation according to the output field configuration operation of the user. In the time-series aggregation processing manner, the output field generating unit 40 may perform the aggregation operation on the field values of the source field of those data records to be spliced in the splicing table that fall within the time-series range, and use the result as the field value of the output field. Here, the data records to be spliced that fall within the time-series range are those whose splicing cursor field value lies within the range obtained by extending the aggregation range forward and/or backward from the base cursor field value of the data record to be spliced in the main table.
As an example, the aggregation operation may include at least one of: summing, averaging, taking the maximum value, taking the minimum value, and counting.
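The listed aggregation operations map naturally onto a lookup table of functions. The sketch below is only an illustration: the dictionary name and its keys are hypothetical labels, and the sample values are chosen for demonstration.

```python
# Hypothetical labels for the five aggregation operations listed above.
AGGREGATIONS = {
    "sum": sum,
    "ave": lambda values: sum(values) / len(values),
    "max": max,
    "min": min,
    "count": len,
}

values = [3000, 4000, 5000, 6000]  # sample source-field values
results = {name: fn(values) for name, fn in AGGREGATIONS.items()}
print(results)
```

Keeping the operations in one table means a configured operation name can be resolved to a function at the point where the output field value is generated.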
As shown in fig. 8, the apparatus for splicing data records according to another exemplary embodiment of the present invention may further include an output data record generating unit 50 in addition to the data table specifying unit 10, the associated field specifying unit 20, the output field configuring unit 30, and the output field generating unit 40 shown in fig. 7. The data table specifying unit 10, the associated field specifying unit 20, the output field configuring unit 30, and the output field generating unit 40 may be implemented by referring to the specific implementation described in fig. 7, and are not described herein again.
The output data record generating unit 50 is configured to generate output data records in the output data table based on the generated field values of the respective output fields.
As an example, the arrangement order of the respective output fields in the output data table may be set according to the output field configuration operation of the user; alternatively, the arrangement order of the output fields in the output data table may be set according to the arrangement order of the at least two data tables and the arrangement order of the source fields of the output fields in the data tables.
As an example, the at least two data tables include a main table and at least one splicing table, wherein the output field configuration unit 30 may perform an output field configuration operation only for the at least one splicing table, and the output data record generation unit 50 generates the output data records in the output data table by appending the generated field values of the respective output fields to the data records to be spliced in the main table.
As an example, the source fields may by default further include at least one of the corresponding associated fields, and the position in the output data table of the output field whose source field is that associated field may be set according to the output field configuration operation of the user or to a preset position.
It should be understood that specific implementations of the apparatus for splicing data records according to the exemplary embodiment of the present invention may be implemented with reference to the related specific implementations described in conjunction with fig. 1 to 6, and will not be described herein again.
Furthermore, it should be understood that the respective units in the apparatus for splicing data records according to an exemplary embodiment of the present invention may be implemented as hardware components and/or software components. The individual units may be implemented, for example, using Field Programmable Gate Arrays (FPGAs) or Application Specific Integrated Circuits (ASICs), depending on the processing performed by the individual units as defined by the skilled person.
The method and device for splicing data records according to the exemplary embodiments of the present invention provide a data record splicing process that is more efficient, more flexible, and applicable to more diverse usage scenarios: the user can complete the splicing merely by specifying the data tables, setting the association condition for splicing, and configuring the output as needed. Furthermore, data records in different data tables can be spliced indirectly, through an operation on their field values, according to the user's requirements, and in particular splicing that involves time series can be performed. It should be noted that the exemplary embodiments of the present invention, although applicable to a machine learning platform, are not limited thereto; that is, they can be applied to any system or technical scheme that requires splicing of data records.
Further, the method of splicing data records according to an exemplary embodiment of the present invention may be implemented as computer code in a computer-readable recording medium. The computer code can be implemented by those skilled in the art from the description of the method above. The computer code when executed in a computer implements the above-described method of the present invention.
Although a few exemplary embodiments of the present invention have been shown and described, it would be appreciated by those skilled in the art that changes may be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the claims and their equivalents.
Claims (10)
1. A method of splicing data records, comprising:
a data table specifying step, in which at least two data tables to be subjected to data record splicing are specified according to data table specifying operation of a user, wherein one row of the data table corresponds to one data record, and one column of the data table corresponds to one field;
an association field specifying step, in which corresponding association fields are respectively specified in fields of each data table according to association field specifying operation of a user;
an output field configuration step, in which a source field of an output field and a processing mode for the source field are configured according to an output field configuration operation of the user, wherein the output field is a field of an output data record serving as a data record splicing result, and the source field is the field in a data table on which the output field is based; and
an output field generating step, in which, for data records to be spliced whose corresponding associated fields in the respective data tables have the same field value, the field value of the configured source field is processed according to the configured processing mode to generate the field value of the output field.
2. The method of claim 1, further comprising:
and an output data record generation step of generating output data records in the output data table based on the field values of the generated output fields.
3. The method of claim 2, wherein,
the arrangement sequence of each output field in the output data table is set according to the output field configuration operation of a user; or,
the arrangement sequence of the output fields in the output data table is set according to the arrangement sequence of the at least two data tables and the arrangement sequence of the source fields of the output fields in the data tables.
4. The method of claim 2, wherein the at least two data tables include a main table and at least one splicing table,
wherein the output field configuration step is performed only for the at least one splicing table, and in the output data record generation step, the output data records in the output data table are generated by appending the field values of the generated output fields to the data records to be spliced in the main table.
5. The method according to claim 1, wherein the processing mode includes a direct extraction mode and/or an aggregation processing mode, wherein in the direct extraction mode, a field value of a source field of a single data record to be spliced in the data table is directly used as a field value of an output field; and under the aggregation processing mode, performing aggregation operation on the field value of the source field of at least one of the data records to be spliced in the data table to be used as the field value of the output field.
6. The method of claim 5, wherein the aggregation processing mode comprises a direct aggregation processing mode,
in the direct aggregation processing mode, the field values of the source fields of a plurality of data records to be spliced in the data table are aggregated to be used as the field values of the output fields.
7. The method of claim 5, wherein the at least two data tables include a main table and at least one splicing table, and the aggregation processing mode includes a time-series aggregation processing mode,
wherein, when the time-series aggregation processing mode is configured, a base cursor field, a splicing cursor field, an aggregation range, and an aggregation operation are configured according to the output field configuration operation of the user, and in the time-series aggregation processing mode, the aggregation operation is performed on the field values of the source field of those data records to be spliced in the splicing table that fall within the time-series range, and the result is used as the field value of the output field, wherein the data records to be spliced that fall within the time-series range are the data records to be spliced whose splicing cursor field value lies within the range determined by extending the aggregation range forward and/or backward from the base cursor field value of the data record to be spliced in the main table.
8. The method according to any of claims 5-7, wherein the aggregation operation comprises at least one of: summing, averaging, taking the maximum value, taking the minimum value, and counting.
9. A computer-readable storage medium storing a computer program, wherein the computer program is configured to cause a processor of a computer to perform the steps of:
a data table specifying step, in which at least two data tables to be subjected to data record splicing are specified according to data table specifying operation of a user, wherein one row of the data table corresponds to one data record, and one column of the data table corresponds to one field;
an association field specifying step, in which corresponding association fields are respectively specified in fields of each data table according to association field specifying operation of a user;
an output field configuration step, in which a source field of an output field and a processing mode for the source field are configured according to an output field configuration operation of the user, wherein the output field is a field of an output data record serving as a data record splicing result, and the source field is the field in a data table on which the output field is based; and
an output field generating step, in which, for data records to be spliced whose corresponding associated fields in the respective data tables have the same field value, the field value of the configured source field is processed according to the configured processing mode to generate the field value of the output field.
10. An apparatus for splicing data records, comprising:
a data table specifying unit configured to specify at least two data tables to be subjected to data record splicing according to a data table specifying operation of a user, wherein one row of a data table corresponds to one data record, and one column of a data table corresponds to one field;
an associated field specifying unit configured to specify corresponding associated fields among the fields of the respective data tables according to an associated field specifying operation of the user;
an output field configuration unit configured to configure a source field of an output field and a processing mode for the source field according to an output field configuration operation of the user, wherein the output field is a field of an output data record serving as a data record splicing result, and the source field is the field in a data table on which the output field is based; and
an output field generating unit configured to, for data records to be spliced whose corresponding associated fields in the respective data tables have the same field value, process the field value of the configured source field according to the configured processing mode to generate the field value of the output field.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110564742.7A CN113220688A (en) | 2017-07-04 | 2017-07-04 | Method and device for splicing data records |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110564742.7A CN113220688A (en) | 2017-07-04 | 2017-07-04 | Method and device for splicing data records |
CN201710538681.0A CN107402978A (en) | 2017-07-04 | 2017-07-04 | Splice the method and device of data record |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710538681.0A Division CN107402978A (en) | 2017-07-04 | 2017-07-04 | Splice the method and device of data record |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113220688A (en) | 2021-08-06 |
Family
ID=60404862
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110564742.7A Pending CN113220688A (en) | 2017-07-04 | 2017-07-04 | Method and device for splicing data records |
CN201710538681.0A Pending CN107402978A (en) | 2017-07-04 | 2017-07-04 | Splice the method and device of data record |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710538681.0A Pending CN107402978A (en) | 2017-07-04 | 2017-07-04 | Splice the method and device of data record |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN113220688A (en) |
Also Published As
Publication number | Publication date |
---|---|
CN107402978A (en) | 2017-11-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||