CN111401007A

CN111401007A - Method for converting unstructured data into structured data

Info

Publication number: CN111401007A
Application number: CN202010140412.0A
Authority: CN
Inventors: 罗桂林
Original assignee: Xiamen Yi Lu Mdt Infotech Ltd
Current assignee: Xiamen Yi Lu Mdt Infotech Ltd
Priority date: 2020-03-03
Filing date: 2020-03-03
Publication date: 2020-07-10

Abstract

The invention discloses a method for converting unstructured data into structured data, which relates to the technical field of information processing and comprises the following steps: step 1, reading an electronic document and acquiring its text content according to the file format; step 2, obtaining the row and column coordinates at which the text content is marked; step 3, matching each marked coordinate, one by one, against the labels at the corresponding coordinates of the file templates stored in a historical database; step 4, correcting errors in the matched labels and modifying the labels at the corresponding coordinates; step 5, forming structured field columns, and the values corresponding to those field columns, from the labels and the text content at the corresponding coordinates; and step 6, completing the conversion from unordered unstructured text to structured fields. By structuring the labelled data, the invention generates a structured data file and thereby converts an unstructured data file into structured data, which improves the efficiency of preparing documents, maximizes the benefit of producing and disseminating information, and facilitates subsequent data analysis and processing.

Description

Method for converting unstructured data into structured data

Technical Field

The invention relates to the technical field of information processing, in particular to a method for converting unstructured data into structured data.

Background

Customs declaration refers to the process by which the consignees and consignors of import and export goods, the persons in charge of inbound and outbound means of transport, the owners of inbound and outbound articles, or their agents handle the entry and exit procedures for goods, articles, or means of transport and the related customs affairs, including declaring to customs, submitting documents and certificates, and accepting customs supervision and inspection. It is one of the necessary steps in completing customs entry and exit procedures. With the development of computer and network technology, customs in China has largely adopted a paperless electronic declaration mode for import and export goods, in which the declaration is made on the basis of the relevant data and comprises steps such as receiving documents, entering documents, examining documents, and inspecting goods. However, preparing the documents involves simple, repetitive manual work, so preparing the complete set of clearance data takes a long time and is prone to human error. The data are entered repeatedly during clearance with poor accuracy, which increases the workload of document examination and greatly reduces the timeliness of clearance.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a method for converting unstructured data into structured data, which adopts the following technical scheme. The method comprises the following steps:

step 1, reading an electronic document and acquiring its text content according to the file format;

step 2, obtaining the row and column coordinates at which the text content is marked;

step 3, matching each marked coordinate, one by one, against the labels at the corresponding coordinates of the file templates stored in a historical database;

step 4, correcting errors in the matched labels and modifying the labels at the corresponding coordinates;

step 5, forming structured field columns, and the values corresponding to those field columns, from the labels and the text content at the corresponding coordinates;

step 6, completing the conversion from unordered unstructured text to structured fields.
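Steps 5 and 6 may be illustrated by the following minimal Python sketch. It is not the implementation of the invention: the coordinate representation, the function name build_record, the label names, and the sample values are all hypothetical, and the sketch only assumes that each marked coordinate carries at most one label and one piece of text content.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int]  # (row, column) of a marked text item; hypothetical representation


def build_record(labels: Dict[Coord, str], contents: Dict[Coord, str]) -> Dict[str, str]:
    """Pair the label at each coordinate with the text found at the same coordinate.

    The label becomes the structured field column and the text becomes its value.
    Coordinates that lack either a label or text content are skipped.
    """
    record: Dict[str, str] = {}
    for coord, label in labels.items():
        if coord in contents:
            record[label] = contents[coord]
    return record


# Hypothetical labels and text contents; (2, 0) has text but no label.
labels = {(0, 1): "consignee", (1, 1): "gross_weight"}
contents = {(0, 1): "ACME Trading Co.", (1, 1): "1250 KG", (2, 0): "unlabelled note"}

record = build_record(labels, contents)
print(record)
```

In this sketch the resulting dictionary plays the role of one structured row, with labels as field columns; text without a label is left out rather than guessed at.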

Preferably, the electronic document includes a text document, a data document, a Word document, an Excel document, a PDF document, and a picture document.

Preferably, in step 3, if no label in the historical database is matched, the historical database is updated and an updated document template is added.

Preferably, in step 3, errors in the labels matched in the historical database are corrected, and the labels at the corresponding coordinates are modified.

The method for converting unstructured data into structured data has the following advantages: by structuring the labelled data, a structured data file is generated and the conversion from an unstructured data file to structured data is realized, which improves the efficiency of preparing documents, maximizes the benefit of producing and disseminating information, and facilitates the analysis and processing of subsequent data.

Detailed Description

The following further describes the embodiments of the present invention. It should be noted that the description of the embodiments is provided to help understanding of the present invention, but the present invention is not limited thereto. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

At present, most of the documents that enterprises provide for customs clearance are unstructured data. Unstructured data is data other than structured data: its structure is not fixed, it cannot be stored in a relational database, and it can only be stored as files of various kinds, such as documents, text files, pictures, PDF (Portable Document Format) files, and other image formats. At present, unstructured data can only be processed manually and cannot be processed quickly and efficiently in a systematic way.

Therefore, it is necessary to convert unstructured data into structured data. Structured data is data that has a definite structure, can be divided into fixed basic components, and can be represented by one or more two-dimensional tables. It is typically stored in a database with a defined logical structure. Structured data is convenient for statistics, simple to operate on, and easy to maintain.
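The convenience of structured data for statistics may be illustrated by the following minimal sketch, which uses the sqlite3 module of the Python standard library. The table name, column names, and sample records are hypothetical and are not taken from the invention.

```python
import sqlite3
from typing import List, Tuple


def summarize_by_consignee(records: List[Tuple[str, float]]) -> List[Tuple[str, int, float]]:
    """Store records in a two-dimensional table, then count and total them per consignee."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute("CREATE TABLE declaration (consignee TEXT, gross_weight_kg REAL)")
        conn.executemany("INSERT INTO declaration VALUES (?, ?)", records)
        return conn.execute(
            "SELECT consignee, COUNT(*), SUM(gross_weight_kg) "
            "FROM declaration GROUP BY consignee ORDER BY consignee"
        ).fetchall()
    finally:
        conn.close()


# Hypothetical structured records: (consignee, gross weight in kilograms).
rows = summarize_by_consignee(
    [("ACME Trading Co.", 1250.0), ("ACME Trading Co.", 300.0), ("Beta Ltd", 80.5)]
)
print(rows)
```

Once the data has fixed columns, a single query yields per-consignee counts and totals, which is not possible while the same information is scattered across free-form files.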

This embodiment provides a method for converting unstructured data into structured data, which adopts the following technical scheme. The method comprises the following steps:

step 1, reading an electronic document, wherein the electronic document comprises a text document, a data document, a Word document, an Excel document, a PDF document, and a picture document, and acquiring its text content according to the file format;

step 2, obtaining the row and column coordinates at which the text content is marked;

step 3, matching each marked coordinate, one by one, against the labels at the corresponding coordinates of the file templates stored in a historical database; if no label in the historical database is matched, updating the historical database and adding an updated document template; correcting errors in the labels matched in the historical database and modifying the labels at the corresponding coordinates;
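Step 3 may be illustrated by the following minimal Python sketch. The text does not specify the matching rule or the database layout, so the sketch makes loud assumptions: the historical database is a plain list of templates, each template maps a coordinate to a label, and a template matches only if it has a label for every marked coordinate. All function names and sample labels are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Coord = Tuple[int, int]      # (row, column); hypothetical representation
Template = Dict[Coord, str]  # coordinate -> label


def match_labels(marked: List[Coord], history: List[Template]) -> Optional[Template]:
    """Return labels from the first stored template covering every marked coordinate."""
    for template in history:
        if all(coord in template for coord in marked):
            return {coord: template[coord] for coord in marked}
    return None


def match_or_register(
    marked: List[Coord], history: List[Template], new_template: Template
) -> Optional[Template]:
    """Match against the history; if nothing matches, add the new template and retry."""
    matched = match_labels(marked, history)
    if matched is None:
        history.append(new_template)  # update the historical database
        matched = match_labels(marked, history)
    return matched


def correct_label(labels: Template, coord: Coord, label: str) -> None:
    """Overwrite a wrongly matched label at one coordinate."""
    labels[coord] = label


history = [{(0, 0): "shipper", (1, 0): "consignee"}]
marked = [(0, 0), (2, 0)]  # (2, 0) is not covered by the stored template
labels = match_or_register(marked, history, {(0, 0): "shipper", (2, 0): "invoice_no"})
correct_label(labels, (2, 0), "contract_no")  # manual error correction
print(labels, len(history))
```

Other matching rules, such as tolerating small coordinate offsets or choosing the template with the most overlapping coordinates, would fit the same structure; exact coverage is used here only because it is the simplest rule to state.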

In step 1, text documents, data documents, Word documents, Excel documents, PDF documents, picture documents, and the like do not share a common format; each can be parsed and converted into a text document according to the characteristics of its file type.
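Selecting a reader according to the file format may be sketched as follows. This is an illustrative assumption rather than the implementation of the invention: the extension table, the function names, and the use of the file extension to identify the format are hypothetical, and only the plain-text reader is implemented. The Word, Excel, PDF, and picture readers are deliberate placeholders, because they depend on external APIs or OCR engines that the text does not name.

```python
from pathlib import Path
from typing import Callable, Dict

Reader = Callable[[Path], str]


def read_plain_text(path: Path) -> str:
    """Read a text document directly; UTF-8 encoding is an assumption."""
    return path.read_text(encoding="utf-8")


def placeholder(kind: str) -> Reader:
    """Build a stand-in reader for formats that need an external API or OCR."""
    def reader(path: Path) -> str:
        raise NotImplementedError(f"{kind} reader is outside this sketch: {path.name}")
    return reader


# Hypothetical mapping from file extension to reader.
READERS: Dict[str, Reader] = {
    ".txt": read_plain_text,
    ".docx": placeholder("Word"),
    ".xlsx": placeholder("Excel"),
    ".pdf": placeholder("PDF"),
    ".png": placeholder("picture (OCR)"),
}


def extract_text(path: Path) -> str:
    """Dispatch to the reader registered for the file's extension."""
    suffix = path.suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"unsupported file format: {suffix}")
    return READERS[suffix](path)
```

A call such as extract_text(Path("declaration.txt")) returns the file's text, while an unregistered extension raises ValueError, so new formats are added by registering one more reader rather than by changing the dispatch logic.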

For example, reading the contents of a Word document requires a Word API; the converted data is then written out according to the syntax requirements of the program, which completes the conversion from the Word document to a text document. A picture file is stored in binary form, and the text content in the picture needs to be extracted by OCR (optical character recognition).

An Excel document is handled through an Excel API, which can read the contents of the file. The method uses the Excel API to read the contents and formats of all cells in the Excel document, which completes the conversion from the Excel document to a text document.

A text file is a computer file made up of lines of characters. Because of their simple structure, text files are widely used for recording information. Text files come in many different formats; commonly used ones include ASCII, MIME, and TEXT. Files in these formats can be read and written as data streams. The system reads the text file through a file stream and extracts the file structure and file content, which completes the reading of the electronic document data.
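Reading a text file through a stream may be sketched as follows. The function name read_lines and the sample content are hypothetical; io.StringIO from the Python standard library stands in for an opened file so that the sketch needs no file on disk.

```python
import io
from typing import List, TextIO


def read_lines(stream: TextIO) -> List[str]:
    """Read a text stream line by line, keeping line order and dropping line endings."""
    return [line.rstrip("\r\n") for line in stream]


# io.StringIO stands in for a file opened with open(path, encoding="utf-8").
sample = io.StringIO("Consignee: ACME Trading Co.\nGross weight: 1250 KG\n")
lines = read_lines(sample)
print(lines)
```

The list of lines preserves the row structure of the file, which is what the later coordinate step relies on.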

In steps 2 and 3, the content of the document is read row by row. During parsing, each row is read sequentially, character by character, and the coordinate value of each character is obtained. During the structural conversion, the integrity constraints of the table mapping ensure that the document is completely preserved and that the structural mapping and semantic mapping between the document and the historical database are complete and valid.
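The character-by-character reading may be illustrated by the following minimal Python sketch. The zero-based (row, column) convention and the function name character_coordinates are assumptions made for illustration; the text does not state how coordinates are numbered.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]  # zero-based (row, column); the numbering is an assumption


def character_coordinates(lines: List[str]) -> Dict[Coord, str]:
    """Read each row in order, character by character, and record every coordinate."""
    coords: Dict[Coord, str] = {}
    for row, line in enumerate(lines):
        for col, char in enumerate(line):
            coords[(row, col)] = char
    return coords


coords = character_coordinates(["AB", "C"])
print(coords)  # {(0, 0): 'A', (0, 1): 'B', (1, 0): 'C'}
```

Because every character keeps its own coordinate, the original rows can be rebuilt from the mapping, which is one simple way to check that nothing was lost during conversion.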

The embodiments of the present invention have been described in detail, but the present invention is not limited to the described embodiments. Those skilled in the art may make various changes, modifications, substitutions, and alterations to these embodiments without departing from the principles and spirit of the invention, and such variations still fall within the scope of protection of the invention.

Claims

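1. A method for converting unstructured data to structured data, comprising the steps of:

step 1, reading an electronic document and acquiring its text content according to the file format;

step 2, obtaining the row and column coordinates at which the text content is marked;

step 3, matching each marked coordinate, one by one, against the labels at the corresponding coordinates of the file templates stored in a historical database;

step 4, correcting errors in the matched labels and modifying the labels at the corresponding coordinates;

step 5, forming structured field columns, and the values corresponding to those field columns, from the labels and the text content at the corresponding coordinates;

step 6, completing the conversion from unordered unstructured text to structured fields.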

2. The method according to claim 1, wherein the electronic document comprises a text document, a data document, a Word document, an Excel document, a PDF document, or a picture document.

3. The method according to claim 1, wherein in step 3, if no label in the historical database is matched, the historical database is updated and an updated document template is added.

4. The method according to claim 3, wherein in step 3, errors in the labels matched in the historical database are corrected, and the labels at the corresponding coordinates are modified.