CN111401007A - Method for converting unstructured data into structured data - Google Patents

Info

Publication number
CN111401007A
Authority
CN
China
Prior art keywords
document
data
coordinates
structured
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010140412.0A
Other languages
Chinese (zh)
Inventor
罗桂林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Yi Lu Mdt Infotech Ltd
Original Assignee
Xiamen Yi Lu Mdt Infotech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Yi Lu Mdt Infotech Ltd
Priority to CN202010140412.0A
Publication of CN111401007A
Current legal status: Pending

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a method for converting unstructured data into structured data, in the technical field of information processing, comprising the following steps: step 1, reading an electronic document and acquiring its text content according to the file format; step 2, obtaining the marked row and column coordinates of the text content; step 3, matching each marked coordinate, one by one, against the label at the corresponding coordinate of the file templates stored in a historical database; step 4, correcting the matched labels and modifying the labels corresponding to the coordinates; step 5, forming structured field columns and the values corresponding to those field columns from the labels and the text content at the coordinates; and step 6, completing the conversion from unordered unstructured text to structured fields. By structuring the label data, the invention generates a structured data file, thereby converting unstructured data files into structured data, improving the efficiency of compiling document files, maximizing the benefit of producing and disseminating information, and facilitating subsequent data analysis and processing.

Description

Method for converting unstructured data into structured data
Technical Field
The invention relates to the technical field of information processing, in particular to a method for converting unstructured data into structured data.
Background
Customs declaration refers to the process by which the consignees and consignors of imported and exported goods, the persons in charge of inbound and outbound means of transport, the owners of inbound and outbound articles, or their agents handle the entry and exit procedures for goods, articles, or means of transport, together with the related customs affairs, including declaring to customs, submitting documents and certificates, and accepting customs supervision and inspection. It is one of the necessary steps in completing customs entry and exit procedures. With the development of computer and network technology, customs in China has largely adopted paperless electronic declaration for import and export goods; that is, the declaration is made with the relevant data and covers steps such as receiving documents, entering documents, reviewing documents, and inspecting goods. However, preparing the documents is simple, repetitive manual work, so preparing the full set of clearance data takes a long time and is prone to human error. Data are entered repeatedly during clearance with poor accuracy, which increases the workload of document review and greatly reduces the timeliness of clearance.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a method for converting unstructured data into structured data, which adopts the following technical scheme. The method comprises the following steps:
step 1, reading an electronic document and acquiring its text content according to the file format;
step 2, obtaining the marked row and column coordinates of the text content;
step 3, matching each marked coordinate, one by one, against the label at the corresponding coordinate of the file templates stored in the historical database;
step 4, correcting the matched labels and modifying the labels corresponding to the coordinates;
step 5, forming structured field columns and the values corresponding to those field columns from the labels and the text content at the coordinates;
step 6, completing the conversion from unordered unstructured text to structured fields.
Preferably, the electronic document includes a text document, a data document, a Word document, an Excel document, a PDF document, and a picture document.
Preferably, in step 3, if no label in the historical database is matched, the historical database is updated and an updated document template is added.
Preferably, in step 3, error correction is performed on the labels matched in the historical database, and the labels at the corresponding coordinates are modified.
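The two preferred behaviours above (extending the template store when a coordinate has no label, and overwriting a label during error correction) can be sketched with an in-memory SQLite table standing in for the historical database. The table layout, the template name `invoice_v1`, and the label names are assumptions made for illustration; the patent does not specify a schema.

```python
import sqlite3

def open_store():
    """Create an in-memory stand-in for the 'historical database' (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE template_labels ("
        " template TEXT, row INTEGER, col INTEGER, label TEXT,"
        " PRIMARY KEY (template, row, col))"
    )
    return conn

def lookup_label(conn, template, row, col):
    """Return the stored label for a coordinate, or None if nothing matches."""
    cur = conn.execute(
        "SELECT label FROM template_labels WHERE template=? AND row=? AND col=?",
        (template, row, col),
    )
    hit = cur.fetchone()
    return hit[0] if hit else None

def save_label(conn, template, row, col, label):
    """Insert a new label, or overwrite the existing one for that coordinate."""
    conn.execute(
        "INSERT OR REPLACE INTO template_labels VALUES (?, ?, ?, ?)",
        (template, row, col, label),
    )
    conn.commit()

conn = open_store()
save_label(conn, "invoice_v1", 0, 1, "invoice_no")

# Unmatched coordinate: the stored template is extended.
if lookup_label(conn, "invoice_v1", 2, 1) is None:
    save_label(conn, "invoice_v1", 2, 1, "consignee")

# Error correction: the label at an existing coordinate is replaced.
save_label(conn, "invoice_v1", 0, 1, "invoice_number")
```

Keying the table on (template, row, col) means a correction and a first-time insert share one code path, which is one possible reading of "modifying the labels corresponding to the coordinates".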
The method for converting unstructured data into structured data has the following advantages: by structuring the label data, a structured data file is generated, which converts unstructured data files into structured data, improves the efficiency of compiling document files, maximizes the benefit of producing and disseminating information, and facilitates the analysis and processing of subsequent data.
Detailed Description
The following further describes the embodiments of the present invention. It should be noted that the description of the embodiments is provided to help understanding of the present invention, but the present invention is not limited thereto. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
At present, most of the documents that enterprises provide for customs clearance are unstructured data. Unstructured data are data other than structured data: their structure is not fixed, they cannot be stored in a relational database, and they can only be stored as files of various kinds, such as documents, text files, pictures, PDF (Portable Document Format) files, and image formats. Currently, unstructured data can only be processed manually and cannot be processed quickly and efficiently in a systematic way.
Therefore, it is necessary to convert unstructured data into structured data, that is, data that have a definite structure, can be divided into fixed basic components, and can be represented by one or more two-dimensional tables. Structured data are typically stored in a database with a defined logical structure. Structured data are convenient for statistics, simple to operate on, and easy to maintain.
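As a small illustration of "represented by a two-dimensional table", the sketch below writes two structured records as rows under a shared header. The field names and values are invented for the example and are not taken from the patent.

```python
import csv
import io

# Hypothetical structured records: one dict per document, one key per field column.
records = [
    {"invoice_no": "INV-001", "consignee": "ACME Ltd", "gross_weight_kg": "120"},
    {"invoice_no": "INV-002", "consignee": "Globex", "gross_weight_kg": "75"},
]

def to_table(rows):
    """Serialize records as a header row followed by one line per record."""
    fieldnames = list(rows[0].keys())
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

table_text = to_table(records)
```

Each key becomes a column and each record a row, which is the shape a relational table or spreadsheet expects.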
The embodiment provides a method for converting unstructured data into structured data, which adopts the following technical scheme. The method comprises the following steps:
step 1, reading an electronic document, wherein the electronic document comprises a text document, a data document, a Word document, an Excel document, a PDF document and a picture document, and acquiring its text content according to the file format;
step 2, obtaining the marked row and column coordinates of the text content;
step 3, matching each marked coordinate, one by one, against the label at the corresponding coordinate of the file templates stored in the historical database; if no label in the historical database is matched, updating the historical database and adding an updated document template; performing error correction on the labels matched in the historical database and modifying the labels at the corresponding coordinates;
step 4, correcting the matched labels and modifying the labels corresponding to the coordinates;
step 5, forming structured field columns and the values corresponding to those field columns from the labels and the text content at the coordinates;
step 6, completing the conversion from unordered unstructured text to structured fields.
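Steps 3 to 5 can be sketched as a single function over plain dictionaries. This is only one possible reading of the steps: the function name `structure`, the `unlabeled_r_c` placeholder for unmatched coordinates, and the sample template are assumptions, and the text content is taken to be already keyed by (row, column) as produced by step 2.

```python
from typing import Dict, Optional, Tuple

Coord = Tuple[int, int]  # (row, column)

def structure(cells: Dict[Coord, str],
              template: Dict[Coord, str],
              corrections: Optional[Dict[Coord, str]] = None) -> Dict[str, str]:
    """Turn coordinate-keyed text into a label-keyed record."""
    corrections = corrections or {}
    record: Dict[str, str] = {}
    for coord, text in cells.items():
        label = template.get(coord)              # step 3: match against the template
        if label is None:
            # Step 3, unmatched case: extend the template with a placeholder label.
            label = f"unlabeled_{coord[0]}_{coord[1]}"
            template[coord] = label
        if coord in corrections:                 # step 4: correct the matched label
            label = corrections[coord]
            template[coord] = label
        record[label] = text                     # step 5: field column and its value
    return record

cells = {(0, 1): "INV-001", (1, 1): "ACME Ltd", (2, 1): "120"}
template = {(0, 1): "invoice_no", (1, 1): "consigne"}   # second label has a typo
corrections = {(1, 1): "consignee"}

record = structure(cells, template, corrections)
```

Here the template is mutated in place so that later documents of the same layout benefit from the correction; a real system would persist that change to the historical database.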
In step 1, text documents, data documents, Word documents, Excel documents, PDF documents, picture documents, and so on do not share a common format; each can be parsed and converted into a text document according to the characteristics of its file type.
For example, a Word document requires a Word API to read the contents of the document; the converted data are then written out according to the syntax requirements of the program, completing the conversion from Word document to text document. A picture file is stored in binary form, and the text content in the picture must be extracted through OCR (optical character recognition).
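The patent does not name a specific Word API. As a stand-in, the sketch below relies on the fact that a .docx file is a ZIP archive whose main text lives in `word/document.xml`, and reads paragraph text with the Python standard library only. The function name `docx_to_lines` is hypothetical, and text boxes, headers, and footers are not handled. OCR for picture files is not shown because it depends on a third-party engine.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements of word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_to_lines(data: bytes) -> list:
    """Return one string per paragraph of a .docx file given as bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    lines = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join every <w:t> inside it.
        text = "".join(t.text or "" for t in para.iter(W_NS + "t"))
        lines.append(text)
    return lines
```

Typical use would be `docx_to_lines(open(path, "rb").read())`, after which the lines can be treated like any other text document.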
An Excel document requires an Excel API, which can read the content of the file. The method uses the Excel API to read the contents and formats of all cells in the Excel document, thereby completing the conversion from Excel document to text document.
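The patent does not say which Excel API is used. The sketch below is a standard-library stand-in for .xlsx files only: it reads cell values (not formats) keyed by zero-based (row, column). It assumes the first worksheet is stored at `xl/worksheets/sheet1.xml`, which is common but not guaranteed; the function names are hypothetical.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML namespace used by worksheet and shared-string parts.
S_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

def col_index(letters: str) -> int:
    """Convert a column name such as 'A' or 'AB' to a zero-based index."""
    n = 0
    for ch in letters:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n - 1

def xlsx_cells(data: bytes, sheet: str = "xl/worksheets/sheet1.xml") -> dict:
    """Return {(row, col): text} for one worksheet of an .xlsx file given as bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        shared = []
        if "xl/sharedStrings.xml" in zf.namelist():
            sst = ET.fromstring(zf.read("xl/sharedStrings.xml"))
            for si in sst.iter(S_NS + "si"):
                shared.append("".join(t.text or "" for t in si.iter(S_NS + "t")))
        root = ET.fromstring(zf.read(sheet))
    cells = {}
    for c in root.iter(S_NS + "c"):
        m = re.fullmatch(r"([A-Z]+)(\d+)", c.get("r", ""))
        if m is None:
            continue  # cell without an explicit reference; skipped in this sketch
        coord = (int(m.group(2)) - 1, col_index(m.group(1)))
        v = c.find(S_NS + "v")
        if c.get("t") == "s" and v is not None:
            cells[coord] = shared[int(v.text)]      # index into the shared-string table
        elif c.get("t") == "inlineStr":
            cells[coord] = "".join(t.text or "" for t in c.iter(S_NS + "t"))
        elif v is not None:
            cells[coord] = v.text                   # numbers and other literal values
    return cells
```

Because spreadsheet cells already carry row and column references, this output maps directly onto the coordinates used in steps 2 and 3.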
A text file is a computer file made up of lines of characters. Because of their simple structure, text files are widely used for recording information. Text files come in many formats; commonly used ones include ASCII, MIME, and TEXT. Files in these formats can be read and written as streams. The system reads the text file through a file stream and extracts the file structure and file content, completing the reading of the electronic document data.
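Reading a text file through a stream might look like the following minimal sketch. The helper names and the UTF-8 default are assumptions; real clearance documents may use other encodings.

```python
import io

def read_lines(stream) -> list:
    """Read a text stream line by line, dropping only the line terminators."""
    return [line.rstrip("\r\n") for line in stream]

def read_text_file(path: str, encoding: str = "utf-8") -> list:
    """Open a file as a text stream and return its lines in order."""
    # newline="" keeps the original terminators so read_lines can strip them uniformly.
    with open(path, "r", encoding=encoding, newline="") as fh:
        return read_lines(fh)

# The same helper works on any text stream, such as an in-memory one.
sample = io.StringIO("Invoice No: INV-001\nConsignee: ACME Ltd\n")
lines = read_lines(sample)
```

Keeping the order of lines and the position of characters within each line intact is what makes the later row and column coordinates meaningful.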
In steps 2 and 3, the content of the document is read row by row. During parsing, each row is read sequentially, character by character and line after line, and the coordinate value of each character is obtained. During the structural conversion, the document is fully preserved according to the integrity constraints of the table mapping, and the structural mapping and semantic mapping between the document and the historical database remain complete and valid.
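The character-by-character reading described above can be sketched as follows. `char_coordinates` yields a zero-based (row, column) pair for every character. `token_coordinates` additionally groups consecutive non-space characters into tokens keyed by the coordinate of their first character; that grouping is an assumption added for illustration, since the patent only describes per-character coordinates.

```python
def char_coordinates(lines):
    """Yield (row, col, char) for every character, reading line by line."""
    for row, line in enumerate(lines):
        for col, ch in enumerate(line):
            yield row, col, ch

def token_coordinates(lines):
    """Map (row, col) of each token's first character to the token text."""
    tokens = {}
    for row, line in enumerate(lines):
        start = None
        # A trailing sentinel space flushes a token that ends at the line's end.
        for col, ch in enumerate(line + " "):
            if ch.isspace():
                if start is not None:
                    tokens[(row, start)] = line[start:col]
                    start = None
            elif start is None:
                start = col
    return tokens

lines = ["Invoice INV-001", "  Weight 120"]
tokens = token_coordinates(lines)
# tokens == {(0, 0): "Invoice", (0, 8): "INV-001", (1, 2): "Weight", (1, 9): "120"}
```

Coordinates produced this way can be looked up against a stored template, so a value such as "INV-001" is labeled by where it appears rather than by what it contains.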
The embodiments of the present invention have been described in detail, but the present invention is not limited to the described embodiments. It will be apparent to those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, and such variations remain within the scope of protection of the invention.

Claims (4)

1. A method for converting unstructured data to structured data, comprising the steps of:
step 1, reading an electronic document and acquiring its text content according to the file format;
step 2, obtaining the marked row and column coordinates of the text content;
step 3, matching each marked coordinate, one by one, against the label at the corresponding coordinate of the file templates stored in the historical database;
step 4, correcting the matched labels and modifying the labels corresponding to the coordinates;
step 5, forming structured field columns and the values corresponding to those field columns from the labels and the text content at the coordinates;
step 6, completing the conversion from unordered unstructured text to structured fields.
2. The method according to claim 1, wherein the electronic document comprises a text document, a data document, a Word document, an Excel document, a PDF document, or a picture document.
3. The method according to claim 1, wherein in step 3, if no label in the historical database is matched, the historical database is updated and an updated document template is added.
4. The method according to claim 3, wherein in step 3, error correction is performed on the labels matched in the historical database, and the labels at the corresponding coordinates are modified.
CN202010140412.0A 2020-03-03 2020-03-03 Method for converting unstructured data into structured data Pending CN111401007A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010140412.0A CN111401007A (en) 2020-03-03 2020-03-03 Method for converting unstructured data into structured data

Publications (1)

Publication Number Publication Date
CN111401007A 2020-07-10

Family

ID=71430482

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010140412.0A Pending CN111401007A (en) 2020-03-03 2020-03-03 Method for converting unstructured data into structured data

Country Status (1)

Country Link
CN (1) CN111401007A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112364857A (en) * 2020-10-23 2021-02-12 中国平安人寿保险股份有限公司 Image recognition method and device based on numerical extraction and storage medium
CN112861486A (en) * 2021-04-25 2021-05-28 成都淞幸科技有限责任公司 Data integration method, device, equipment and storage medium of semi-structured file
CN113821555A (en) * 2021-08-26 2021-12-21 陈仲永 Unstructured data collection processing method of intelligent supervision black box

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170147540A1 (en) * 2015-11-24 2017-05-25 Bank Of America Corporation Transforming unstructured documents
CN108021632A (en) * 2017-11-23 2018-05-11 中国移动通信集团河南有限公司 Unstructured data and the mutual conversion process method of structural data
CN109840519A (en) * 2019-01-25 2019-06-04 青岛盈智科技有限公司 A kind of adaptive intelligent form recognition input device and its application method
CN110751143A (en) * 2019-09-26 2020-02-04 中电万维信息技术有限责任公司 Electronic invoice information extraction method and electronic equipment

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112364857A (en) * 2020-10-23 2021-02-12 中国平安人寿保险股份有限公司 Image recognition method and device based on numerical extraction and storage medium
CN112364857B (en) * 2020-10-23 2024-04-26 中国平安人寿保险股份有限公司 Image recognition method, device and storage medium based on numerical extraction
CN112861486A (en) * 2021-04-25 2021-05-28 成都淞幸科技有限责任公司 Data integration method, device, equipment and storage medium of semi-structured file
CN113821555A (en) * 2021-08-26 2021-12-21 陈仲永 Unstructured data collection processing method of intelligent supervision black box

Similar Documents

Publication Publication Date Title
Pletschacher et al. The page (page analysis and ground-truth elements) format framework
CN111401007A (en) Method for converting unstructured data into structured data
US20090089696A1 (en) Graphical creation of a document conversion template
CN111783710B (en) Information extraction method and system for medical photocopy
CN109002425B (en) Method for acquiring upstream and downstream relations of enterprise, terminal device and medium
US20240028832A1 (en) Natural language processing text-image-layout transformer
CN116629227A (en) Method and equipment for converting text into SQL (structured query language) sentence
CN110717333A (en) Method and device for automatically generating article abstract and computer readable storage medium
CN111695330B (en) Method and device for generating table, electronic equipment and computer readable storage medium
CN111651994B (en) Information extraction method and device, electronic equipment and storage medium
CN118194842A (en) Intelligent document identification method and device, electronic equipment and storage medium
Hardisty et al. The specimen data refinery: a canonical workflow framework and FAIR digital object approach to speeding up digital mobilisation of natural history collections
CN113723063B (en) Method for converting RTF (real time transport format) into HTML (hypertext markup language) and realizing effect in PDF (portable document format) file
Thompson et al. Identification of herbarium specimen sheet components from high‐resolution images using deep learning
CN116415562B (en) Method, apparatus and medium for parsing financial data
CN113821555A (en) Unstructured data collection processing method of intelligent supervision black box
CN116127087A (en) Knowledge graph construction method and device, electronic equipment and storage medium
CN114936927A (en) Cross-border remittance document verification method and device
CN107609155B (en) Construction method of data asset model based on XBRL standard
CN111667214A (en) Goods information acquisition method and device based on two-dimensional code and electronic equipment
CN117252201B (en) Knowledge-graph-oriented discrete manufacturing industry process data extraction method and system
CN113836922B (en) Named entity error correction method and device and electronic equipment
Guralnick et al. Ensemble automated approaches for producing high‐quality herbarium digital records
CN117592434A (en) Data document generation method and device, storage medium and electronic equipment
CN117608545B (en) Standard operation program generation method based on knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200710
