CN111401007A - Method for converting unstructured data into structured data - Google Patents
Method for converting unstructured data into structured data
- Publication number
- CN111401007A
- Authority
- CN
- China
- Prior art keywords
- document
- data
- coordinates
- structured
- file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Document Processing Apparatus (AREA)
Abstract
The invention discloses a method for converting unstructured data into structured data, relating to the technical field of information processing. The method comprises: step 1, reading an electronic document and acquiring its text content according to the file format; step 2, obtaining the row and column coordinates that mark the text content; step 3, matching each marked coordinate, one by one, against the label at the corresponding coordinate of a file template stored in a historical database; step 4, correcting the matched labels and modifying the label at the corresponding coordinate; step 5, forming a structured field column, and the value corresponding to that column, from the label and the text content at each coordinate; and step 6, completing the conversion from unordered unstructured text to structured fields. By structuring the label data, the invention generates a structured data file, thereby converting an unstructured data file into structured data, improving the efficiency of document preparation, maximizing the benefit of producing and disseminating information, and facilitating subsequent data analysis and processing.
Description
Technical Field
The invention relates to the technical field of information processing, in particular to a method for converting unstructured data into structured data.
Background
Customs declaration refers to the process in which consignees and consignors of import and export goods, persons in charge of inbound and outbound means of transport, owners of inbound and outbound articles, or their agents handle with customs the entry and exit procedures for goods, articles, or means of transport, together with the related customs affairs. These include declaring to customs, submitting documents and certificates, and accepting customs supervision and inspection. Declaration is one of the necessary steps in completing entry and exit procedures. With the development of computer and network technology, customs in China has largely adopted paperless electronic declaration for import and export goods: the declarant uses the relevant data to file a declaration with customs, a process that includes receiving the order, entering the documents, reviewing the documents, and inspecting the goods. However, preparing the documents is simple, manual, repetitive work, so assembling a full set of clearance data takes a long time and is prone to human error. Data are entered repeatedly during clearance, accuracy is poor, the workload of document review increases, and clearance timeliness drops sharply.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a method for converting unstructured data into structured data. The method comprises the following steps:
step 1, reading an electronic document and acquiring its text content according to the file format;
step 2, obtaining the row and column coordinates that mark the text content;
step 3, matching each marked coordinate, one by one, against the label at the corresponding coordinate of a file template stored in the historical database;
step 4, correcting the matched labels and modifying the label at the corresponding coordinate;
step 5, forming a structured field column, and the value corresponding to that column, from the label and the text content at each coordinate;
step 6, completing the conversion from unordered unstructured text to structured fields.
Preferably, the electronic document is a text document, a data document, a Word document, an Excel document, a PDF document, or a picture document.
Preferably, in step 3, if no label in the historical database matches, the historical database is updated and the updated document template is added to it.
Preferably, in step 3, error correction is performed on the labels matched in the historical database, and the labels at the corresponding coordinates are modified.
The method has the following advantages: by structuring the label data, a structured data file is generated, so that an unstructured data file is converted into structured data, the efficiency of document preparation is improved, the benefit of producing and disseminating information is maximized, and subsequent data analysis and processing are made easier.
Detailed Description
The following further describes the embodiments of the present invention. It should be noted that the description of the embodiments is provided to help understanding of the present invention, but the present invention is not limited thereto. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
At present, most of the documents that enterprises provide for customs clearance are unstructured data. Unstructured data is any data that is not structured data: its structure is not fixed, it cannot be stored in a relational database, and it can only be stored as files of various kinds, such as Word documents, text files, PDF (Portable Document Format) files, and pictures in various image formats. At present, such data can only be processed manually and cannot be processed quickly and efficiently by a system.
Therefore, it is necessary to convert unstructured data into structured data. Structured data has a defined structure, can be divided into fixed basic components, and can be represented by one or more two-dimensional tables. It is typically stored in a database with a defined logical structure. Structured data is convenient to aggregate and count, simple to operate on, and easy to maintain.
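As a minimal illustration of why a two-dimensional table is convenient to aggregate, the sketch below stores a few rows in an in-memory SQLite database and computes totals with one query. The table name, column names, and values are hypothetical and are not taken from the patent.

```python
import sqlite3

# Hypothetical declaration line items; names and values are illustrative only.
rows = [
    ("INV-001", "Widget", 120, 3.50),
    ("INV-001", "Gasket", 40, 0.75),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE declaration_item ("
    "invoice_no TEXT, goods_name TEXT, quantity INTEGER, unit_price REAL)"
)
conn.executemany("INSERT INTO declaration_item VALUES (?, ?, ?, ?)", rows)

# Because every row has the same fixed columns, statistics are a single query.
total_quantity, total_value = conn.execute(
    "SELECT SUM(quantity), SUM(quantity * unit_price) FROM declaration_item"
).fetchone()
print(total_quantity, total_value)
```

The same totals would require format-specific parsing if the line items lived only in a PDF or a scanned picture.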
This embodiment provides a method for converting unstructured data into structured data. The method comprises the following steps:
step 1, reading an electronic document, which may be a text document, a data document, a Word document, an Excel document, a PDF document, or a picture document, and acquiring its text content according to the file format;
step 2, obtaining the row and column coordinates that mark the text content;
step 3, matching each marked coordinate, one by one, against the label at the corresponding coordinate of a file template stored in the historical database; if no label in the historical database matches, updating the historical database and adding the updated document template; performing error correction on the labels matched in the historical database and modifying the labels at the corresponding coordinates;
step 4, correcting the matched labels and modifying the label at the corresponding coordinate;
step 5, forming a structured field column, and the value corresponding to that column, from the label and the text content at each coordinate;
step 6, completing the conversion from unordered unstructured text to structured fields.
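Steps 3 to 5 can be sketched roughly as follows. This is an illustrative reading of the steps, not the patent's implementation: the `Template` class stands in for one file template in the historical database, and the function names, the dictionary-based storage, and the choice to write corrections back into the template are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Coord = Tuple[int, int]  # (row, column), zero-based in this sketch


@dataclass
class Template:
    """Hypothetical stand-in for one file template in the historical database."""
    name: str
    labels: Dict[Coord, str] = field(default_factory=dict)


def match_labels(cells: Dict[Coord, str], template: Template) -> Dict[Coord, Optional[str]]:
    """Step 3: look up the template label at each marked coordinate (None if absent)."""
    return {coord: template.labels.get(coord) for coord in cells}


def correct_labels(
    matched: Dict[Coord, Optional[str]],
    corrections: Dict[Coord, str],
    template: Template,
) -> Dict[Coord, Optional[str]]:
    """Step 4: apply corrections and store them in the template (assumed update rule)."""
    fixed = dict(matched)
    for coord, label in corrections.items():
        fixed[coord] = label
        template.labels[coord] = label
    return fixed


def to_record(cells: Dict[Coord, str], labels: Dict[Coord, Optional[str]]) -> Dict[str, str]:
    """Step 5: pair each label (field column) with the text at its coordinate."""
    return {label: cells[coord] for coord, label in labels.items() if label is not None}


template = Template("invoice_v1", {(0, 0): "invoice_no", (1, 0): "consignee"})
cells = {(0, 0): "INV-001", (1, 0): "ACME Ltd", (2, 0): "120"}

matched = match_labels(cells, template)          # (2, 0) has no label yet
labels = correct_labels(matched, {(2, 0): "quantity"}, template)
record = to_record(cells, labels)
print(record)
```

Writing the correction back into the template means the next document with the same layout matches all three coordinates without manual input, which is one plausible reading of "updating the historical database".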
In step 1, text documents, data documents, Word documents, Excel documents, PDF documents, picture documents, and so on do not share a common format; each is parsed and converted into a text document according to the characteristics of its own file type.
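One way to organize format-specific parsing is a registry that maps a file extension to a reader function. The sketch below is an assumption about structure only: the `READERS` table and function names are hypothetical, and only plain-text and CSV readers are implemented, using the standard library. Readers for Word, Excel, PDF, or pictures would be registered in the same way.

```python
import csv
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def read_plain_text(path: Path) -> List[str]:
    """Return the lines of a UTF-8 text file."""
    return path.read_text(encoding="utf-8").splitlines()


def read_csv(path: Path) -> List[str]:
    """Return one tab-joined line per CSV row."""
    with path.open(newline="", encoding="utf-8") as handle:
        return ["\t".join(row) for row in csv.reader(handle)]


# Hypothetical registry: file suffix -> reader that yields text lines.
READERS: Dict[str, Callable[[Path], List[str]]] = {
    ".txt": read_plain_text,
    ".csv": read_csv,
}


def extract_lines(path: Path) -> List[str]:
    """Pick a reader by file suffix and convert the file to text lines."""
    reader = READERS.get(path.suffix.lower())
    if reader is None:
        raise NotImplementedError(f"no reader registered for {path.suffix!r}")
    return reader(path)


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "items.csv"
    sample.write_text("goods,quantity\nWidget,120\n", encoding="utf-8")
    lines = extract_lines(sample)
print(lines)
```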
For example, reading a Word document requires a Word API: the content is read through the API, and the converted data are written out as a text document in the format the program requires, which completes the conversion from Word document to text document. A picture file is stored in binary form, so the text in the picture must be extracted by OCR (optical character recognition).
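The patent does not name a specific Word API. As a stand-in, the sketch below reads paragraph text from a `.docx` file directly, relying on the fact that a `.docx` file is a ZIP package whose main body is stored in `word/document.xml`. The helper `make_demo_docx` exists only to produce input for the demonstration; its output contains just that one part and is not a complete Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import List

# WordprocessingML main namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(data: bytes) -> List[str]:
    """Return the text of each paragraph found in a .docx package."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    return [
        "".join(node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        for para in root.iter(f"{{{W_NS}}}p")
    ]


def make_demo_docx(paragraphs: List[str]) -> bytes:
    """Build a bare package holding only word/document.xml (demo input only)."""
    body = "".join(f"<w:p><w:r><w:t>{text}</w:t></w:r></w:p>" for text in paragraphs)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", xml)
    return buffer.getvalue()


demo = make_demo_docx(["Invoice No: INV-001", "Consignee: ACME Ltd"])
paragraphs = docx_paragraphs(demo)
print(paragraphs)
```

Picture files are not covered here; they would need an OCR engine, which is outside the standard library.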
Reading an Excel document requires an Excel API. The method uses the Excel API to read the content and format of every cell in the document, which completes the conversion from Excel document to text document.
A text file is a computer file made up of lines of characters. Because of their simple structure, text files are widely used for recording information. Text files come in many formats; commonly used ones include ASCII, MIME, and TEXT. Files in these formats can be read and written as data streams. The system reads the text file through a file stream and extracts its structure and content, which completes the reading of the electronic document data.
In steps 2 and 3, the content of the document is read row by row. During parsing, each row is read sequentially, character by character, and the coordinate value of each character is obtained. Throughout the structural conversion, the integrity constraints of the table mapping ensure that the document is fully preserved and that its structural and semantic mappings to the historical database are complete and valid.
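The row-by-row, character-by-character reading can be sketched as a generator that yields a coordinate for every character. The zero-based (row, column) convention and the function name are assumptions; the patent does not specify how coordinates are numbered.

```python
import io
from typing import Iterator, TextIO, Tuple


def char_coordinates(stream: TextIO) -> Iterator[Tuple[int, int, str]]:
    """Yield (row, column, character) for every character, reading line by line."""
    for row, line in enumerate(stream):
        # Line terminators are dropped so they do not receive coordinates.
        for col, char in enumerate(line.rstrip("\r\n")):
            yield row, col, char


sample = io.StringIO("No: 7\nQty 12\n")
coords = list(char_coordinates(sample))
print(coords[0], coords[-1], len(coords))
```

Any text stream works as input, so the same function can consume the text produced from Word, Excel, or OCR output once it has been converted to lines.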
The embodiments of the present invention have been described in detail, but the invention is not limited to the described embodiments. It will be apparent to those skilled in the art that various changes, modifications, substitutions, and alterations can be made to these embodiments without departing from the principles and spirit of the invention, and such variations still fall within the scope of protection of the invention.
Claims (4)
1. A method for converting unstructured data into structured data, comprising the steps of:
step 1, reading an electronic document and acquiring its text content according to the file format;
step 2, obtaining the row and column coordinates that mark the text content;
step 3, matching each marked coordinate, one by one, against the label at the corresponding coordinate of a file template stored in the historical database;
step 4, correcting the matched labels and modifying the label at the corresponding coordinate;
step 5, forming a structured field column, and the value corresponding to that column, from the label and the text content at each coordinate;
step 6, completing the conversion from unordered unstructured text to structured fields.
2. The method according to claim 1, wherein the electronic document comprises a text document, a data document, a Word document, an Excel document, a PDF document, or a picture document.
3. The method according to claim 1, wherein in step 3, if no label in the historical database matches, the historical database is updated and the updated document template is added.
4. The method according to claim 3, wherein in step 3, error correction is performed on the labels matched in the historical database, and the labels at the corresponding coordinates are modified.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010140412.0A CN111401007A (en) | 2020-03-03 | 2020-03-03 | Method for converting unstructured data into structured data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010140412.0A CN111401007A (en) | 2020-03-03 | 2020-03-03 | Method for converting unstructured data into structured data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111401007A (en) | 2020-07-10 |
Family
ID=71430482
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010140412.0A Pending CN111401007A (en) | 2020-03-03 | 2020-03-03 | Method for converting unstructured data into structured data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111401007A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112364857A (en) * | 2020-10-23 | 2021-02-12 | 中国平安人寿保险股份有限公司 | Image recognition method and device based on numerical extraction and storage medium |
CN112861486A (en) * | 2021-04-25 | 2021-05-28 | 成都淞幸科技有限责任公司 | Data integration method, device, equipment and storage medium of semi-structured file |
CN113821555A (en) * | 2021-08-26 | 2021-12-21 | 陈仲永 | Unstructured data collection processing method of intelligent supervision black box |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170147540A1 (en) * | 2015-11-24 | 2017-05-25 | Bank Of America Corporation | Transforming unstructured documents |
CN108021632A (en) * | 2017-11-23 | 2018-05-11 | 中国移动通信集团河南有限公司 | Unstructured data and the mutual conversion process method of structural data |
CN109840519A (en) * | 2019-01-25 | 2019-06-04 | 青岛盈智科技有限公司 | A kind of adaptive intelligent form recognition input device and its application method |
CN110751143A (en) * | 2019-09-26 | 2020-02-04 | 中电万维信息技术有限责任公司 | Electronic invoice information extraction method and electronic equipment |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112364857A (en) * | 2020-10-23 | 2021-02-12 | 中国平安人寿保险股份有限公司 | Image recognition method and device based on numerical extraction and storage medium |
CN112364857B (en) * | 2020-10-23 | 2024-04-26 | 中国平安人寿保险股份有限公司 | Image recognition method, device and storage medium based on numerical extraction |
CN112861486A (en) * | 2021-04-25 | 2021-05-28 | 成都淞幸科技有限责任公司 | Data integration method, device, equipment and storage medium of semi-structured file |
CN113821555A (en) * | 2021-08-26 | 2021-12-21 | 陈仲永 | Unstructured data collection processing method of intelligent supervision black box |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20200710 |