CN114611466A

CN114611466A - Method and system for extracting effective information of PDF document page elements

Info

Publication number: CN114611466A
Application number: CN202210259864.XA
Authority: CN
Inventors: 萧展辉; 宋云奎; 余芸; 王尧; 沈宇红; 甘杉; 甘莹
Original assignee: Southern Power Grid Digital Grid Research Institute Co Ltd
Current assignee: Southern Power Grid Digital Grid Research Institute Co Ltd
Priority date: 2022-03-16
Filing date: 2022-03-16
Publication date: 2022-06-10

Abstract

The invention discloses a method and a system for extracting effective information of PDF document page elements. The method comprises the following steps: constructing an initial PDF document information extraction model and storing it in a first storage area; acquiring a document analysis rule set; generating a PDF document information extraction rule model according to the initial PDF document information extraction model and the document analysis rule set, and storing the rule model in a second storage area; constructing, according to the initial PDF document information extraction model and the PDF document information extraction rule model, a PDF document information extraction model for extracting effective information of the PDF document; and updating the PDF document information extraction model at a set first interval time according to the initial PDF document information extraction model and the document analysis rule set. According to the text information at the top and the bottom of a page, text is obtained from the preceding and following pages respectively to fill in the text information missing from the page, and the text information is summarized page by page, so that the extracted information is more refined.

Description

Method and system for extracting effective information of PDF document page elements

Technical Field

The invention relates to the technical field of computer information processing, in particular to a method and a system for extracting effective information of a PDF document page element.

Background

PDF is the most common document format for everyday communication, and its display and print results are not affected by the operating system or the device used. PDF documents contain a large amount of text and image information. One method of extracting effective information from PDF pages is to extract the text information contained in a PDF document, filter out interfering and useless information by combining the context through a series of information-processing steps, and extract and store the effective information in a specified manner. Another method parses all the text in a PDF document with a text analysis technique to obtain the full text content. However, because the character codes in a PDF document do not correspond exactly to the displayed characters, and the position at which text is displayed does not correspond exactly to its actual position in the document, this method cannot reliably extract characters from their internal character codes, and even when character extraction succeeds the text content may be out of order.

Disclosure of Invention

In order to solve the above problems, the present invention aims to provide a method for extracting text by combining the coordinates and shape information of the objects on a PDF page with the features of the page itself and of the preceding and following pages. The method overcomes the recognition errors, disordered text content and interference from invalid information found in the prior art, and improves the validity of the extracted PDF text information. In particular, when the information of each page of a document needs to be processed independently, the effective information of a page can be determined by splicing in context from the preceding and following pages.

In order to achieve the technical purpose, the application provides an extraction method for effective information of a page element of a PDF document, which comprises the following steps:

an initial PDF document information extraction model is built and stored in a first storage area, and the initial PDF document information extraction model is used for generating a PDF document information extraction rule model with timeliness;

acquiring a document analysis rule set, wherein the document analysis rule set is used for expressing a rule set for analyzing a PDF document to acquire effective information of the PDF document;

generating a PDF document information extraction rule model according to the initial PDF document information extraction model and the document analysis rule set, and storing the PDF document information extraction rule model in a second storage area;

according to the initial PDF document information extraction model and the PDF document information extraction rule model, a PDF document information extraction model for extracting effective information of the PDF document is constructed;

and updating the PDF document information extraction model according to the initial PDF document information extraction model and the document analysis rule set by setting a first interval time.
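By way of illustration only, the relationship between the initial model, the rule set, the rule model and the periodically updated extraction model can be sketched as follows. The source does not specify the contents of any of these models; the class names, the `ocr_threshold` default, the representation of a rule as a text-to-text function, and the `refresh_if_due` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical: a rule maps raw page text to cleaned text.
Rule = Callable[[str], str]


@dataclass
class InitialModel:
    # Hypothetical defaults; the source does not list the model's contents.
    defaults: Dict[str, float] = field(
        default_factory=lambda: {"ocr_threshold": 0.05}
    )


@dataclass
class ExtractionModel:
    initial: InitialModel
    rule_model: List[Rule]
    built_at: float  # time (in seconds) at which the model was built


def build_rule_model(initial: InitialModel, rules: List[Rule]) -> List[Rule]:
    # In this sketch the "rule model" is simply a snapshot of the rule set.
    return list(rules)


def build_extraction_model(
    initial: InitialModel, rules: List[Rule], now: float
) -> ExtractionModel:
    return ExtractionModel(initial, build_rule_model(initial, rules), now)


def refresh_if_due(
    model: ExtractionModel, rules: List[Rule], now: float, first_interval: float
) -> ExtractionModel:
    """Rebuild the model once the first interval time has elapsed."""
    if now - model.built_at >= first_interval:
        return build_extraction_model(model.initial, rules, now)
    return model
```

In this sketch the caller supplies the current time explicitly, so the update behaviour can be exercised without waiting for a real timer.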

Preferably, in the process of obtaining the effective information of the PDF document by using the PDF document information extraction model, the PDF document is opened, and objects of the PDF document are parsed and arranged in an order from top to bottom and from left to right, where the objects include text boxes, pictures, rectangular boxes, and curves.
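The top-to-bottom, left-to-right arrangement can be sketched as a sort on object coordinates. This is a minimal sketch, assuming each parsed object carries a bounding box in default PDF user space (origin at the bottom-left corner, so a larger `y` is higher on the page); the `PageObject` class and its field names are hypothetical and do not come from the source.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageObject:
    kind: str  # "text", "image", "rect" or "curve"
    x0: float  # left edge
    y0: float  # bottom edge
    x1: float  # right edge
    y1: float  # top edge
    text: str = ""


def reading_order(objects: List[PageObject]) -> List[PageObject]:
    """Sort top to bottom (descending top edge), then left to right."""
    return sorted(objects, key=lambda o: (-o.y1, o.x0))
```

Objects on the same visual line whose top edges differ slightly would need a tolerance when comparing `y1`; that refinement is omitted here.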

Preferably, after the objects of the PDF document are acquired, the mark position of a key mark is searched for among the objects, the page is divided into a plurality of different information areas according to the mark position, and the information of the different areas is extracted and stored in variables.

Preferably, in the process of searching for the position of the key mark in the objects, the key mark is a line whose length exceeds that of a dash;

the mark position is used to indicate a position for dividing the information areas.

Preferably, in the process of dividing the page into a plurality of different information areas, the different information areas include a header, a chapter number, a page number, a body area, a footer, a page number, and a comment.
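As an illustration of how line-type key marks could divide a page, the following sketch looks for horizontal lines longer than a dash near the top and bottom of the page and classifies an object as header, body or footer relative to them. It assumes PDF coordinates with the origin at the bottom-left; the `min_length` and `band` values and both function names are hypothetical, and marks based on characters of a special size are not covered.

```python
from typing import List, Optional, Tuple

Line = Tuple[float, float, float]  # horizontal line as (x0, x1, y)


def find_rule_marks(
    lines: List[Line],
    page_height: float,
    min_length: float = 40.0,  # assumed "longer than a dash" threshold
    band: float = 0.15,        # assumed fraction of the page near each edge
) -> Tuple[Optional[float], Optional[float]]:
    """Return (header_y, footer_y); None where no mark is found."""
    header_y: Optional[float] = None
    footer_y: Optional[float] = None
    for x0, x1, y in lines:
        if abs(x1 - x0) <= min_length:
            continue  # too short to be a key mark
        if y >= page_height * (1 - band):
            # Keep the lowest rule in the top band.
            header_y = y if header_y is None else min(header_y, y)
        elif y <= page_height * band:
            # Keep the highest rule in the bottom band.
            footer_y = y if footer_y is None else max(footer_y, y)
    return header_y, footer_y


def classify(
    y_top: float,
    y_bottom: float,
    header_y: Optional[float],
    footer_y: Optional[float],
) -> str:
    """Assign an object to the header, body or footer area."""
    if header_y is not None and y_bottom >= header_y:
        return "header"
    if footer_y is not None and y_top <= footer_y:
        return "footer"
    return "body"
```

Lines outside the two edge bands are ignored, which reflects the statement elsewhere in the description that a mark is valid only within a certain range of the top and bottom of the page.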

Preferably, in the process of obtaining the effective information of the PDF document, whether the text of a page begins at the start of a paragraph is determined according to the different information areas; if not, the text of the previous page from its last punctuation mark to its end is taken and spliced to the head of the text of the current page.

Preferably, in the process of obtaining the effective information of the PDF document, density information of the text in the text area is obtained; if the density is less than a specified percentage, OCR is used for recognition; otherwise, the text is extracted according to the position of the page area in which it is located, wherein the process of extracting the text includes: page number extraction and validation, title extraction, text extraction and annotation extraction.
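The density test can be sketched as follows. The source does not define how density is measured; this sketch assumes it is the fraction of the text area covered by the bounding boxes of extractable text, and the `min_density` value of 5% is a hypothetical stand-in for the "specified percentage". No OCR library is called; the function only decides which path to take.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def box_area(box: Box) -> float:
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def text_density(text_boxes: List[Box], region: Box) -> float:
    """Assumed measure: share of the region covered by text boxes."""
    region_area = box_area(region)
    if region_area == 0.0:
        return 0.0
    covered = sum(box_area(b) for b in text_boxes)
    return min(1.0, covered / region_area)


def choose_extraction(
    text_boxes: List[Box], region: Box, min_density: float = 0.05
) -> str:
    """Return "ocr" for sparse text layers, otherwise "position"."""
    if text_density(text_boxes, region) < min_density:
        return "ocr"
    return "position"
```

A scanned page usually has few or no text boxes, so under this assumed measure it falls below the threshold and is routed to OCR.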

Preferably, in the process of judging whether the paragraph starts, the coordinate range of the valid information is determined according to the mark position, and all objects outside the range are excluded;

the objects within the coordinate range are traversed in sequence and their text information is extracted; according to the information completion preference, whether the text information at the top or the bottom of the page is the beginning of a paragraph is judged, on the following basis: the first character at the top is not flush with the margin and/or the text at the bottom does not end with a punctuation mark.
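The paragraph-start judgment and the splicing of text from the previous page can be sketched as follows. This is a sketch under stated assumptions: an indented first character (one that is not flush with the left margin) is taken to mark the start of a paragraph, the set of sentence-ending punctuation in `SENTENCE_END` is a guess, and all function names are hypothetical.

```python
# Assumed set of sentence-ending marks (Chinese and Western).
SENTENCE_END = "。！？；.!?;"


def top_starts_paragraph(
    first_char_x: float, left_margin: float, tol: float = 1.0
) -> bool:
    """Assumption: a first character that is not flush with the margin
    (i.e. indented) begins a new paragraph."""
    return abs(first_char_x - left_margin) > tol


def ends_with_punctuation(text: str) -> bool:
    stripped = text.rstrip()
    return bool(stripped) and stripped[-1] in SENTENCE_END


def trailing_fragment(prev_text: str) -> str:
    """Text of the previous page after its last sentence-ending mark."""
    stripped = prev_text.rstrip()
    last = max(stripped.rfind(ch) for ch in SENTENCE_END)
    return stripped[last + 1:].lstrip()


def complete_top(prev_text: str, page_text: str, starts_paragraph: bool) -> str:
    """Prepend the unfinished sentence from the previous page if needed."""
    if starts_paragraph:
        return page_text
    return trailing_fragment(prev_text) + page_text
```

When the previous page already ends with a punctuation mark, `trailing_fragment` returns an empty string and the page text is left unchanged. Completing the bottom of a page from the following page would work symmetrically and is omitted.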

The invention also discloses a system for extracting the effective information of PDF document page elements, comprising:

the data acquisition module is used for acquiring PDF documents;

the data analysis module is used for constructing an initial PDF document information extraction model and storing the initial PDF document information extraction model in a first storage area, wherein the initial PDF document information extraction model is used for generating a PDF document information extraction rule model with timeliness; acquiring a document analysis rule set, wherein the document analysis rule set is used for expressing a rule set for analyzing a PDF document to acquire effective information of the PDF document; generating a PDF document information extraction rule model according to the initial PDF document information extraction model and the document analysis rule set, and storing the PDF document information extraction rule model in a second storage area; according to the initial PDF document information extraction model and the PDF document information extraction rule model, a PDF document information extraction model for extracting effective information of the PDF document is constructed; updating a PDF document information extraction model according to the initial PDF document information extraction model and a document analysis rule set by setting a first interval time;

the data correction module records the updating time point and the used document analysis rule of each effective information in the generated PDF document effective information, and generates a PDF document information extraction index model according to the effective information, the updating time point and the document analysis rule; merging the PDF document information extraction index model into a PDF document information extraction model; and according to the second interval time, according to the document analysis rule and the updating time point, judging the accuracy of the corresponding effective information.
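The index kept by the data correction module can be illustrated with a small sketch. The source does not say how accuracy is judged from the rule and the update time point; this sketch assumes that an item is suspect when the rule that produced it has changed since the item was last updated. The `IndexEntry` class, the `rule_changed_at` mapping and the `stale_entries` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class IndexEntry:
    info: str          # the extracted effective information
    rule_id: str       # identifier of the document analysis rule used
    updated_at: float  # update time point (seconds)


def stale_entries(
    index: List[IndexEntry], rule_changed_at: Dict[str, float]
) -> List[IndexEntry]:
    """Assumed accuracy check, to be run once per second interval time:
    flag entries extracted before their rule last changed."""
    return [
        entry
        for entry in index
        if rule_changed_at.get(entry.rule_id, float("-inf")) > entry.updated_at
    ]
```

Entries whose rule is unknown to `rule_changed_at` are treated as up to date in this sketch.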

Preferably, the extraction system further comprises:

the first storage unit is used for storing an initial PDF document information extraction model;

the second storage unit is used for storing a PDF document information extraction rule model;

and the third storage unit is used for storing the PDF document information extraction index model.

The invention discloses the following technical effects:

The method takes specific elements (objects) in the PDF page as marks to limit the effective range of information extraction. Within that range, the text objects are arranged in coordinate order and their text is then extracted in sequence, which prevents the text content from becoming disordered. According to the text information at the top and the bottom of the page, text is obtained from the preceding and following pages respectively to complete the text information missing from the page. The text information is summarized page by page, so that the information is more refined.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of information extraction according to the present invention;

FIG. 2 is a schematic diagram illustrating an information completion preference process according to the present invention;

FIG. 3 is a schematic flow chart of the method of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as presented in the figures, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.

As shown in FIGS. 1-3, the present invention provides a method for extracting valid information of a PDF document page element, comprising the following steps:

Further preferably, in the process of obtaining the effective information of the PDF document by using the PDF document information extraction model, the PDF document is opened, and objects of the PDF document are parsed and arranged in an order from top to bottom and from left to right, where the objects include text boxes, pictures, rectangular boxes, and curves.

Further preferably, after the process of acquiring the object of the PDF document, according to the object of the PDF document, the mark position of the key mark in the object is searched, and the page is divided into a plurality of different information areas according to the mark position, and the information of the different areas is extracted and stored in the variable.

Further preferably, in the process of searching the position of the key mark in the object, the key mark is used for representing a line exceeding the length of the dash;

Further preferably, in the process of dividing the page into a plurality of different information areas, the different information areas include a header, a chapter number, a page number, a body area, a footer, a page number, and a comment.

Further preferably, in the process of obtaining the effective information of the PDF document, whether the text of a page begins at the start of a paragraph is judged according to the different information areas; if not, the text of the previous page from its last punctuation mark to its end is taken and spliced to the head of the text of the current page.

Further preferably, in the process of obtaining the effective information of the PDF document, density information of a text in the text area is obtained, if the density information is less than a specified percentage, OCR recognition is used, otherwise, extraction is performed according to a position of a page area where the text is located, where the process of extracting the text includes: page number extraction validation, title extraction, text extraction and annotation extraction.

Further preferably, in the process of judging whether the paragraph starts, the coordinate range of the effective information is determined according to the mark position, and all objects outside the range are excluded;

the data acquisition module is used for acquiring PDF documents;

Further preferably, the extraction system further comprises:

and the third storage unit is used for storing the PDF document information extraction index model type.

Example 1: the basic technical scheme for PDF document extraction is shown in FIG. 1. Building on this scheme, an improvement is made: in addition to the required input document, further input information is included, such as the storage mode (local document or database system) and the information completion preference (whether or not to complete text according to page context), as shown in FIG. 2.

Furthermore, when only the folder path containing the input documents is specified, the program automatically traverses all eligible PDF documents, and the effective information of every document is automatically processed, extracted and stored as specified.
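The folder traversal can be sketched with the Python standard library. The source does not state what makes a document "eligible" or whether subfolders are searched; this sketch assumes that any file with a `.pdf` extension (case-insensitive) qualifies and that the search is recursive. The function name `list_pdf_files` is hypothetical.

```python
from pathlib import Path
from typing import List


def list_pdf_files(folder: str) -> List[Path]:
    """Return all PDF files under `folder`, in sorted path order."""
    root = Path(folder)
    if not root.is_dir():
        # Mirrors the "report an error and exit" behaviour for bad input.
        raise FileNotFoundError(f"input folder not found: {folder}")
    return [
        path
        for path in sorted(root.rglob("*"))
        if path.is_file() and path.suffix.lower() == ".pdf"
    ]
```

Each returned path would then be passed to the per-document extraction process described next.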

The invention provides a method for extracting characters of a PDF document, which comprises the following specific processes: opening a specified PDF document, and if the specified PDF document does not exist, directly reporting an error and exiting;

all objects in the page are analyzed and arranged from top to bottom and from left to right according to coordinates, and generally comprise text boxes, pictures, rectangular boxes, curves and the like.

The positions of key marks are searched for among the obtained objects. A key mark is generally a line longer than a dash (such a mark is valid only within a certain range of the top and the bottom of the page) or the position of the first or the last character of a special size. The page is divided into a plurality of different information areas according to the mark positions (the areas are generally the title, chapter number and page number of the header; the text area; and the footer, page number and annotations), and the information of the different areas is extracted and stored in variables;

the information items are stored page by page. For efficiency, the information may instead be stored in one batch after the entire document has been parsed, i.e., D1 shown in FIG. 1 is placed after the close-document operation.

If necessary, the effective information of a page can be supplemented to some extent from the information of the preceding and following pages. For example, if a page has no page number, its page number can be calculated and finally determined from the page numbers of the preceding and following pages together with the electronic page number of the electronic document.
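The page number calculation can be sketched as follows. This is a minimal sketch, assuming printed page numbers increase by one per electronic page, so a missing number can be derived from the nearest numbered neighbour and the difference in electronic page index. The function name and the rule of leaving a number undetermined when the two neighbours disagree are assumptions, not taken from the source.

```python
from typing import List, Optional


def infer_page_numbers(printed: List[Optional[int]]) -> List[Optional[int]]:
    """printed[i] is the printed number on electronic page i, or None."""
    result = list(printed)
    for i, value in enumerate(printed):
        if value is not None:
            continue
        # Nearest numbered page before and after page i.
        prev = next(
            ((j, printed[j]) for j in range(i - 1, -1, -1)
             if printed[j] is not None),
            None,
        )
        nxt = next(
            ((j, printed[j]) for j in range(i + 1, len(printed))
             if printed[j] is not None),
            None,
        )
        from_prev = prev[1] + (i - prev[0]) if prev else None
        from_next = nxt[1] - (nxt[0] - i) if nxt else None
        if from_next is not None and from_next < 1:
            from_next = None  # a printed page number below 1 is implausible
        if from_prev is not None and from_next is not None:
            # Accept only when both neighbours agree (assumed policy).
            result[i] = from_prev if from_prev == from_next else None
        else:
            result[i] = from_prev if from_prev is not None else from_next
    return result
```

For example, a page between printed pages 1 and 3 is assigned 2, while a page between printed pages 1 and 7 is left undetermined because the numbering evidently restarts or skips.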

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus once an item is defined in one figure, it need not be further defined and explained in subsequent figures, and moreover, the terms "first", "second", "third", etc. are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.

Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed herein, anyone skilled in the art can still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not depart from the spirit and scope of the present invention, and are intended to be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for extracting effective information of PDF document page elements is characterized by comprising the following steps:

according to the initial PDF document information extraction model and the PDF document information extraction rule model, constructing a PDF document information extraction model for extracting the effective information of the PDF document;

2. The method for extracting the effective information of the PDF document page elements as claimed in claim 1, wherein:

in the process of obtaining the effective information of the PDF document by using the PDF document information extraction model, opening the PDF document, and analyzing and arranging objects of the PDF document according to the sequence from top to bottom and from left to right, wherein the objects comprise text boxes, pictures, rectangular boxes and curves.

3. The method for extracting the effective information of the PDF document page elements as claimed in claim 2, wherein:

after the process of obtaining the object of the PDF document, searching the mark position of a key mark in the object according to the object of the PDF document, dividing a page into a plurality of different information areas according to the mark position, extracting the information of the different areas and storing the information into variables.

4. The method for extracting the effective information of the PDF document page elements as claimed in claim 3, wherein:

in the process of searching the position of a key mark in the object, the key mark is used for representing a line which exceeds the length of a dash;

the mark position is used for indicating the division of the information area.

5. The method for extracting the effective information of the PDF document page elements as claimed in claim 4, wherein:

in the process of dividing the page into a plurality of different information areas, the different information areas comprise headers, chapter numbers, page numbers, text areas, footers, page numbers and comments.

6. The method for extracting the effective information of the PDF document page elements as claimed in claim 5, wherein:

in the process of acquiring the effective information of the PDF document, judging whether the text of a page begins at the start of a paragraph according to the different information areas, and if not, taking the text of the previous page from its last punctuation mark to its end and splicing it to the head of the text of the current page.

7. The method for extracting the effective information of the PDF document page elements as claimed in claim 6, wherein:

in the process of obtaining the effective information of the PDF document, obtaining density information of the text area, if the density information is less than a specified percentage, using OCR (optical character recognition), otherwise, extracting according to the position of the page area where the text is located, wherein the process of extracting the text comprises the following steps: page number extraction validation, title extraction, text extraction and annotation extraction.

8. The method for extracting the effective information of the PDF document page elements as claimed in claim 6, wherein:

in the process of judging whether the paragraph starts, determining the coordinate range of the effective information according to the mark position, and excluding all objects outside the range;

sequentially traversing the objects in the coordinate range, extracting text information, and judging, according to the information completion preference, whether the text information at the top or the bottom is the beginning of a paragraph, wherein the judgment basis is as follows: the first character at the top is not flush with the margin and/or the text at the bottom does not end with a punctuation mark.

9. An extraction system for effective information of a PDF document page element is characterized by comprising:

the data acquisition module is used for acquiring PDF documents;

the data analysis module is used for constructing an initial PDF document information extraction model and storing the initial PDF document information extraction model in a first storage area, wherein the initial PDF document information extraction model is used for generating a time-efficient PDF document information extraction rule model; acquiring a document analysis rule set, wherein the document analysis rule set is used for expressing a rule set for analyzing a PDF document to acquire effective information of the PDF document; generating a PDF document information extraction rule model according to the initial PDF document information extraction model and the document analysis rule set, and storing the PDF document information extraction rule model in a second storage area; according to the initial PDF document information extraction model and the PDF document information extraction rule model, constructing a PDF document information extraction model for extracting the effective information of the PDF document; updating the PDF document information extraction model according to the initial PDF document information extraction model and the document analysis rule set by setting a first interval time;

the data correction module records the updating time point and the used document analysis rule of each effective information in the PDF document effective information, and generates a PDF document information extraction index model according to the effective information, the updating time point and the document analysis rule; merging the PDF document information extraction index model to the PDF document information extraction model; and according to the second interval time, according to the document analysis rule and the update time point, judging the accuracy of the corresponding effective information.

10. The system for extracting the effective information of the PDF document page elements as claimed in claim 9, wherein:

the extraction system further comprises:

the first storage unit is used for storing the initial PDF document information extraction model;

the second storage unit is used for storing the PDF document information extraction rule model;