CN118568056B

CN118568056B - Electronic archive retrieval method, device and medium based on dynamically constructed prompt words

Info

Publication number: CN118568056B
Application number: CN202411045211.7A
Authority: CN
Inventors: 申朝然; 李伟龙; 徐同明; 李伯钊; 鹿海洋; 勇喜; 刘本熙; 赵建华
Original assignee: Inspur General Software Co Ltd
Current assignee: Inspur General Software Co Ltd
Priority date: 2024-08-01
Filing date: 2024-08-01
Publication date: 2024-10-08
Anticipated expiration: 2044-08-01
Also published as: CN118568056A

Abstract

The application discloses an electronic archive retrieval method, device and medium based on dynamically constructed prompt words, relating to the field of electric data processing. The method comprises the following steps: performing text recognition on non-text content in an electronic file to obtain corresponding disordered text data; obtaining key attributes to be extracted based on a pre-trained key information extraction model, and organizing the disordered text data according to the key attributes to obtain corresponding ordered key information; dynamically constructing a prompt word; and taking the prompt word as input and outputting a corresponding search result through a large model. After the full-text recognition result of the electronic file attachment is obtained, the disordered text data is converted into ordered key information by the key information extraction model, so that a prompt word can be dynamically constructed from the key attributes to be extracted and the ordered key information, fused with full-text retrieval constraints adapted to the full-text retrieval library of the electronic archive. A corresponding large model is then called to summarize, yielding an extraction result that can guide full-text retrieval.

Description

Electronic archive retrieval method, device and medium based on dynamically constructed prompt words

Technical Field

The application relates to the field of electric data processing, in particular to an electronic archive retrieval method, device and medium based on dynamically constructing prompt words.

Background

Electronic files have distinctive characteristics: their information is not directly human-readable, it is stored at high density, it is separable from its carrier, and it can integrate multiple information media. They therefore play an indispensable role in managing file entities and file information, providing services, and the like.

As electronic files are applied more widely, it becomes important to ensure the correctness of the information they hold and to retrieve it efficiently.

Traditional full-text retrieval of electronic files can only search the fields of structured file data, such as the attribution unit, formation time, and file title; it cannot effectively retrieve unstructured data such as image attachments of electronic vouchers. As a result, in practice it is difficult for users to quickly and accurately retrieve the electronic vouchers they need (including invoices, receipts, contracts, etc.) through the electronic archive attachment system. This limitation greatly reduces the usability and user experience of the electronic file system.

Disclosure of Invention

In order to solve the above problems, the present application provides an electronic archive retrieval method based on dynamically constructing a prompt word, including:

acquiring an electronic file, and performing text recognition on non-text content in the electronic file to obtain corresponding disordered text data;

obtaining key attributes to be extracted based on a pre-trained key information extraction model, and extracting from the disordered text data to obtain ordered key information corresponding to the key attributes;

dynamically constructing a prompt word according to the key attributes, the ordered key information and a preset full-text retrieval constraint statement; and

taking the prompt word as input, and outputting a corresponding search result through a large model.

On the other hand, the application also provides an electronic archive retrieval device based on dynamically constructing the prompt words, which comprises:

At least one processor; and

A memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor, the instructions enabling the at least one processor to perform the electronic archive retrieval method based on dynamically constructing prompt words according to the above example.

In another aspect, the present application also provides a non-volatile computer storage medium storing computer-executable instructions, the computer-executable instructions being configured to perform the electronic archive retrieval method based on dynamically constructing prompt words according to the above example.

The electronic archive retrieval method based on dynamically constructing the prompt words provided by the application has the following beneficial effects:

To address the retrieval limitation described above, after the full-text recognition result of the electronic file attachment is obtained, the disordered text data is first converted into ordered key information by a key information extraction model. A prompt word can then be dynamically constructed from the key attributes to be extracted and the ordered key information, fused with full-text retrieval constraints adapted to the full-text retrieval library of the electronic archive, and a corresponding large model is called to summarize. This yields an extraction result that can guide full-text retrieval, enabling quick and accurate retrieval of non-text content in the electronic file.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:

FIG. 1 is a flow chart of an electronic archive retrieval method based on dynamically constructing prompt words in an embodiment of the application;

FIG. 2 is a schematic diagram of an electronic archive retrieval device based on dynamically constructing a prompt word according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.

To address the retrieval limitation, searching images by text can be implemented through text recognition and content transfer, so that information in file attachments can also be retrieved. However, when the attachments are large and contain a great deal of text, retrieval over the full recognized text reduces retrieval efficiency. Alternatively, the full-text recognition result can be passed directly to a large model for structured information extraction, so that retrieval is performed only over the key information. Although this improves retrieval efficiency to some extent, passing the full-text recognition result directly to the large model makes the extraction uncontrollable, mainly in the extracted result and the extraction format.

Based on this, as shown in FIG. 1, an embodiment of the present application provides an electronic archive retrieval method based on dynamically constructing a prompt word, including:

S101: Acquire an electronic file, and perform text recognition on non-text content in the electronic file to obtain corresponding disordered text data.

Electronic archives typically include text content and non-text content. Text content refers to data in text form, typically stored in formats such as Word or TXT. Non-text content may include invoices, receipts, business licenses, newspaper bills, and the like, which are typically stored as image files (PNG, JPG, JPEG, JPE, BMP, TIF, TIFF, and the like) or as PDF files.

Text content can generally be retrieved directly. Non-text content, however, requires text recognition to obtain the corresponding disordered text data. The text recognition may be optical character recognition (OCR). OCR recognizes each piece of text data, but it determines only the text itself and its position coordinates in the non-text content; it cannot determine the positional relationships or associations among the pieces of text data. For example, when the non-text content is an invoice of transportation enterprise A, the recognized text data may include: "carrier enterprise", "carrier number", "date", "A Transportation Enterprise", "ABC1234", "2021-01-01", and so on, where "A Transportation Enterprise" and "ABC1234" are exemplary names for a transportation enterprise and its carrier number, respectively. Although these pieces of text data can be recognized, the associations among them cannot: it is difficult to determine that "carrier enterprise" corresponds to "A Transportation Enterprise", "carrier number" to "ABC1234", "date" to "2021-01-01", and so on. Because the associations are disordered, the data is called disordered text data, and it is difficult to extract ordered text content from it. Using it directly for retrieval increases retrieval difficulty and may cause misunderstanding when a user analyzes the file.
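As a rough illustration of what "disordered text data" looks like, the sketch below models the recognition output as a flat list of fragments, each carrying only its text and a bounding box. The `OcrToken` class and all coordinates are assumptions made for this example; they do not come from any particular OCR library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrToken:
    """One recognized fragment: its text and bounding box, nothing else."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


# Hypothetical OCR output for the invoice example; coordinates are invented.
tokens = [
    OcrToken("carrier enterprise", 10, 10, 110, 30),
    OcrToken("carrier number", 10, 40, 110, 60),
    OcrToken("date", 10, 70, 110, 90),
    OcrToken("A Transportation Enterprise", 120, 10, 300, 30),
    OcrToken("ABC1234", 120, 40, 300, 60),
    OcrToken("2021-01-01", 120, 70, 300, 90),
]

# The list is flat: nothing marks which fragment is a label and which is a
# value, or which value belongs to which label.
flat_texts = [t.text for t in tokens]
```

Nothing in this structure says that "ABC1234" is the value of "carrier number"; establishing that link is the job of the key information extraction step.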

S102: Obtain key attributes to be extracted based on a pre-trained key information extraction model, and extract from the disordered text data to obtain ordered key information corresponding to the key attributes.

The key attributes are the attributes required for the current retrieval. Depending on the scenario, for example, they may include "company name", "registration number", "type", "residence", "date of establishment", and the like. The ordered key information is the text information corresponding to a key attribute once the associations among the disordered text data have been established. For example, if the key attribute to be extracted is "carrier enterprise", the key information finally extracted is "A Transportation Enterprise".

The training process of the key information extraction model includes the following. Non-text content is acquired from each of a plurality of electronic files, and the corresponding disordered text data is obtained through text recognition. The plurality of electronic files may be obtained from historical data.

For the disordered text data, the key attributes and key information it contains are determined, and a data set is generated. A key attribute specifies an information category, for example "carrier enterprise", "carrier number", or "date". Key information is the specific value of that category in the current electronic archive, for example "A Transportation Enterprise", "ABC1234", or "2021-01-01".

The mapping between key attributes and key information is then labeled in the data set. Labeling may include marking whether each piece of text data is a key attribute or key information, and marking the correspondence between each key attribute and its key information.
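The two kinds of labels described above (the role of each fragment, and the attribute-to-information correspondence) could be recorded along the following lines. This is a minimal sketch of one possible annotation layout; the field names `fragments`, `role`, and `pairs` are hypothetical and are not the annotation format of any specific labeling tool.

```python
# Hypothetical annotation for one recognized document: every fragment is
# tagged as a key attribute or as key information, and "pairs" links each
# key attribute to its key information by fragment id.
sample = {
    "fragments": [
        {"id": 0, "text": "carrier number", "role": "key_attribute"},
        {"id": 1, "text": "date", "role": "key_attribute"},
        {"id": 2, "text": "ABC1234", "role": "key_information"},
        {"id": 3, "text": "2021-01-01", "role": "key_information"},
    ],
    "pairs": [
        {"attribute": 0, "information": 2},
        {"attribute": 1, "information": 3},
    ],
}


def pairs_as_mapping(annotated):
    """Resolve the labeled pairs into an attribute -> information mapping."""
    by_id = {f["id"]: f["text"] for f in annotated["fragments"]}
    return {
        by_id[p["attribute"]]: by_id[p["information"]]
        for p in annotated["pairs"]
    }


mapping = pairs_as_mapping(sample)
```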

The labeled data set is used as training samples and divided into a training set and a test set, and a Universal Information Extraction (UIE) unified-framework pre-trained model is trained to obtain the key information extraction model.
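The division into a training set and a test set can be sketched as a seeded shuffle followed by a cut. The source gives no split ratio, so the `test_ratio=0.2` default below is purely an assumption, as is the function name.

```python
import random


def split_dataset(samples, test_ratio=0.2, seed=0):
    """Shuffle a copy of the samples and split it into (train, test)."""
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be strictly between 0 and 1")
    shuffled = list(samples)
    # A dedicated Random instance keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Ten placeholder samples stand in for ten labeled documents.
train_set, test_set = split_dataset(range(10))
```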

For the UIE unified-framework pre-trained model, a UIE model based on the Paddle framework may be obtained, and model training parameters are set. The model training parameters include: the initial learning rate, the number of data samples input per training step (batch size), the maximum number of tokens per text, and the number of training rounds. For example, the initial learning rate is set to 1e-5, the number of data samples input at each training step is 32, the maximum number of tokens per text is 512, and the number of training rounds is 100.

The UIE unified-framework pre-trained model is trained with these model training parameters to obtain the key information extraction model, which can find the key attributes to be extracted, and the key information corresponding to them, in the disordered text data of an electronic file.
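The example training parameters can be gathered into a small configuration object, sketched below. The values are the ones given in the text; the field names are hypothetical and are not claimed to match the argument names of any Paddle-based training script.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainingParams:
    """Example values from the description; field names are illustrative."""
    learning_rate: float = 1e-5  # initial learning rate
    batch_size: int = 32         # data samples input per training step
    max_seq_len: int = 512       # maximum number of tokens per text
    num_epochs: int = 100        # number of training rounds


params = TrainingParams()
config = asdict(params)
```

Keeping the parameters in one immutable object makes it straightforward to log them alongside the trained model.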

S103: Dynamically construct a prompt word according to the key attributes, the ordered key information and a preset full-text retrieval constraint statement.

A full-text retrieval constraint statement is a constraint command that must be followed during retrieval. The prompt word is the content that is input to the large model when a retrieval task is issued to it.

Specifically, the business domain to which the electronic archive belongs may be determined. Business domains may include a financial domain, a purchasing domain, and the like, and each may be further subdivided; for example, the purchasing domain may include office supplies purchasing, raw material purchasing, and so on.

A plurality of templates may be preset for each business domain. Among the templates preset for the determined business domain, a specified template containing the key attributes to be extracted is looked up. The specified template comprises a plurality of statements, including search requirement statements and full-text retrieval constraint statements. A full-text retrieval constraint statement constrains the retrieval process or provides the corresponding source material (including the content of the electronic file, the correspondence between key attributes and key information, and the like), while a search requirement statement describes a retrieval requirement in the form of a question, a request, or the like.
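One way to picture the per-domain templates and the lookup of the specified template is sketched below. The `Template` class, the `TEMPLATES` registry, the sample statements, and the rule that a template "contains" the key attributes when it covers all of them are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Template:
    # Maps each key attribute to its search requirement statement.
    requirement_statements: dict
    # Statements that constrain retrieval or supply source material.
    constraint_statements: list


# Hypothetical registry of preset templates, grouped by business domain.
TEMPLATES = {
    "finance": [
        Template(
            requirement_statements={
                "carrier number": "Find archives whose carrier number is {carrier number}.",
                "date": "Find archives dated {date}.",
                "amount": "Find archives whose amount is {amount}.",
            },
            constraint_statements=["Answer only from the supplied archive content."],
        ),
    ],
}


def find_template(domain, key_attributes):
    """Return the first template in the domain covering every key attribute."""
    for template in TEMPLATES.get(domain, []):
        if set(key_attributes) <= set(template.requirement_statements):
            return template
    return None
```

For instance, `find_template("finance", ["date"])` returns the single finance template, while an unknown domain or an uncovered attribute yields `None`.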

Next, the search requirement statements that do not relate to the key attributes to be extracted are deleted. Among the remaining search requirement statements, those containing preset variables are completed with the ordered key information. The complete prompt word is then obtained from the completed search requirement statements and the full-text retrieval constraint statements.

Corresponding preset variables may be set for different search requirement statements, with different preset variables representing different ordered key information. For example, among the remaining search requirement statements, a statement with a preset variable might read: "Capture additional information such as the first key attribute, so as to prompt {first preset variable}." Here the first key attribute and the first preset variable are placeholders that are replaced by the actual content when the prompt word is generated. Suppose the key attribute represented by the first preset variable is "user". The word "user" could itself be substituted for the preset variable with a similar effect, but filling in the specific value corresponding to the user through the preset variable makes the prompt word more complete and closer to what the current electronic file actually expresses, so that the large model can weight it accordingly when reading the prompt. Alternatively, the form "so as to send a prompt to the user {first preset variable}" may be used, which both states that the key attribute is the user and points out the identity corresponding to the value of the first preset variable, improving the retrieval accuracy of the large model.

Then, for each preset variable, the corresponding variable name is determined, the corresponding ordered key information is selected according to the variable name, and that information is placed at the position of the preset variable to complete the remaining search requirement statements.
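The deletion, completion, and assembly steps of S103 can be sketched as follows. This is a minimal illustration under several assumptions: preset variables are written as `{variable name}`, the variable name equals the key attribute name, and the sample statements and function names are hypothetical.

```python
import re

# A preset variable is written as {variable name} in this sketch.
PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def fill_variables(statement, ordered_info):
    """Replace each {variable name} with its ordered key information, if known."""
    def substitute(match):
        name = match.group(1).strip()
        # Leave the placeholder untouched when no value was extracted.
        return ordered_info.get(name, match.group(0))
    return PLACEHOLDER.sub(substitute, statement)


def build_prompt(requirements, constraints, key_attributes, ordered_info):
    """Drop unneeded statements, complete the rest, and append the constraints."""
    kept = [text for attr, text in requirements.items() if attr in key_attributes]
    completed = [fill_variables(text, ordered_info) for text in kept]
    return "\n".join(completed + list(constraints))


requirements = {
    "carrier number": "Find archives whose carrier number is {carrier number}.",
    "date": "Find archives dated {date}.",
    "amount": "Find archives whose amount is {amount}.",
}
constraints = ["Answer only from the supplied archive content."]
prompt = build_prompt(
    requirements,
    constraints,
    key_attributes=["carrier number", "date"],
    ordered_info={"carrier number": "ABC1234", "date": "2021-01-01"},
)
```

Here the "amount" statement is deleted because "amount" is not among the key attributes to be extracted, and the two remaining statements are completed with the extracted values before the constraint statement is appended.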

S104: Take the prompt word as input, and output a corresponding search result through a large model.

After the prompt word is obtained, it is input into the large model, which performs the corresponding retrieval according to the search requirement statements and the full-text retrieval constraint statements and returns the corresponding search result. The large model may be a large language model or a large model trained for a specific field.

The large model has strong comprehension and summarization capabilities. Given the key attributes and the ordered key information, it can deeply understand and analyze the recognized information and generate more ordered and clearer archive information for related data, digital notes and the like. This optimizes the full-text retrieval library, making it clearer and easier to search.

In one embodiment, in addition to non-text content, an electronic archive typically has corresponding text content. The text content often serves as the body of the electronic archive, while the non-text content typically serves as an attachment.

Accordingly, the text content in the electronic archive is obtained; it can be used directly in the corresponding retrieval process. In addition, if the key attribute corresponding to a piece of ordered key information carries a preset mark (set manually, usually on the most important key attributes), this indicates that the key attribute is very important or that the accuracy of its search result is very important. An attachment, as non-text content, is generally associated with the body, as text content, so the ordered key information is likely to appear in identical wording in both. The text content of the electronic archive is therefore searched for the ordered key information to determine whether it contains content consistent with the ordered key information.

If consistent content exists, the text recognition is considered error-free, the recognized disordered text data is taken to be correct, and no further processing is performed. For example, if the key attribute carrying the preset mark is "carrier enterprise" and the corresponding key information is "A Transportation Enterprise", then finding "A Transportation Enterprise" in the text content means the recognition is considered correct. If no consistent content exists, the key attribute or the ordered key information is marked in the prompt word (for example, by changing the font color, size, or background color), so that the user can check it before the prompt word is input into the large model, which addresses recognition errors in the disordered text data.
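The consistency check for marked key attributes might look like the sketch below. Plain substring matching and the `**` wrapper are stand-ins chosen for this example (the text describes visual marking such as font color), and the function names and sample body text are hypothetical.

```python
def find_inconsistent(ordered_info, flagged_attributes, body_text):
    """Return flagged key attributes whose value is absent from the body text."""
    return [
        attr
        for attr in flagged_attributes
        if attr in ordered_info and ordered_info[attr] not in body_text
    ]


def mark_in_prompt(prompt, values_to_mark, marker="**"):
    """Wrap each suspicious value so a user can spot and verify it."""
    for value in values_to_mark:
        prompt = prompt.replace(value, f"{marker}{value}{marker}")
    return prompt


ordered_info = {"carrier number": "ABC1234", "date": "2021-01-01"}
# The body spells the carrier number differently, hinting at a recognition error.
body_text = "Shipment record for carrier number ABC1284, issued 2021-01-01."
suspect = find_inconsistent(ordered_info, ["carrier number", "date"], body_text)
checked_prompt = mark_in_prompt(
    "Find archives whose carrier number is ABC1234.",
    [ordered_info[attr] for attr in suspect],
)
```

Only the carrier number is marked here, because the date appears verbatim in the body text while the carrier number does not.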

In one embodiment, in some cases (for example, part of the recognized disordered text data is missing, or the key information extraction model makes an error), the key information extraction model may fail to extract the key information corresponding to a specified key attribute to be extracted.

In this case, the position coordinates of the specified key attribute in the electronic archive are determined from the disordered text data. When disordered text data is recognized, the coordinate position obtained for each item is usually a coordinate range, and the ordered key information corresponding to a key attribute is usually located near that range.

Accordingly, the coordinate position is expanded in a preset direction. The preset directions usually include at least two directions, for example rightward and downward, although fewer or more directions are possible. During expansion, the region is lengthened along the preset direction while its extent in the perpendicular direction stays unchanged, until it reaches the coordinate position of other data in the disordered text data, that is, until it meets other data. For example, if the preset direction is rightward, the vertical height is kept unchanged and the region is extended to the right until it reaches the coordinates of other data or the boundary. If there are multiple preset directions, their priority is set according to the type of the electronic file (invoice, receipt, and so on), and the direction with the higher priority is extended first.

The ordered key information is generally assumed to lie near the key attribute. Because it was not recognized, it appears as a blank on the electronic archive, that is, it has no position coordinates of its own, so it can usually be captured by this expansion.

The expanded position coordinates are taken as the ordered key information of the specified key attribute. The screenshot of the expanded region does not need to be recognized; it is loaded into the prompt word as an attached image, and the user fills in or deletes the entry after recognizing it manually.
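The expansion of a key attribute's coordinate range can be sketched with axis-aligned boxes. The `Box` class, the page size, and the sample coordinates are assumptions; the sketch covers the rightward and downward directions named in the text and leaves the per-file-type priority to the order in which the caller invokes them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x0: float
    y0: float
    x1: float
    y1: float


def expand_right(box, others, page_width):
    """Extend box rightward, height fixed, until another box or the page edge."""
    limit = page_width
    for other in others:
        overlaps_vertically = other.y0 < box.y1 and other.y1 > box.y0
        if overlaps_vertically and other.x0 >= box.x1:
            limit = min(limit, other.x0)
    return Box(box.x0, box.y0, limit, box.y1)


def expand_down(box, others, page_height):
    """Extend box downward, width fixed, until another box or the page edge."""
    limit = page_height
    for other in others:
        overlaps_horizontally = other.x0 < box.x1 and other.x1 > box.x0
        if overlaps_horizontally and other.y0 >= box.y1:
            limit = min(limit, other.y0)
    return Box(box.x0, box.y0, box.x1, limit)


# Hypothetical layout: a label whose value was not recognized, another label
# below it, and unrelated text further to the right on the same row.
label = Box(10, 40, 110, 60)
others = [Box(10, 70, 110, 90), Box(400, 40, 500, 60)]
region_right = expand_right(label, others, page_width=600)
region_down = expand_down(label, others, page_height=800)
```

In this layout the rightward expansion stops at x = 400, where the next fragment on the same row begins, so the blank area between x = 110 and x = 400 that presumably holds the unrecognized value is captured.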

In one embodiment, after the retrieval is completed, the search result and the retrieval process corresponding to it may also be stored in an Elasticsearch (ES) full-text search library, where the search result includes at least one of: the file identifier of the electronic file, the key attributes to be extracted, the ordered key information, the prompt word, and the model identifier of the large model. This improves the efficiency and accuracy of full-text retrieval and allows subsequent tracing.
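A record carrying the listed fields could be assembled as a JSON document before being sent to the search library, roughly as follows. The field names, the sample identifiers, and the extra `result_text` field are hypothetical, and the sketch deliberately stops short of calling any Elasticsearch client.

```python
import json


def build_search_record(file_id, key_attributes, ordered_info, prompt, model_id, result_text):
    """Assemble one traceable record; all field names are illustrative."""
    return {
        "file_id": file_id,
        "key_attributes": list(key_attributes),
        "ordered_key_information": dict(ordered_info),
        "prompt": prompt,
        "model_id": model_id,
        "result_text": result_text,
    }


record = build_search_record(
    file_id="archive-001",
    key_attributes=["carrier number", "date"],
    ordered_info={"carrier number": "ABC1234", "date": "2021-01-01"},
    prompt="Find archives whose carrier number is ABC1234.",
    model_id="example-model",
    result_text="(model output placeholder)",
)
# Serialize without escaping non-ASCII text, so original-language content
# stays readable in the stored document.
payload = json.dumps(record, ensure_ascii=False)
```

Storing the prompt word and model identifier next to the result is what makes later tracing possible: a questionable result can be traced back to the exact input and model that produced it.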

As shown in FIG. 2, the embodiment of the present application further provides an electronic archive retrieval device based on dynamically constructing a prompt word, including:

At least one processor; and

A memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor, the instructions enabling the at least one processor to perform the electronic archive retrieval method based on dynamically constructing prompt words according to any one of the above embodiments.

The embodiment of the application also provides a nonvolatile computer storage medium storing computer-executable instructions, the computer-executable instructions being configured to perform the electronic archive retrieval method based on dynamically constructing prompt words according to any one of the above embodiments.

The embodiments of the present application are described in a progressive manner. Identical and similar parts of the embodiments may be cross-referenced, and each embodiment focuses on its differences from the others. In particular, the device and medium embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.

The devices and media provided in the embodiments of the present application are in one-to-one correspondence with the methods, so that the devices and media also have similar beneficial technical effects as the corresponding methods, and since the beneficial technical effects of the methods have been described in detail above, the beneficial technical effects of the devices and media are not repeated here.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.

The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.

Claims

1. An electronic archive retrieval method based on dynamically constructing prompt words is characterized by comprising the following steps:

Taking the prompt word as input, and outputting a corresponding search result through a large model;

Dynamically constructing and obtaining a prompt word according to the key attribute, the ordered key information and a preset full text retrieval constraint statement, wherein the method specifically comprises the following steps:

determining the service field to which the electronic file belongs;

Searching a specified template containing the key attribute to be extracted from a plurality of templates preset in the service field; the specified template comprises a plurality of sentences, and the plurality of sentences comprise search requirement sentences and full text search constraint sentences;

Deleting the search requirement sentences which do not belong to the key attributes to be extracted, and completing the search requirement sentences with preset variables in the rest search requirement sentences through the ordered key information;

And obtaining the prompt word according to the completed retrieval requirement statement and the full-text retrieval constraint statement.

2. The method of claim 1, wherein the training process of the key information extraction model comprises:

acquiring corresponding non-text contents in a plurality of electronic files respectively, and acquiring corresponding disordered text data through text recognition;

determining key attributes and key information contained in the out-of-order text data, and generating a data set;

Labeling the mapping relation between the key attribute and the key information aiming at the data set;

And taking the marked data set as a training sample, dividing the training sample into a training set and a testing set, and training the universal information extraction unified frame pre-training model to obtain the key information extraction model.

3. The method according to claim 2, wherein training the unified framework pretraining model for general information extraction to obtain the key information extraction model specifically comprises:

acquiring a universal information extraction unified frame pre-training model based on a Paddle framework;

setting model training parameters; wherein the model training parameters include: initial learning rate, number of data samples input, maximum number of tokens of text, and number of training rounds;

and training the universal information extraction unified frame pre-training model based on the model training parameters to obtain the key information extraction model.

4. The method according to claim 1, wherein the completion of the search requirement statement with the preset variable by the ordered key information in the remaining search requirement statement specifically comprises:

Determining a search demand statement with preset variables in the rest search demand statements, and determining the corresponding variable name of each preset variable;

And selecting corresponding ordered key information according to the variable names, and placing the ordered key information in the position of the preset variable so as to complement the rest search requirement sentences.

5. The method according to claim 1, wherein the method further comprises:

acquiring text content in the electronic archive;

for the ordered key information, if the corresponding key attribute is provided with a preset mark, searching the text content in the electronic file according to the ordered key information to determine whether the text content in the electronic file contains content consistent with the ordered key information;

and if no such content exists, marking the key attribute or the ordered key information in the prompt word.

6. The method according to claim 1, wherein the method further comprises:

Aiming at the appointed key attribute to be extracted, if the ordered key information corresponding to the appointed key attribute cannot be extracted through the key information extraction model, determining the position coordinate of the appointed key attribute in the electronic file according to the disordered text data;

expanding the coordinate position in a preset direction of the coordinate position until reaching the coordinate position corresponding to other data in the disordered text data;

and taking the expanded position coordinates as the ordered key information of the specified key attribute.

7. The method according to claim 1, wherein the method further comprises:

storing the search result and the search process corresponding to the search result into an ES full-text search library; wherein the search result comprises: the file identifier corresponding to the electronic file, the key attributes to be extracted, the ordered key information, the prompt word, and the model identifier corresponding to the large model.

8. An electronic archive retrieval device based on dynamically constructing a prompt word, comprising:

At least one processor; and

A memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor, the instructions enabling the at least one processor to perform the electronic archive retrieval method based on dynamically constructing prompt words according to any one of claims 1 to 7.

9. A non-transitory computer storage medium storing computer-executable instructions, the computer-executable instructions being configured to perform the electronic archive retrieval method based on dynamically constructing prompt words according to any one of claims 1 to 7.