CN118113806A

CN118113806A - Interpretable event context generation method based on large-model retrieval-augmented generation

Info

Publication number: CN118113806A
Application number: CN202311790773.XA
Authority: CN
Inventors: 雋兆波; 李春豹; 何健军; 陈莹; 张剑; 高翔; 陈伟晴
Original assignee: CETC 10 Research Institute
Current assignee: CETC 10 Research Institute
Priority date: 2023-12-22
Filing date: 2023-12-22
Publication date: 2024-05-31

Abstract

The invention provides an interpretable event context generation method based on large-model retrieval-augmented generation, comprising the following steps. S1: process the text data and construct a retrieval database. S2: perform weighted-sum hybrid retrieval in the retrieval database. S3: construct a discrimination model for coreferent and shared events, and remove the coreferent and shared events from the hybrid retrieval results. S4: generate the event context with reasoning prompts, based on the hybrid retrieval results processed in step S3. S5: trace the generated event context back to its sources through semantic similarity matching, so that it is interpretable. The method has three advantages. First, the generative large model answers on the basis of retrieved candidate knowledge, which reduces model hallucination. Second, coreferent and shared events are removed by a method that combines semantics with key elements, which reduces redundancy in the candidate knowledge. Third, reasoning prompts let the large model generate event contexts for different requirements, and the generated results can be interpreted and traced.

Description

Interpretable event context generation method based on large-model retrieval-augmented generation

Technical Field

The invention relates to the field of natural language processing, and in particular to an interpretable event context generation method based on large-model retrieval-augmented generation.

Background

Event context analysis is an important means of tracing how an event arises and develops. It analyzes the interaction and temporal evolution among events, mining their correlations, temporal relations, and causal relations. Its main aim is to understand how an event develops and changes and what drives those changes, which includes identifying the key stages of the event and analyzing the connections among the stages. In the era of big data, however, data is large in scale, high in complexity, and low in quality. It has become harder to quickly mine the topic-related materials a user needs from such data, to find the shared and coreferent events among those materials, and finally to construct an event context from the cleaned and organized materials. An efficient and accurate event context analysis method based on open-source data therefore has important research value.

Conventional event context analysis proceeds through determining the topic, collecting data, sorting events, analyzing events, building a model, interpreting the model, and drawing conclusions. These methods are limited when processing large-scale and open-source data. Rule-based or template-based methods, for example, typically require rules or templates to be formulated manually, which becomes increasingly difficult as the amount of data grows. The dynamics and fluidity of events are also easily overlooked: process-event analysis can better highlight the dynamics and fluidity of social facts, yet some conventional event context analysis methods still focus mainly on the static characteristics of events and neglect the process by which events change.

Disclosure of Invention

To address the problems in the prior art, the invention provides an interpretable event context generation method based on large-model retrieval-augmented generation. According to the user's actual topic, the method recalls a candidate set of events for the topic under analysis through weighted-sum hybrid retrieval; cleans coreferent and shared events from the data with a discrimination model that combines semantics with key elements; sorts the event candidate set by time and labels it by category; builds a task-specific prompt from the retrieved content; generates the event context on demand with a generative large model; and makes the generated result interpretable and traceable. The method thereby aims to solve event context generation over complex text data.

The invention provides an interpretable event context generation method based on large-model retrieval-augmented generation, comprising the following steps:

S1: processing the text data and constructing a retrieval database;

S2: performing weighted-sum hybrid retrieval in the retrieval database;

S3: constructing a discrimination model for coreferent and shared events, using the model to identify the coreferent and shared events in the hybrid retrieval results, and removing them from the hybrid retrieval results;

S4: generating the event context with reasoning prompts, based on the hybrid retrieval results processed in step S3;

S5: tracing the event context back to its sources through semantic similarity matching, so that the event context is interpretable.

Further, step S1 includes:

cutting the text into fixed-size chunks;

summarizing the cut text into abstracts and extracting keywords from it;

vectorizing the abstracts and the cut text with the BGE embedding model;

constructing the retrieval database from the keywords, the abstract vectors, and the cut text together with its vectors.
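The sub-steps of S1 can be read as producing one database record per text chunk. The following is a minimal sketch under stated assumptions: the record layout `ChunkRecord`, the function `build_database`, and the three callables `summarize`, `extract_keywords`, and `embed` are hypothetical names introduced here for illustration. In the described method they would be backed by the LLM and the BGE embedding model, neither of which is called in this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class ChunkRecord:
    """One retrieval-database entry (field names are illustrative assumptions)."""
    article_id: str
    chunk_text: str
    chunk_vector: Vector
    summary: str
    summary_vector: Vector
    keywords: List[str]


def build_database(
    article_id: str,
    chunks: Sequence[str],
    summarize: Callable[[str], str],
    extract_keywords: Callable[[str], List[str]],
    embed: Callable[[str], Vector],
) -> List[ChunkRecord]:
    """Build one record per cut text: abstract, keywords, and both vectors."""
    records = []
    for chunk in chunks:
        summary = summarize(chunk)  # stand-in for LLM summarization
        records.append(
            ChunkRecord(
                article_id=article_id,
                chunk_text=chunk,
                chunk_vector=embed(chunk),      # stand-in for the embedding model
                summary=summary,
                summary_vector=embed(summary),
                keywords=extract_keywords(chunk),  # stand-in for LLM keyword extraction
            )
        )
    return records
```

Keeping the article identifier on every record is one way to support the later step of returning the article that a matched chunk belongs to.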

Further, step S2 includes:

Based on the retrieval database, three retrieval modes are carried out: text-to-abstract, text-to-keyword, and text-to-cut-text. Candidate events related to the topic are then screened by combining the three retrieval modes through weighted-sum hybrid retrieval. Finally, the articles that contain the matched text chunks are returned and ordered by publication time, yielding the final topic-related candidate event set.

Further, in step S2, screening the topic-related candidate events by combining the three retrieval modes through weighted-sum hybrid retrieval includes:

for the vector database built from the abstract vectors and the cut-text vectors, calculating the cosine similarity between the input topic and the data in the vector database;

preliminarily screening a topic-related candidate event set by a set threshold score;

finely screening the preliminarily screened candidate event set with the keywords, thereby obtaining the final topic-related candidate event set.

Further, step S3 includes:

vectorizing the candidate event set with the BGE embedding model;

combining the vectorized candidate events in pairs, calculating the cosine similarity of each pair, and keeping the events whose similarity exceeds a set threshold in a similar-event candidate set;

extracting key elements from the events in the similar-event candidate set, and further screening out the coreferent and shared events by comparing the key elements of similar events.

Further, step S4 includes:

event contexts of different dimensions are constructed for different requirements, so different prompts are designed according to the dimension and the candidate event set;

the prompt is taken as the input of the generative LLM to generate the event context.

Further, step S5 includes:

cutting the generated event context and the candidate event set into sentences;

vectorizing the cut sentences with the BGE embedding model;

combining the sentence vectors of the generated event context and of the candidate event set in pairs;

calculating the cosine similarity of each pair, and taking the candidate sentence with the highest similarity as the source content traced for the generated event context.

In summary, due to the adoption of the technical scheme, the beneficial effects of the invention are as follows:

First, the generative large model answers on the basis of retrieved candidate knowledge, which reduces model hallucination. Second, coreferent and shared events are removed by a method that combines semantics with key elements, which reduces redundancy in the candidate knowledge. Third, reasoning prompts let the large model generate event contexts for different requirements, and the generated results can be interpreted and traced.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the following description will briefly describe the drawings in the embodiments, it being understood that the following drawings only illustrate some embodiments of the present invention and should not be considered as limiting the scope, and that other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of the interpretable event context generation method based on large-model retrieval-augmented generation in an embodiment of the present invention.

FIG. 2 is a flow chart of constructing the retrieval database in an embodiment of the invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Examples

As shown in FIG. 1, the present embodiment proposes an interpretable event context generation method based on large-model retrieval-augmented generation, including the following steps:

S1: processing the text data and constructing a retrieval database;

In retrieval-augmented interpretable event context generation, a candidate set of topic-related events is first recalled according to the topic the user inputs, and the context is then generated from that candidate set. The text data must therefore be processed and the vector library built according to actual requirements. As shown in FIG. 2, step S1 includes the following sub-steps:

Fixed-size text cutting: let the text be denoted N and the fixed cutting length be n characters. Segments of length n are taken from N in turn; if a segment does not end on a complete sentence, the cut is made after the end of that sentence. This yields a set of roughly fixed-size text chunks [N₁, N₂, ...];

Summarizing and keyword extraction for the cut text: for each cut text, an abstract is summarized and keywords are extracted with the LLM; the extracted keywords include definite terms such as persons, times, and institutions;

Vectorizing the abstracts and the cut text with the BGE embedding model;
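The fixed-size cutting sub-step can be illustrated with a short sketch. It assumes that a "complete sentence" ends at one of a small set of punctuation marks; that set (`SENTENCE_ENDINGS`) and the function name `cut_text` are assumptions made for illustration and are not specified in the source.

```python
from typing import List

# Assumed sentence terminators (Chinese and Western); not specified in the source.
SENTENCE_ENDINGS = "。！？.!?"


def cut_text(text: str, n: int) -> List[str]:
    """Cut text into chunks of about n characters, extending each cut
    to the end of the sentence it falls in."""
    if n <= 0:
        raise ValueError("n must be positive")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + n, len(text))
        # If the cut does not land right after a sentence ending,
        # move it forward to the next sentence ending (or the end of text).
        while end < len(text) and text[end - 1] not in SENTENCE_ENDINGS:
            end += 1
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        start = end
    return chunks


if __name__ == "__main__":
    print(cut_text("One. Two three. Four.", 6))
```

Treating every "." as a sentence ending will also split on decimals and abbreviations, so a production version would likely need a proper sentence splitter.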

S2: performing weighted-sum hybrid retrieval in the retrieval database:

Based on the retrieval database, three retrieval modes are carried out: text-to-abstract, text-to-keyword, and text-to-cut-text. Candidate events related to the topic are then screened by combining the three modes through weighted-sum hybrid retrieval. For the vector database built from the abstract vectors and the cut-text vectors, the cosine similarity between the input topic and the data in the vector database is calculated. Let the abstract retrieval result α₁ have weight ω₁ and the cut-text retrieval result α₂ have weight ω₂, with ω₁ > ω₂. The weighted sum α is then:

α = ω₁α₁ + ω₂α₂

In this way, the topic-related candidate event set whose score exceeds the set threshold is obtained. Vector retrieval captures semantics well but is weak at recognizing keywords. The constructed keyword database is therefore used to finely screen the candidate event set obtained by semantic retrieval, by matching the keywords in the input topic. Finally, the articles that contain the matched text chunks are returned and ordered by publication time, yielding the final topic-related candidate event set.
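Step S2 as a whole can be sketched as follows, assuming the weighted sum is the plain linear combination ω₁·α₁ + ω₂·α₂ of the two cosine similarities. The record keys, the default weights 0.6 and 0.4, the threshold 0.5, and the rule of keeping a record when it shares at least one keyword with the topic are all illustrative assumptions; the source only requires ω₁ > ω₂ and a set threshold.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def hybrid_search(
    query_vec: Sequence[float],
    query_keywords: Sequence[str],
    records: Sequence[Dict],
    w_summary: float = 0.6,  # assumed weight for abstract similarity
    w_chunk: float = 0.4,    # assumed weight for cut-text similarity
    threshold: float = 0.5,  # assumed score threshold
) -> List[str]:
    """Return article ids that pass both screens, ordered by publication time."""
    wanted = set(query_keywords)
    best: Dict[str, Dict] = {}
    for rec in records:
        score = (w_summary * cosine(query_vec, rec["summary_vector"])
                 + w_chunk * cosine(query_vec, rec["chunk_vector"]))
        if score <= threshold:
            continue  # preliminary screening by weighted semantic score
        if not wanted & set(rec["keywords"]):
            continue  # fine screening: must share a keyword with the topic
        best[rec["article_id"]] = rec  # go back from chunk to article
    # ISO-formatted dates sort chronologically as plain strings.
    ordered = sorted(best.values(), key=lambda r: r["published"])
    return [r["article_id"] for r in ordered]
```

The sketch keeps one entry per article, since several chunks of the same article may match the topic.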

S3: constructing a discrimination model for coreferent and shared events, using the model to identify the coreferent and shared events in the hybrid retrieval results, and removing them from the hybrid retrieval results:

The topic-related candidate event set obtained by retrieval contains some coreferent and shared events, so a discrimination model must be constructed to remove them from the candidate event set. In this embodiment, the discrimination model identifies coreferent and shared events by combining semantics with key elements. Specifically:

vectorizing the candidate event set with the BGE embedding model;

The vectorized candidate events are combined in pairs, expressed as [[a₁, a₂], [a₁, a₃], ...]; the cosine similarity of each pair is calculated, and the events whose similarity exceeds a set threshold are kept in a similar-event candidate set;

Key elements are extracted from the events in the similar-event candidate set with the generative LLM, and the key elements of similar events (subject, object, time, place, and the like) are compared to further screen the coreferent and shared events; if the key elements are the same, the events can be judged to be coreferent or shared events.
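The two-stage discrimination described above (semantic similarity first, key-element comparison second) can be sketched as follows. The event vectors and key-element dictionaries are taken as given inputs, standing in for the embedding model and the LLM extraction; the function names, the 0.9 threshold, and the policy of keeping the earlier event of each matched pair are assumptions made for illustration.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def find_duplicate_pairs(
    vectors: Sequence[Sequence[float]],
    elements: Sequence[Dict[str, str]],
    threshold: float = 0.9,  # assumed similarity threshold
) -> List[Tuple[int, int]]:
    """Index pairs that are semantically similar AND share all key elements."""
    pairs = []
    for i, j in combinations(range(len(vectors)), 2):
        if cosine(vectors[i], vectors[j]) <= threshold:
            continue  # stage 1: not similar enough to be a candidate pair
        if elements[i] == elements[j]:
            pairs.append((i, j))  # stage 2: same subject, object, time, place
    return pairs


def remove_duplicates(
    events: Sequence[str],
    vectors: Sequence[Sequence[float]],
    elements: Sequence[Dict[str, str]],
    threshold: float = 0.9,
) -> List[str]:
    """Drop the later event of every duplicate pair (an assumed policy)."""
    drop = {j for _, j in find_duplicate_pairs(vectors, elements, threshold)}
    return [e for k, e in enumerate(events) if k not in drop]
```

Comparing all pairs costs O(m²) similarity computations for m candidate events, which is acceptable for a retrieved candidate set but would need an index for large collections.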

S4: generating the event context with reasoning prompts, based on the hybrid retrieval results:

Different requirements call for event contexts of different dimensions. Because a generative LLM can follow instructions, different prompts can be designed for the retrieved candidate event set according to the required dimension, for example: "Organize the event context of the xxx event in chronological order according to the following article content. Article content: xxx." The prompt is taken as the input of the generative LLM to generate the event context.
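Prompt construction for different dimensions can be sketched with a small template table. Only the chronological template paraphrases the example given in the text; the "causal" template, the dimension names, and the `[i]` numbering of articles are hypothetical additions for illustration. No LLM is called here; the returned string is what would be passed to the generative model.

```python
from typing import Dict, Sequence

PROMPT_TEMPLATES: Dict[str, str] = {
    # Paraphrase of the example prompt in the text.
    "chronological": (
        "Organize the event context of the {topic} event in chronological "
        "order according to the following article content.\n"
        "Article content:\n{articles}"
    ),
    # Hypothetical second dimension, not taken from the source.
    "causal": (
        "Organize the event context of the {topic} event by cause and effect "
        "according to the following article content.\n"
        "Article content:\n{articles}"
    ),
}


def build_prompt(topic: str, articles: Sequence[str],
                 dimension: str = "chronological") -> str:
    """Fill the template for the requested dimension with the candidate events."""
    if dimension not in PROMPT_TEMPLATES:
        raise ValueError(f"unknown dimension: {dimension}")
    # Number the articles so generated statements can be related back to them.
    numbered = "\n".join(f"[{i}] {a}" for i, a in enumerate(articles, start=1))
    return PROMPT_TEMPLATES[dimension].format(topic=topic, articles=numbered)


if __name__ == "__main__":
    print(build_prompt("flood", ["Rain began.", "River rose."]))
```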

S5: tracing the event context back to its sources through semantic similarity matching:

To keep the generated result interpretable, this embodiment traces the generated context back to its sources through semantic similarity matching. Specifically:

The cut sentences are vectorized with the BGE embedding model; the sentence vectors of the cut event context are expressed as [s₁, s₂, ...], and the sentence vectors of the cut candidate event set are expressed as [w₁, w₂, ...];

The sentence vectors of the generated event context and of the candidate event set are combined in pairs, expressed as [[s₁, w₁], [s₁, w₂], [s₂, w₁], [s₂, w₂], ...];
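The pairwise matching can be sketched as follows, following the rule stated for step S5 in the disclosure that the candidate sentence with the highest cosine similarity is taken as the traced source. The function name `trace_sources` and the return format are assumptions; the sentence vectors are taken as given inputs rather than computed with an embedding model.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def trace_sources(
    context_vecs: Sequence[Sequence[float]],  # s_1, s_2, ... (generated sentences)
    source_vecs: Sequence[Sequence[float]],   # w_1, w_2, ... (candidate sentences)
) -> List[Tuple[int, float]]:
    """For each s_i, return (index of the most similar w_j, that similarity)."""
    if not source_vecs:
        raise ValueError("source_vecs must not be empty")
    traced = []
    for s in context_vecs:
        sims = [cosine(s, w) for w in source_vecs]  # all pairs [s_i, w_j]
        best = max(range(len(sims)), key=sims.__getitem__)
        traced.append((best, sims[best]))
    return traced
```

Returning the similarity alongside the index lets a caller flag generated sentences whose best match is weak, which may indicate content not grounded in the candidate set.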

The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. An interpretable event context generation method based on large-model retrieval-augmented generation, comprising the following steps:

S1: processing the text data and constructing a retrieval database;

2. The interpretable event context generation method based on large-model retrieval-augmented generation according to claim 1, wherein step S1 includes:

cutting the text into fixed-size chunks;

summarizing the cut text into abstracts and extracting keywords from it;

vectorizing the abstracts and the cut text with the BGE embedding model;

3. The interpretable event context generation method based on large-model retrieval-augmented generation according to claim 2, wherein step S2 includes:

4. The interpretable event context generation method based on large-model retrieval-augmented generation according to claim 3, wherein in step S2, screening the topic-related candidate events by combining the three retrieval modes through weighted-sum hybrid retrieval includes:

5. The interpretable event context generation method based on large-model retrieval-augmented generation according to claim 4, wherein step S3 includes:

vectorizing the candidate event set with the BGE embedding model;

6. The interpretable event context generation method based on large-model retrieval-augmented generation according to claim 5, wherein step S4 includes:

7. The interpretable event context generation method based on large-model retrieval-augmented generation according to claim 6, wherein step S5 includes: