CN115329048A - Statement retrieval method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN115329048A (application number CN202211077618.9A)
- Authority
- CN
- China
- Prior art keywords
- document
- sentence
- sentences
- statement
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3338—Query expansion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/338—Presentation of query results
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Human Computer Interaction (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application discloses a sentence retrieval method and device, electronic equipment and a storage medium, which can be applied to the field of big data or finance, wherein the method comprises the following steps: acquiring a retrieval statement input by a user; searching whether a document sentence consistent with the retrieval sentence exists in the target library; if the document sentences which are consistent with the retrieval sentences are found out, determining the document sentences which are consistent with the retrieval sentences as target document sentences; if the consistent document sentences are not found out, segmenting the search sentences to obtain a search word set; matching the retrieval word set with the word sets corresponding to the document sentences to obtain matching results, and selecting at least one document sentence from the document sentences to determine the document sentence as a target document sentence based on the matching results; splicing all document sentences of paragraphs where the target document sentences are located respectively for each target document sentence to obtain a target text corresponding to the target document sentence; and feeding back each target document sentence and the corresponding target text to the user.
Description
Technical Field
The present application relates to the field of data retrieval technologies, and in particular, to a method and an apparatus for sentence retrieval, an electronic device, and a storage medium.
Background
In the current big data age, the importance of data is increasing. In the use process of the data, the most important is the retrieval of the data.
The main data retrieval methods at present are the character matching method and the full-text matching method. The character matching method matches characters one by one against the search condition input by the user, determines a search result from the matching result, and feeds it back. The full-text matching method segments the input sentence into words and then performs similarity calculation on the segmented words to determine and feed back the retrieval result.
However, the character matching method is inefficient, and the full-text matching method gives poorly accurate results when the sentence is too long or too short.
Disclosure of Invention
Based on the defects of the prior art, the application provides a statement retrieval method and device, electronic equipment and a storage medium, so as to solve the problems of low efficiency or insufficient accuracy of the prior art.
In order to achieve the above object, the present application provides the following technical solutions:
the first aspect of the present application provides a method for sentence retrieval, including:
acquiring a retrieval statement input by a user;
searching whether a document sentence consistent with the retrieval sentence exists in a target library; the document sentences are obtained by splitting each document in advance;
if the document sentence consistent with the retrieval sentence is found out, determining the document sentence consistent with the retrieval sentence as a target document sentence;
if the document sentences consistent with the retrieval sentences are not found out, segmenting the retrieval sentences to obtain a retrieval word set;
matching the retrieval word set with the word sets corresponding to the document sentences to obtain matching results, and selecting at least one document sentence from the document sentences to determine the document sentence as a target document sentence based on the matching results;
respectively splicing all the document sentences of the paragraphs where the target document sentences are located aiming at all the target document sentences to obtain target texts corresponding to the target document sentences;
and feeding back each target document sentence and the corresponding target text to the user.
Optionally, in the above statement retrieval method, the method further includes:
based on each business process, connecting a plurality of documents in series;
merging the continuous documents of which the common use frequency exceeds a preset frequency threshold;
splitting each document to obtain each document sentence corresponding to each document;
recording related item information of each document statement of each document; wherein, the related item information at least comprises the number of the paragraph to which the related item belongs and the sequence number of the sentence;
storing each document statement to a document area of an index library according to the document to which the document statement belongs, and storing each document statement and related item information thereof to an index area of the index library;
establishing a mapping relation between each document and each document statement in the index area;
respectively segmenting words of each document sentence to obtain a word set corresponding to each document sentence;
and determining the keywords of each document sentence from the word set corresponding to each document sentence, and storing the keywords of each document sentence into a word list.
Optionally, in the sentence retrieval method, the splicing, for each target document sentence, each document sentence of a paragraph where the target document sentence is located to obtain a target text corresponding to the target document sentence, includes:
for each target document sentence, finding the document sentences whose paragraph number is the same as that of the target document sentence;
and splicing the found document sentences in order of their sentence sequence numbers to obtain the target text corresponding to the target document sentence.
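The splicing step above can be illustrated with a minimal Python sketch. The `DocSentence` record and the `splice_paragraph` function are hypothetical names introduced here; the source only states that sentences with the same paragraph number are found and spliced by sentence sequence number. Matching on the document identifier as well, and joining with a single space, are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocSentence:
    # Hypothetical record layout for one document sentence.
    doc_id: str
    paragraph_no: int   # number of the paragraph the sentence belongs to
    sentence_no: int    # sequence number of the sentence in that paragraph
    text: str


def splice_paragraph(target, index):
    """Rebuild the paragraph that contains the target document sentence."""
    # Collect every sentence with the same paragraph number
    # (restricted to the same document, which is an assumption).
    same_paragraph = [
        s for s in index
        if s.doc_id == target.doc_id and s.paragraph_no == target.paragraph_no
    ]
    # Order by sentence sequence number, then splice.
    same_paragraph.sort(key=lambda s: s.sentence_no)
    return " ".join(s.text for s in same_paragraph)
```

The returned string plays the role of the target text that is fed back together with the target document sentence.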
Optionally, in the statement retrieval method, the matching the retrieval word set with the word set corresponding to each document statement to obtain a matching result, and selecting at least one document statement from each document statement based on the matching result to determine the document statement as a target document statement includes:
matching the retrieval word set with the word sets corresponding to the document sentences to obtain a first matching result;
adding the document sentences corresponding to the word sets with the maximum matching degree in the first matching result into a first sentence set;
if the first matching result meets the expansion condition, expanding the word set and/or the retrieval word set corresponding to each document sentence by using an expansion word;
matching the word set corresponding to each expanded document sentence with the retrieval word set for multiple times to obtain multiple second matching results;
adding the document sentences corresponding to the word sets with the maximum matching degree in each second matching result into a second sentence set;
if the sum of the number of the document sentences corresponding to the word sets with the matching degrees larger than the preset matching degree in each second matching result is larger than the preset number, determining a plurality of target popular documents based on the keywords in the search sentences;
performing secondary expansion on the search word set by using each hot word in each target hot document to obtain a plurality of search expansion word sets;
matching the retrieval expansion word set with the word set corresponding to each document statement respectively to obtain a third matching result corresponding to the retrieval expansion word set;
adding the document sentences corresponding to the word sets with the maximum matching degree in the third matching results into a third sentence set;
determining an intersection of the first statement set, the second statement set and the third statement set;
and determining each document statement in the intersection as the target document statement.
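The final steps of this optional scheme (collecting the best match of each matching round and intersecting the three sets) can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, each matching result is assumed to be a dictionary from sentence id to matching degree, and ties at the maximum are assumed to all be kept.

```python
def best_sentences(matching_result):
    """Return the ids of the sentences that share the maximum matching degree."""
    if not matching_result:
        return set()
    best = max(matching_result.values())
    return {sid for sid, degree in matching_result.items() if degree == best}


def intersect_candidates(first_result, second_results, third_results):
    """Intersect the first, second and third sentence sets.

    first_result   -- one matching result (dict: sentence id -> degree)
    second_results -- list of matching results after the first expansion
    third_results  -- list of matching results after the secondary expansion
    """
    first_set = best_sentences(first_result)
    second_set = set().union(*(best_sentences(r) for r in second_results))
    third_set = set().union(*(best_sentences(r) for r in third_results))
    return first_set & second_set & third_set
```

Taking the intersection keeps only the sentences that every round agrees on, which is why the result can be smaller than any single round's candidate set.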
Optionally, in the above sentence retrieval method, before the searching whether a document sentence consistent with the retrieval sentence exists in the target library, the method further includes:
judging whether the retrieval sentences input by the user belong to current continuous high-density similar search sentences or not;
if the retrieval sentence input by the user does not belong to the current continuous high-density similar search sentences, determining an index library as the target library; wherein the index library includes the full set of document sentences;
if the retrieval sentence input by the user belongs to the current continuous high-density similar search sentences, determining the current exclusive library as the target library; wherein the exclusive library initially contains the full set of document sentences, and after each search with a current high-density similar search sentence, the target document sentences found are removed from the exclusive library.
A second aspect of the present application provides an apparatus for sentence retrieval, including:
the acquisition unit is used for acquiring a retrieval statement input by a user;
the first searching unit is used for searching whether a document sentence consistent with the retrieval sentence exists in a target library; the document sentences are obtained by splitting each document in advance;
a first determination unit configured to determine, when a document sentence that is consistent with the search sentence is found, the document sentence that is consistent with the search sentence as a target document sentence;
the first word segmentation unit is used for segmenting the search sentences to obtain a search word set when the document sentences consistent with the search sentences are not searched out;
the matching unit is used for matching the retrieval word set with the word sets corresponding to the document sentences to obtain matching results, and selecting at least one document sentence from each document sentence to be determined as a target document sentence based on the matching results;
the splicing unit is used for splicing the document sentences of the paragraphs where the target document sentences are located respectively aiming at each target document sentence to obtain a target text corresponding to the target document sentence;
and the feedback unit is used for feeding back each target document statement and the corresponding target text to the user.
Optionally, the above sentence retrieval apparatus further includes:
the series unit is used for connecting a plurality of documents in series based on each business process;
the merging unit is used for merging the continuous documents of which the common use frequency exceeds a preset frequency threshold;
the splitting unit is used for splitting each document to obtain each document statement corresponding to each document;
the recording unit is used for recording related item information of each document statement of each document; wherein, the related item information at least comprises the number of the paragraph to which the related item belongs and the sequence number of the sentence;
the storage unit is used for storing each document statement to a document area of an index library according to the document to which the document statement belongs, and storing each document statement and related item information thereof to an index area of the index library;
the establishing unit is used for establishing the mapping relation between each document and each document statement in the index area;
the second word segmentation unit is used for segmenting each document sentence to obtain a word set corresponding to each document sentence;
and the keyword processing unit is used for determining the keywords of the document sentences from the word sets corresponding to the document sentences and storing the keywords of the document sentences into a word list.
Optionally, in the apparatus for sentence searching, the splicing unit includes:
the second searching unit is used for respectively searching each document sentence which is consistent with the affiliated paragraph number of the target document sentence based on the target document sentence;
and the splicing subunit is used for splicing the searched document sentences according to the sentence sequence numbers of the document sentences to obtain the target text corresponding to the target document sentences.
Optionally, in the apparatus for sentence searching described above, the matching unit includes:
the first matching unit is used for matching the retrieval word set with the word sets corresponding to the document sentences to obtain a first matching result;
a first adding unit, configured to add the document statement corresponding to the word set with the largest matching degree in the first matching result to a first statement set;
the first expansion unit is used for expanding the word set and/or the retrieval word set corresponding to each document sentence by using expansion words when the first matching result meets the expansion condition;
the second matching unit is used for matching the word set corresponding to each expanded document sentence with the search word set for multiple times to obtain multiple second matching results;
the second adding unit is used for adding the document sentences corresponding to the word sets with the maximum matching degree in each second matching result into a second sentence set;
a document determining unit, configured to determine, when the sum of the number of document sentences corresponding to the word set whose matching degree is greater than the preset matching degree in each second matching result is greater than the preset number, a plurality of target popular documents based on the keywords in the search sentences;
the second expansion unit is used for carrying out secondary expansion on the retrieval word set by utilizing each hot word in each target hot document to obtain a plurality of retrieval expansion word sets;
the third matching unit is used for matching the retrieval expansion word set with the word set corresponding to each document sentence respectively to obtain a third matching result corresponding to the retrieval expansion word set;
a third adding unit, configured to add the document statement corresponding to the word set with the largest matching degree in each third matching result to a third statement set;
the intersection operation unit is used for determining the intersection of the first statement set, the second statement set and the third statement set;
a second determining unit, configured to determine each document statement in the intersection as the target document statement.
Optionally, the above sentence retrieval apparatus further includes:
a judging unit, configured to judge whether the search statement input by the user belongs to a current continuous high-density similar search statement;
a third determining unit, configured to determine an index library as the target library when it is determined that the retrieval sentence input by the user does not belong to the current continuous high-density similar search sentences; wherein the index library includes the full set of document sentences;
a fourth determining unit, configured to determine the current exclusive library as the target library when it is determined that the retrieval sentence input by the user belongs to the current continuous high-density similar search sentences; wherein the exclusive library initially contains the full set of document sentences, and after each search with a current high-density similar search sentence, the target document sentences found are removed from the exclusive library.
A third aspect of the present application provides an electronic device comprising:
a memory and a processor;
wherein the memory is used for storing programs;
the processor is configured to execute the program, and when the program is executed, the program is specifically configured to implement the method for sentence retrieval as described in any of the above.
A fourth aspect of the present application provides a computer storage medium for storing a computer program which, when executed, is adapted to implement a method of sentence retrieval as described in any of the above.
The application provides a sentence retrieval method, which divides a document into a plurality of document sentences in advance. In the searching process, a searching sentence input by a user is obtained. And then searching whether a document statement consistent with the retrieval statement exists in the target library. And if the document sentence consistent with the retrieval sentence is found, determining the document sentence consistent with the retrieval sentence as the target document sentence. And if the document sentences consistent with the search sentences are not found, performing word segmentation on the search sentences to obtain a search word set. And then matching the retrieval word set with the word sets corresponding to the document sentences to obtain matching results, and selecting at least one document sentence from the document sentences to determine the document sentence as a target document sentence based on the matching results. And respectively splicing the document sentences of the paragraphs where the target document sentences are located aiming at each target document sentence to obtain a target text corresponding to the target document sentence. And finally, feeding back each target document sentence and the corresponding target text to the user. Therefore, the retrieval is directly carried out by using the sentences, the retrieval efficiency is improved, and when the sentences cannot be directly retrieved, the retrieval is carried out by adopting a word segmentation matching mode, so that the retrieval accuracy is effectively improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flowchart of a sentence retrieval method provided in an embodiment of the present application;
FIG. 2 is a flowchart of a document preprocessing method provided by an embodiment of the present application;
FIG. 3 is a flowchart of a method for matching target document statements according to an embodiment of the present disclosure;
FIG. 4 is a flowchart of a document sentence merging method according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a sentence retrieval apparatus according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In this application, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.
The embodiment of the application provides a sentence retrieval method, as shown in fig. 1, including the following steps:
S101, acquiring a search statement input by a user.
Specifically, a user inputs a statement at the front end according to the requirement and clicks for searching, and at the moment, the system acquires the statement and takes the statement as a retrieval statement to execute the subsequent steps.
S102, searching whether a document sentence consistent with the retrieval sentence exists in the target library.
The document sentences are obtained by splitting each document in advance.
It should be noted that, in the embodiment of the present application, each document is split into a plurality of statements in advance, so that a document statement corresponding to each document is obtained and stored in the index repository. The index library can be used as a target library to find out whether a document sentence consistent with the retrieval sentence exists.
If the document sentence matching the search sentence is found, step S103 may be executed. Since the user may not be able to accurately input the sentence to be searched, there may be a case where the document sentence corresponding to the searched sentence is not found, and step S104 needs to be executed.
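The branch between step S103 and step S104 can be sketched as below. This is a minimal sketch under stated assumptions: the target library is modelled as a dictionary from document sentence text to a list of sentence ids, the function names are hypothetical, and stripping surrounding whitespace before the comparison is an added assumption rather than something the source specifies.

```python
def exact_lookup(retrieval_sentence, target_library):
    """Return ids of document sentences identical to the retrieval sentence.

    target_library maps document sentence text to a list of sentence ids
    (a hypothetical layout; the source does not specify one).
    """
    return list(target_library.get(retrieval_sentence.strip(), []))


def next_step(retrieval_sentence, target_library):
    """Decide which step follows S102 and return any exact hits."""
    hits = exact_lookup(retrieval_sentence, target_library)
    if hits:
        return "S103", hits   # a consistent document sentence exists
    return "S104", []         # fall back to word segmentation and matching
```

Using a hash lookup for the exact comparison keeps the common case cheap, and the word-set matching is only reached when no identical sentence is stored.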
Of course, optionally, in order to improve the efficiency of the retrieval, in another embodiment of the present application, the following is further performed before step S102:
judging whether the retrieval sentence input by the user belongs to the current continuous high-density similar search sentences.
It should be noted that the search sentence input by the user may not be accurate enough, so the obtained result may not be what the user intended. The user may then repeatedly modify the search sentence and search again, and thus frequently input similar search sentences within a short time.
Therefore, the retrieval sentences input by the user in the current period can be recorded, and whether the currently input search sentences belong to the current continuous high-density similar search sentences can be judged by comparison.
And if the retrieval statement input by the user does not belong to the current continuous high-density similar search statement, determining the index library as the target library.
The index library includes the full set of document sentences.
And if the retrieval statement input by the user belongs to the current continuous high-density similar search statement, determining the current exclusive library as the target library.
The exclusive library initially contains the full set of document sentences, and after each search with a current high-density similar search sentence, every target document sentence found is removed from it. When the user performs a similar search again, this indicates that the target document sentences found previously are not what the user needs. Removing them reduces the number of document sentences to be searched in the next search and improves the search efficiency.
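One possible way to choose the target library is sketched below. The source does not say how "continuous high-density similar" is judged, so everything here is an assumption made for illustration: the class name, the 60-second window, the 0.6 similarity threshold, the use of `difflib.SequenceMatcher` as the similarity measure, and resetting the exclusive library whenever a burst of similar searches ends.

```python
import difflib


class TargetLibrarySelector:
    """Pick the index library or the exclusive library for each search."""

    def __init__(self, index_library, window_seconds=60.0, similarity=0.6):
        self.index_library = set(index_library)      # full set of sentence ids
        self.exclusive_library = set(index_library)  # shrinks during a burst
        self.window_seconds = window_seconds          # assumed burst window
        self.similarity = similarity                  # assumed threshold
        self.history = []                             # (timestamp, query) pairs

    def _is_dense_similar(self, query, now):
        # A query counts as part of a burst if a similar query was seen recently.
        recent = [q for t, q in self.history if now - t <= self.window_seconds]
        return any(
            difflib.SequenceMatcher(None, query, q).ratio() >= self.similarity
            for q in recent
        )

    def select(self, query, now):
        """Return the library to search for this query."""
        dense = self._is_dense_similar(query, now)
        self.history.append((now, query))
        if dense:
            return set(self.exclusive_library)
        # A new topic starts: restore the exclusive library (assumption).
        self.exclusive_library = set(self.index_library)
        return set(self.index_library)

    def record_results(self, target_sentence_ids):
        """Remove sentences already shown to the user from the exclusive library."""
        self.exclusive_library -= set(target_sentence_ids)
```

A caller would invoke `select` before step S102 and `record_results` after the target document sentences have been fed back, so that a follow-up similar query searches a smaller library.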
Optionally, a method for preprocessing a document provided in another embodiment of the present application is shown in fig. 2, and includes:
S201, based on each business process, connecting a plurality of documents in series.
It should be noted that, when a user handles a service, there is a case where after one service is handled, another service must be handled, so documents used by the two services are generally used together, and therefore can be handled as a whole.
S202, merging the continuous documents with the common use frequency exceeding a preset frequency threshold value.
Specifically, documents that are consecutive and whose common use frequency exceeds the preset frequency threshold are treated as associated documents, and the other documents are treated as common documents. Associated documents need to be stored together after being split into sentences, so they are first merged into one document. Each common document is split and stored on its own.
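Steps S201 and S202 can be illustrated with a short sketch. It assumes that a business process is given as an ordered chain of document ids and that the common use frequency of each consecutive pair has already been counted; the function name and the data layout are hypothetical and are not taken from the source.

```python
def merge_chain(chain, co_use_frequency, threshold):
    """Group consecutive documents whose common use frequency exceeds the threshold.

    chain            -- document ids in business-process order
    co_use_frequency -- dict mapping (previous_doc, next_doc) to a count
    threshold        -- preset frequency threshold
    Returns a list of groups; a group with several ids is one merged
    (associated) document, a group with one id is a common document.
    """
    groups = []
    for doc in chain:
        if groups and co_use_frequency.get((groups[-1][-1], doc), 0) > threshold:
            groups[-1].append(doc)   # used together often enough: merge
        else:
            groups.append([doc])     # start a new group
    return groups
```

For example, with the chain `["A", "B", "C", "D"]`, frequencies of 12 for (A, B), 2 for (B, C) and 30 for (C, D), and a threshold of 10, the sketch yields the groups `[["A", "B"], ["C", "D"]]`.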
S203, splitting each document to obtain each document sentence corresponding to each document.
And S204, recording related item information of each document statement of each document.
The related item information at least comprises the number of the paragraph to which the related item belongs and the sentence sequence number.
To make it easier to subsequently obtain the sentences that belong to the same paragraph as a target document sentence and to splice them, the embodiment of the present application also records information such as the paragraph number and the sentence sequence number of each document sentence.
S205, storing each document sentence to a document area of the index library according to the document to which the document sentence belongs, and storing each document sentence and related item information thereof to the index area of the index library.
S206, establishing a mapping relation between each document and each document statement in the index area.
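Steps S203 to S206 can be sketched as follows. The sketch assumes that paragraphs are separated by blank lines and that sentences end at common Chinese or Western sentence punctuation; both rules, as well as the function names and the dictionary layout of a record, are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed sentence boundary: text up to and including one end punctuation mark.
SENTENCE_PATTERN = re.compile(r"[^。！？.!?]+[。！？.!?]?")


def split_document(doc_id, text):
    """Split a document into sentence records with related item information."""
    records = []
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    for paragraph_no, paragraph in enumerate(paragraphs, start=1):
        sentences = [s.strip() for s in SENTENCE_PATTERN.findall(paragraph)]
        sentences = [s for s in sentences if s]
        for sentence_no, sentence in enumerate(sentences, start=1):
            records.append({
                "doc_id": doc_id,
                "paragraph_no": paragraph_no,   # number of the owning paragraph
                "sentence_no": sentence_no,     # sequence number in the paragraph
                "text": sentence,
            })
    return records


def build_index(documents):
    """Build the index area and the document-to-sentence mapping."""
    index_area = []
    doc_to_sentences = defaultdict(list)
    for doc_id, text in documents.items():
        for record in split_document(doc_id, text):
            doc_to_sentences[doc_id].append(len(index_area))
            index_area.append(record)
    return index_area, dict(doc_to_sentences)
```

The mapping stores, for each document, the positions of its sentences in the index area, which is one simple way to realise the relation described in S206.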
And S207, performing word segmentation on each document sentence respectively to obtain a word set corresponding to each document sentence.
Because the word sets are used for matching later, each document sentence is segmented into words, and the words obtained from a document sentence form the word set corresponding to that document sentence.
Optionally, sensitive words and stop words may be filtered out of the document sentence before word segmentation to ensure data security.
S208, determining the keywords of each document sentence from the word set corresponding to each document sentence, and storing the keywords of each document sentence into a word list.
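Steps S207 and S208 can be sketched as below. A real system for Chinese text would use a dedicated word segmenter; the regular-expression tokenizer here is only a stand-in. How keywords are chosen is not specified in the source, so treating every word outside a small stop-word set as a keyword, and storing the word list as a mapping from keyword to sentence ids, are assumptions of this sketch.

```python
import re

# Hypothetical stop-word set used only for this illustration.
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "is", "in"}


def segment(sentence):
    """Stand-in word segmentation: lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"\w+", sentence.lower()))


def build_word_sets(sentences):
    """Map each sentence id to the word set of its sentence."""
    return {sid: segment(text) for sid, text in sentences.items()}


def build_word_list(word_sets):
    """Store keywords in a word list: keyword -> ids of sentences containing it."""
    word_list = {}
    for sid, words in word_sets.items():
        for word in words - STOP_WORDS:
            word_list.setdefault(word, set()).add(sid)
    return word_list
```

With such a word list, the sentences that share a keyword with the query can be found without scanning every word set.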
It should be noted that, since there may be a change in the subsequent service, there may also be a change in the usage of the document by the user. The timer can continuously record the use condition of the document, and rearrange and optimize the data of the document area and the index area according to the use condition.
S103, determining the document sentence consistent with the search sentence as a target document sentence.
Since the target document sentence has already been obtained at this time, step S106 can be directly executed.
And S104, performing word segmentation on the retrieval sentence to obtain a retrieval word set.
Because the retrieval sentence input by the user is not accurate enough, the sentence to be retrieved cannot be found directly. The retrieval sentence is therefore segmented into words and matched by vocabulary, so that the sentences the user most likely wants to retrieve are obtained and returned.
And S105, matching the retrieval word set with the word sets corresponding to the document sentences to obtain matching results, and selecting at least one document sentence from the document sentences to determine the document sentence as the target document sentence based on the matching results.
Specifically, the search word set may be matched with the word set corresponding to each document sentence to obtain a matching degree between the search word set and the word set corresponding to each document sentence, then at least one document sentence is selected based on each matching degree to be determined as the target document sentence, and then step S106 is executed.
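The source does not define how the matching degree is computed. The sketch below uses the Jaccard coefficient between two word sets as one plausible choice; the function names and the `top_k` parameter are likewise hypothetical.

```python
def matching_degree(search_words, sentence_words):
    """Jaccard coefficient of two word sets (an assumed matching measure)."""
    if not search_words or not sentence_words:
        return 0.0
    shared = len(search_words & sentence_words)
    total = len(search_words | sentence_words)
    return shared / total


def select_targets(search_words, word_sets, top_k=1):
    """Return up to top_k sentence ids with the highest non-zero matching degree."""
    scored = sorted(
        ((matching_degree(search_words, words), sid) for sid, words in word_sets.items()),
        key=lambda pair: (-pair[0], pair[1]),  # best degree first, ties by id
    )
    return [sid for degree, sid in scored[:top_k] if degree > 0]
```

Sentences that share no word with the search word set are never returned, even when `top_k` is larger than the number of real matches.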
Optionally, in another embodiment of the present application, a specific implementation manner of step S105, as shown in fig. 3, includes:
S301, matching the retrieval word set with the word sets corresponding to the document sentences to obtain a first matching result.
S302, adding the document sentence corresponding to the word set with the maximum matching degree in the first matching result into the first sentence set.
S303, judging whether the first matching result meets the expansion condition.
It should be noted that, because the search sentence may not be accurate enough, the accuracy of the first matching result may be poor. The word sets corresponding to the document sentences and/or the search word set can therefore be expanded so that the two agree more closely and the obtained result is more accurate.
Optionally, the expansion condition may be that the length of the search sentence is greater than the length of any one of the document sentences, that is, the maximum matching degree in the first matching result is less than a preset value; or that the number of document sentences corresponding to word sets whose matching degree in the first matching result is greater than the preset matching degree is greater than a preset number.
It should be noted that, if the search sentence is too long, it differs markedly from the document sentences, the matching degrees in the obtained matching result are low, and the document sentence meeting the user's requirement cannot be accurately determined. If the search sentence is too short, it has a high matching degree with a large number of document sentences, so it cannot be determined which of them best meets the user's requirement and should be fed back to the user. If the first matching result satisfies the expansion condition, step S304 is executed. If the first matching result does not satisfy the expansion condition, the current matching result already meets the requirement, so step S312 can be executed directly.
S304, expanding the word set and/or the search word set corresponding to each document sentence by utilizing the expansion words.
Optionally, either the word sets corresponding to the document sentences or the search word set may be expanded, or both may be expanded at the same time. It should be noted that expansion means extending a word set so as to obtain further word sets. Words can thus be added to a word set to obtain a new word set that makes the sentence more complete. Alternatively, words whose wording is not standardized may be replaced with words that have a unified expression.
Optionally, in another embodiment of the present application, when the expansion condition that the length of the search sentence is greater than the length of any one document sentence is satisfied, the word set corresponding to each document sentence is expanded.
If the number of document sentences corresponding to word sets whose matching degree in the first matching result is greater than the preset matching degree is greater than the preset number, the search word set is expanded.
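The check in step S303 and the choice of what to expand in step S304 can be sketched together. This is only an illustration under stated assumptions: the threshold names and default values (`min_best`, `min_degree`, `max_hits`) are hypothetical, and the matching degrees are taken as already computed.

```python
def expansion_target(degrees, min_best=0.3, min_degree=0.5, max_hits=3):
    """Decide what to expand, given the matching degree of each document sentence.

    Returns "document" when even the best match is weak (search sentence too long),
    "search" when too many sentences match well (search sentence too short),
    and None when the expansion condition is not met.
    """
    if not degrees:
        return None
    if max(degrees.values()) < min_best:
        return "document"
    hits = sum(1 for degree in degrees.values() if degree > min_degree)
    if hits > max_hits:
        return "search"
    return None


# Hypothetical first matching results.
print(expansion_target({"s1": 0.1, "s2": 0.2}))
print(expansion_target({"s1": 0.9, "s2": 0.8, "s3": 0.7, "s4": 0.6}))
```

A return value of `None` corresponds to skipping the expansion steps and going directly to step S312.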
S305, matching the word set corresponding to each expanded document sentence with the search word set for multiple times to obtain multiple second matching results.
Since the extension words that can be used for the extension may include a plurality of words, the word set and/or the search word set corresponding to each document sentence may be extended a plurality of times. And, after each expansion, one matching is performed, and the matching result of each time is one second matching result, so that there may be a plurality of second matching results.
S306, adding the document sentences corresponding to the word sets with the maximum matching degree in each second matching result into the second sentence sets.
In the embodiment of the application, the document statement corresponding to the word set with the maximum matching degree is selected as the target statement, so that the document statement corresponding to the word set with the maximum matching degree is added to the second statement set.
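Steps S305 and S306 can be sketched as a loop that matches once per expansion. The sketch assumes that only the search word set is expanded (one of the options mentioned above) and that the matching degree is Jaccard overlap; `jaccard`, `second_matching_results` and `best_sentences` are hypothetical names.

```python
def jaccard(a, b):
    """Assumed matching degree between two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def second_matching_results(search_words, sentence_word_sets, expansions):
    """Return one matching result (sentence id -> degree) per expansion."""
    results = []
    for extra_words in expansions:
        expanded = search_words | extra_words
        results.append(
            {sid: jaccard(expanded, words) for sid, words in sentence_word_sets.items()}
        )
    return results


def best_sentences(results):
    """Second sentence set: the best-matching sentence id of each result."""
    return {max(result, key=result.get) for result in results if result}


# Hypothetical data: two document sentences and two expansion words.
corpus = {
    "s1": {"activate", "credit", "card"},
    "s2": {"cancel", "credit", "card"},
}
results = second_matching_results({"card"}, corpus, [{"activate"}, {"cancel"}])
print(best_sentences(results))
```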
S307, judging whether the sum of the numbers of document sentences corresponding to word sets whose matching degree is greater than the preset matching degree in the second matching results is greater than the preset number.
Optionally, since the expansion words may not be accurate enough, the matching results may still not be accurate enough. It is therefore further necessary to judge whether the sum of the numbers of document sentences corresponding to word sets whose matching degree is greater than the preset matching degree in the second matching results is greater than the preset number. If the sum is greater than the preset number, step S308 is executed. If the sum is not greater than the preset number, step S312 may be executed.
S308, determining a plurality of target popular documents based on the keywords of the search sentences.
Specifically, for each keyword in the search sentence, the popular documents retrieved when past search sentences contained that keyword, that is, the documents whose retrieval frequency is greater than a preset frequency, are determined from the historical search records. Then the popular documents are weighted and averaged according to the frequency with which each of them was retrieved, sorted according to the weighted-average result, and the top K popular documents are finally determined as the target popular documents.
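The selection of target popular documents in step S308 can be sketched as below. The weighting scheme is not specified in the text, so this sketch assumes a plain average of retrieval frequencies over the keywords of the search sentence; the shape of `history` (pairs of keyword and retrieved document id) and the names `min_frequency` and `top_k` are hypothetical.

```python
from collections import Counter


def target_popular_documents(keywords, history, min_frequency=1, top_k=2):
    """history: iterable of (keyword, document_id) pairs from past searches."""
    counts = Counter(history)  # (keyword, document_id) -> retrieval frequency
    scores = Counter()
    for (keyword, doc), freq in counts.items():
        # Keep only popular documents: retrieved more often than the preset frequency.
        if keyword in keywords and freq > min_frequency:
            scores[doc] += freq / len(keywords)
    return [doc for doc, _ in scores.most_common(top_k)]


# Hypothetical historical search records.
history = (
    [("card", "d1")] * 3
    + [("card", "d2")] * 2
    + [("activate", "d1")] * 2
    + [("card", "d3")]
)
print(target_popular_documents({"card", "activate"}, history))
```

Here `d3` is dropped because it was retrieved only once for the keyword, which does not exceed the assumed preset frequency.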
S309, performing secondary expansion on the search word set by using each hot word in each target popular document to obtain a plurality of search expansion word sets.
It should be noted that, once the target popular documents are obtained, this indicates that the sentence the user is currently searching for may be a sentence in the target popular documents, so the search word set can be secondarily expanded by using each hot word in each target popular document. A "hot word" is a word that is used with high frequency as a search term.
S310, matching each search expansion word set with the word sets corresponding to the document sentences to obtain a third matching result corresponding to that search expansion word set.
S311, adding the document sentences corresponding to the word sets with the maximum matching degree in the third matching results into the third sentence sets.
S312, determining the intersection of the first statement set, the second statement set and the third statement set.
S313, determining each document statement in the intersection as a target document statement.
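Steps S312 and S313 reduce to a set intersection. The text does not say how the intersection behaves when steps S304 to S311 are skipped and the second or third sentence set was never built; the sketch below assumes such sets are passed as `None` and left out of the intersection. That handling and the name `intersect_sentence_sets` are assumptions for illustration only.

```python
def intersect_sentence_sets(*sentence_sets):
    """Intersect the sentence sets that were actually built (None means skipped)."""
    populated = [s for s in sentence_sets if s is not None]
    if not populated:
        return set()
    return set.intersection(*populated)


# All three sets were built: only sentences present in every set remain.
print(intersect_sentence_sets({"s1", "s2"}, {"s1"}, {"s1", "s3"}))
# Expansion was skipped: only the first sentence set exists.
print(intersect_sentence_sets({"s1"}, None, None))
```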
S106, for each target document sentence, splicing the document sentences of the paragraph in which the target document sentence is located to obtain a target text corresponding to the target document sentence.
The main purpose of a user in inputting a search sentence is to obtain the content related to it; in the embodiment of the present application, the related content is the text of the paragraph in which the target document sentence is located. For example, if the search sentence entered by the user is "activate credit card step", the document will include the following steps: a first step; a second step. Therefore, the document sentences of the paragraph in which the target document sentence is located need to be spliced to obtain the target text corresponding to the target document sentence.
Optionally, when the document is processed in the manner shown in fig. 2, a specific implementation of step S106, as shown in fig. 4, includes:
S401, for each target document sentence, finding each document sentence whose paragraph number is consistent with that of the target document sentence.
S402, splicing the found document sentences according to the sentence sequence number of each document sentence to obtain a target text corresponding to the target document sentence.
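Steps S401 and S402 can be sketched with a small record type. The fields `paragraph_no` and `sentence_no` correspond to the paragraph number and sentence sequence number described in the text; the `document_id` field, the `DocumentSentence` class and the function name are assumptions added so that paragraph numbers from different documents do not collide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentSentence:
    text: str
    document_id: str   # assumed field, not named in the text
    paragraph_no: int  # number of the paragraph the sentence belongs to
    sentence_no: int   # sentence sequence number within the paragraph


def splice_paragraph(target, index_area):
    """Join all sentences of the target's paragraph in sentence order."""
    same_paragraph = [
        s for s in index_area
        if s.document_id == target.document_id and s.paragraph_no == target.paragraph_no
    ]
    same_paragraph.sort(key=lambda s: s.sentence_no)
    return " ".join(s.text for s in same_paragraph)


# Hypothetical index area, deliberately stored out of order.
index_area = [
    DocumentSentence("Second, confirm the code.", "d1", 1, 2),
    DocumentSentence("First, open the app.", "d1", 1, 1),
    DocumentSentence("Call support to report a lost card.", "d1", 2, 1),
]
print(splice_paragraph(index_area[0], index_area))
```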
S107, feeding back each target document sentence and the corresponding target text to the user.
In order to enable the user to quickly know whether the retrieved result is the content that the user needs to retrieve, the target document sentence also needs to be fed back.
The embodiment of the application provides a sentence retrieval method, which divides a document into a plurality of document sentences in advance. In the searching process, a searching sentence input by a user is obtained. And then searching whether a document statement consistent with the retrieval statement exists in the target library. And if the document sentence consistent with the retrieval sentence is found, determining the document sentence consistent with the retrieval sentence as the target document sentence. And if the document sentences consistent with the search sentences are not found, performing word segmentation on the search sentences to obtain a search word set. And then matching the retrieval word set with the word sets corresponding to the document sentences to obtain matching results, and selecting at least one document sentence from each document sentence based on the matching results to determine the document sentence as a target document sentence. And respectively splicing all document sentences of the paragraphs where the target document sentences are located aiming at each target document sentence to obtain a target text corresponding to the target document sentence. And finally, feeding back each target document sentence and the corresponding target text to the user. Therefore, the retrieval is directly carried out by using the sentences, the retrieval efficiency is improved, and when the sentences can not be retrieved directly, the retrieval is carried out by adopting a word segmentation matching mode, so that the retrieval accuracy is effectively improved.
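The overall control flow summarized above (exact lookup first, word matching only as a fallback) can be sketched as follows. Whitespace splitting stands in for real word segmentation and Jaccard overlap stands in for the matching degree; both are assumptions, as is the function name `retrieve`.

```python
def retrieve(search_sentence, sentences):
    """Return the target document sentences for a search sentence."""
    # Exact match found (S103): the sentence itself is the target.
    if search_sentence in sentences:
        return [search_sentence]

    # No exact match: segment the search sentence (S104) and match word sets (S105).
    query = set(search_sentence.lower().split())

    def degree(sentence):
        words = set(sentence.lower().split())
        union = query | words
        return len(query & words) / len(union) if union else 0.0

    best = max(sentences, key=degree, default=None)
    if best is None or degree(best) == 0.0:
        return []
    return [best]


# Hypothetical document sentences.
sentences = ["Activate the credit card online.", "Report a lost card by phone."]
print(retrieve("Activate the credit card online.", sentences))
print(retrieve("activate credit card", sentences))
```

The exact-match branch avoids segmentation entirely, which is where the efficiency gain described above would come from.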
Another embodiment of the present application provides an apparatus for sentence searching, as shown in fig. 5, including:
An obtaining unit 501, configured to obtain a search sentence input by a user.
A first searching unit 502, configured to search whether a document sentence consistent with the search sentence exists in the target library.
The document sentences are sentences obtained by splitting each document in advance.
A first determining unit 503, configured to determine, when a document sentence consistent with the search sentence is found, that document sentence as a target document sentence.
A first word segmentation unit 504, configured to segment the search sentence to obtain a search word set when no document sentence consistent with the search sentence is found.
A matching unit 505, configured to match the search word set with the word set corresponding to each document sentence to obtain a matching result, and, based on the matching result, to select at least one document sentence from the document sentences and determine it as a target document sentence.
A splicing unit 506, configured to splice, for each target document sentence, the document sentences of the paragraph in which the target document sentence is located, to obtain a target text corresponding to the target document sentence.
A feedback unit 507, configured to feed back each target document sentence and its corresponding target text to the user.
Optionally, in an apparatus for sentence searching provided in another embodiment of the present application, the apparatus further includes:
A series unit, configured to connect a plurality of documents in series based on each business process.
A merging unit, configured to merge consecutive documents whose frequency of common use exceeds a preset frequency threshold.
A splitting unit, configured to split each document to obtain the document sentences corresponding to each document.
A recording unit, configured to record related item information of each document sentence of each document. The related item information includes at least the number of the paragraph to which the document sentence belongs and the sentence sequence number.
A storage unit, configured to store each document sentence in a document area of the index library according to the document to which it belongs, and to store each document sentence and its related item information in an index area of the index library.
An establishing unit, configured to establish a mapping relation between each document and each document sentence in the index area.
A second word segmentation unit, configured to segment each document sentence to obtain a word set corresponding to each document sentence.
A keyword processing unit, configured to determine the keywords of each document sentence from the word set corresponding to that document sentence, and to store the keywords of each document sentence in a word list.
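The preprocessing performed by the splitting, recording, storage and second word segmentation units can be sketched as one function. The sketch assumes that paragraphs are separated by blank lines, that sentences end with `.`, `!`, `?` or `;`, and that a regular expression is an acceptable stand-in for word segmentation; the dictionary layout and the name `build_index` are hypothetical.

```python
import re


def build_index(documents):
    """documents: dict of document id -> text with blank-line-separated paragraphs."""
    document_area = {}  # document id -> its document sentences
    index_area = []     # one record per document sentence with related item information
    for doc_id, text in documents.items():
        document_area[doc_id] = []
        paragraphs = [p for p in text.split("\n\n") if p.strip()]
        for paragraph_no, paragraph in enumerate(paragraphs, start=1):
            parts = re.split(r"(?<=[.!?;])\s+", paragraph)
            sentences = [s.strip() for s in parts if s.strip()]
            for sentence_no, sentence in enumerate(sentences, start=1):
                document_area[doc_id].append(sentence)
                index_area.append({
                    "sentence": sentence,
                    "document": doc_id,
                    "paragraph_no": paragraph_no,
                    "sentence_no": sentence_no,
                    "words": set(re.findall(r"\w+", sentence.lower())),
                })
    return document_area, index_area


# Hypothetical document with two paragraphs.
docs = {"d1": "Open the app. Tap activate.\n\nCall support."}
document_area, index_area = build_index(docs)
print(len(index_area))
```

Storing the document id in each index record plays the role of the mapping relation between documents and document sentences.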
Optionally, in an apparatus for sentence retrieval provided in another embodiment of the present application, the splicing unit includes:
A second searching unit, configured to find, for each target document sentence, each document sentence whose paragraph number is consistent with that of the target document sentence.
A splicing subunit, configured to splice the found document sentences according to the sentence sequence numbers of the document sentences to obtain the target text corresponding to the target document sentence.
Optionally, in an apparatus for sentence retrieval provided in another embodiment of the present application, the matching unit includes:
A first matching unit, configured to match the search word set with the word sets corresponding to the document sentences to obtain a first matching result.
A first adding unit, configured to add the document sentence corresponding to the word set with the maximum matching degree in the first matching result to the first sentence set.
A first expansion unit, configured to expand the word set corresponding to each document sentence and/or the search word set by using expansion words when the first matching result satisfies the expansion condition.
A second matching unit, configured to match the expanded word set corresponding to each document sentence with the search word set multiple times to obtain multiple second matching results.
A second adding unit, configured to add the document sentence corresponding to the word set with the maximum matching degree in each second matching result to the second sentence set.
A document determining unit, configured to determine a plurality of target popular documents based on the keywords in the search sentence when the sum of the numbers of document sentences corresponding to word sets whose matching degree is greater than the preset matching degree in the second matching results is greater than the preset number.
A second expansion unit, configured to perform secondary expansion on the search word set by using each hot word in each target popular document to obtain a plurality of search expansion word sets.
A third matching unit, configured to match each search expansion word set with the word set corresponding to each document sentence to obtain a third matching result corresponding to that search expansion word set.
A third adding unit, configured to add the document sentence corresponding to the word set with the maximum matching degree in each third matching result to the third sentence set.
An intersection operation unit, configured to determine the intersection of the first sentence set, the second sentence set and the third sentence set.
A second determining unit, configured to determine each document sentence in the intersection as a target document sentence.
Optionally, in an apparatus for sentence retrieval provided in another embodiment of the present application, the apparatus further includes:
A judging unit, configured to judge whether the search sentence input by the user belongs to the current continuous high-density similar search sentences.
A third determining unit, configured to determine the index library as the target library when it is determined that the search sentence input by the user does not belong to the current continuous high-density similar search sentences. The index library includes the full set of document sentences.
A fourth determining unit, configured to determine the current exclusive library as the target library when it is determined that the search sentence input by the user belongs to the current continuous high-density similar search sentences. The exclusive library initially contains the full set of document sentences, and after each search within the current continuous high-density similar search sentences, the target document sentences that were found are removed from the exclusive library.
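The choice between the index library and the exclusive library can be sketched as below. How a search sentence is judged to belong to the current continuous high-density similar search sentences is not covered here, so the sketch takes that judgment as a boolean input. Resetting the exclusive library once a search falls outside such a run is an assumption of this sketch, and `TargetLibraries` and its methods are hypothetical names.

```python
class TargetLibraries:
    """Holds the full index library and a shrinking exclusive library."""

    def __init__(self, all_sentences):
        self.index_library = set(all_sentences)
        self.exclusive_library = set(all_sentences)

    def choose(self, is_high_density_similar):
        """Pick the target library for the current search sentence."""
        if is_high_density_similar:
            return self.exclusive_library
        return self.index_library

    def record_hits(self, is_high_density_similar, target_sentences):
        """Update the exclusive library after a search has finished."""
        if is_high_density_similar:
            # Found sentences are excluded from later similar searches.
            self.exclusive_library -= set(target_sentences)
        else:
            # Assumed behaviour: a new run starts again from the full set.
            self.exclusive_library = set(self.index_library)


libs = TargetLibraries(["a", "b", "c"])
libs.record_hits(True, ["a"])
print(sorted(libs.choose(True)))
```

Removing already-returned sentences means that a user who repeats near-identical searches is shown different candidates each time.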
It should be noted that, for the specific working processes of each unit provided in the foregoing embodiments of the present application, corresponding steps in the foregoing method embodiments may be referred to accordingly, and are not described herein again.
Another embodiment of the present application provides an electronic device, as shown in fig. 6, including:
a memory 601 and a processor 602.
The memory 601 is used for storing programs.
The processor 602 is configured to execute a program stored in the memory 601, and when the program is executed, the program is specifically configured to implement the method for sentence retrieval provided in any of the above embodiments.
Another embodiment of the present application provides a computer storage medium for storing a computer program, which when executed, is used to implement the sentence retrieval method provided in any one of the above embodiments.
Computer storage media, including persistent and non-persistent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should be noted that the method and apparatus for sentence search, the electronic device, and the storage medium provided by the present invention can be used in the field of artificial intelligence, the field of big data, or the field of finance. The above description is only an example, and does not limit the application fields of the method and apparatus for sentence retrieval, the electronic device, and the storage medium provided by the present invention.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A method of sentence retrieval, comprising:
acquiring a retrieval statement input by a user;
searching whether a document sentence consistent with the retrieval sentence exists in a target library; the document sentences are obtained by splitting each document in advance;
if the document sentence consistent with the retrieval sentence is found out, determining the document sentence consistent with the retrieval sentence as a target document sentence;
if the document sentences consistent with the retrieval sentences are not found out, segmenting the retrieval sentences to obtain a retrieval word set;
matching the retrieval word set with the word sets corresponding to the document sentences to obtain matching results, and selecting at least one document sentence from each document sentence based on the matching results to determine the document sentence as a target document sentence;
respectively splicing all the document sentences of the paragraphs where the target document sentences are located aiming at all the target document sentences to obtain target texts corresponding to the target document sentences;
and feeding back each target document sentence and the corresponding target text to the user.
2. The method of claim 1, further comprising:
based on each business process, connecting a plurality of documents in series;
merging the continuous documents of which the common use frequency exceeds a preset frequency threshold;
splitting each document to obtain each document sentence corresponding to each document;
recording related item information of each document statement of each document; wherein, the related item information at least comprises the number of the paragraph to which the related item belongs and the sequence number of the sentence;
storing each document statement to a document area of an index library according to the document to which the document statement belongs, and storing each document statement and related item information thereof to an index area of the index library;
establishing a mapping relation between each document and each document statement in the index area;
performing word segmentation on each document sentence to obtain a word set corresponding to each document sentence;
and determining the keywords of each document sentence from the word set corresponding to each document sentence, and storing the keywords of each document sentence into a word list.
3. The method according to claim 2, wherein the splicing the document sentences of the paragraphs where the target document sentences are located to obtain the target texts corresponding to the target document sentences for each target document sentence respectively comprises:
respectively aiming at each target document statement, finding out each document statement consistent with the affiliated paragraph number of the target document statement;
and splicing the found document sentences according to the sentence sequence number of each document sentence to obtain a target text corresponding to the target document sentence.
4. The method of claim 1, wherein the matching the search word set with a word set corresponding to each of the document sentences to obtain matching results, and selecting at least one of the document sentences from each of the document sentences to be determined as a target document sentence based on the matching results comprises:
matching the retrieval word set with the word sets corresponding to the document sentences to obtain a first matching result;
adding the document sentences corresponding to the word sets with the maximum matching degree in the first matching result into a first sentence set;
if the first matching result meets the expansion condition, expanding the word set and/or the retrieval word set corresponding to each document sentence by using an expansion word;
matching the word set corresponding to each expanded document sentence with the retrieval word set for multiple times to obtain multiple second matching results;
adding the document sentences corresponding to the word sets with the maximum matching degree in each second matching result into a second sentence set;
if the sum of the number of the document sentences corresponding to the word sets with the matching degrees larger than the preset matching degree in each second matching result is larger than the preset number, determining a plurality of target popular documents based on the keywords in the search sentences;
performing secondary expansion on the search word set by using each hot word in each target hot document to obtain a plurality of search expansion word sets;
matching the retrieval expansion word set with the word set corresponding to each document sentence to obtain a third matching result corresponding to the retrieval expansion word set;
adding the document sentences corresponding to the word sets with the maximum matching degree in the third matching results into a third sentence set;
determining an intersection of the first statement set, the second statement set and the third statement set;
and determining each document statement in the intersection as the target document statement.
5. The method of claim 1, wherein, before the searching whether a document sentence consistent with the retrieval sentence exists in the target library, the method further comprises:
judging whether the retrieval sentences input by the user belong to current continuous high-density similar search sentences or not;
if the retrieval statement input by the user does not belong to the current continuous high-density similar search statement, determining an index library as the target library; wherein the index repository includes a full number of document statements;
if the retrieval statement input by the user belongs to the current continuous high-density similar search statement, determining the current exclusive library as the target library; wherein the exclusive library initially contains the full set of document sentences, and after each search within the current continuous high-density similar search sentences, the target document sentences that were found are removed from the exclusive library.
6. An apparatus for sentence retrieval, comprising:
the acquisition unit is used for acquiring a retrieval statement input by a user;
the first searching unit is used for searching whether a document sentence consistent with the retrieval sentence exists in a target library; the document sentences are sentences obtained by splitting each document in advance;
a first determination unit configured to determine, when a document sentence that is identical to the search sentence is found, the document sentence that is identical to the search sentence as a target document sentence;
the first word segmentation unit is used for segmenting the search sentences to obtain a search word set when the document sentences consistent with the search sentences are not searched out;
the matching unit is used for matching the retrieval word set with the word sets corresponding to the document sentences to obtain matching results, and based on the matching results, at least one document sentence is selected from the document sentences to be determined as a target document sentence;
the splicing unit is used for splicing the document sentences of the paragraphs where the target document sentences are located respectively aiming at each target document sentence to obtain target texts corresponding to the target document sentences;
and the feedback unit is used for feeding back each target document statement and the corresponding target text to the user.
7. The apparatus of claim 6, further comprising:
the series unit is used for connecting a plurality of documents in series based on each business process;
the merging unit is used for merging the continuous documents of which the common use frequency exceeds a preset frequency threshold;
the splitting unit is used for splitting each document to obtain each document statement corresponding to each document;
a recording unit, configured to record related item information of each document statement of each document; wherein, the related item information at least comprises the number of the paragraph to which the related item belongs and the sequence number of the sentence;
the storage unit is used for storing each document statement to a document area of an index library according to the document to which the document statement belongs, and storing each document statement and related item information thereof to an index area of the index library;
the establishing unit is used for establishing the mapping relation between each document and each document statement in the index area;
the second word segmentation unit is used for performing word segmentation on each document sentence to obtain a word set corresponding to each document sentence;
and the keyword processing unit is used for determining the keywords of the document sentences from the word sets corresponding to the document sentences and storing the keywords of the document sentences into a word list.
8. The apparatus of claim 6, wherein the matching unit comprises:
a first matching unit, configured to match the retrieval word set with the word set corresponding to each document sentence to obtain a first matching result;
a first adding unit, configured to add the document sentence corresponding to the word set with the largest matching degree in the first matching result to a first sentence set;
a first expansion unit, configured to expand the word set corresponding to each document sentence and/or the retrieval word set by using expansion words when the first matching result meets the expansion condition;
a second matching unit, configured to match the expanded word set corresponding to each document sentence with the retrieval word set multiple times to obtain a plurality of second matching results;
a second adding unit, configured to add the document sentence corresponding to the word set with the largest matching degree in each second matching result to a second sentence set;
a document determining unit, configured to determine a plurality of target hot documents based on the keywords in the search sentence when the total number of document sentences whose word sets have a matching degree greater than the preset matching degree, summed over the second matching results, is greater than the preset number;
a second expansion unit, configured to perform secondary expansion on the retrieval word set by using each hot word in each target hot document to obtain a plurality of retrieval expansion word sets;
a third matching unit, configured to match each retrieval expansion word set with the word set corresponding to each document sentence to obtain a third matching result corresponding to that retrieval expansion word set;
a third adding unit, configured to add the document sentence corresponding to the word set with the largest matching degree in each third matching result to a third sentence set;
an intersection operation unit, configured to determine the intersection of the first sentence set, the second sentence set and the third sentence set;
a second determining unit, configured to determine each document sentence in the intersection as the target document sentence.
9. An electronic device, comprising:
a memory and a processor;
wherein the memory is used for storing programs;
the processor is configured to execute the program, and the program, when executed, implements the statement retrieval method as claimed in any one of claims 1 to 5.
10. A computer storage medium storing a computer program which, when executed, implements the statement retrieval method as claimed in any one of claims 1 to 5.
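The three-stage matching recited for the matching unit of claim 8 can be illustrated with a minimal sketch. It is not the applicant's implementation. Several details are assumptions made only for illustration: the matching degree is taken to be Jaccard overlap, the expansion condition is taken to be "no sentence matches the query exactly", the expansion words and hot words are passed in as ready-made inputs, and the early returns when a condition is not met are a guess at behaviour the claim leaves open. All function and parameter names are hypothetical.

```python
from typing import Dict, List, Set

WordSets = Dict[int, Set[str]]  # sentence id -> word set of that sentence


def match_degree(query: Set[str], words: Set[str]) -> float:
    """Jaccard overlap, an assumed stand-in for the claim's matching degree."""
    union = query | words
    return len(query & words) / len(union) if union else 0.0


def match_all(query: Set[str], docs: WordSets) -> Dict[int, float]:
    """One matching result: a degree for every document sentence."""
    return {sid: match_degree(query, words) for sid, words in docs.items()}


def best(result: Dict[int, float]) -> Set[int]:
    """Ids with the largest non-zero matching degree (ties are kept)."""
    top = max(result.values(), default=0.0)
    return {sid for sid, deg in result.items() if deg == top and top > 0}


def expand(words: Set[str], expansions: Dict[str, Set[str]]) -> Set[str]:
    """Add the expansion words registered for any word already in the set."""
    out = set(words)
    for w in words:
        out |= expansions.get(w, set())
    return out


def retrieve(query: Set[str], docs: WordSets,
             expansion_rounds: List[Dict[str, Set[str]]],
             hot_docs: List[Set[str]],
             preset_degree: float = 0.3, preset_number: int = 0) -> Set[int]:
    # Stage 1: match the retrieval word set against every sentence.
    first_result = match_all(query, docs)
    first_set = best(first_result)
    # Assumed expansion condition: no sentence matched the query exactly.
    if max(first_result.values(), default=0.0) >= 1.0:
        return first_set

    # Stage 2: expand the sentence word sets and match again, once per round.
    second_set: Set[int] = set()
    above = 0  # sentences above the preset degree, summed over all rounds
    for expansions in expansion_rounds:
        expanded = {sid: expand(w, expansions) for sid, w in docs.items()}
        result = match_all(query, expanded)
        second_set |= best(result)
        above += sum(deg > preset_degree for deg in result.values())
    if above <= preset_number:
        return first_set & second_set  # assumed fallback, not in the claim

    # Stage 3: expand the query with the hot words of each hot document.
    third_set: Set[int] = set()
    for hot_words in hot_docs:
        third_set |= best(match_all(query | hot_words, docs))

    # The target sentences are the intersection of the three sets.
    return first_set & second_set & third_set


docs = {0: {"reset", "password", "account"},
        1: {"delete", "account"},
        2: {"weather", "today"}}
targets = retrieve({"change", "password"}, docs,
                   expansion_rounds=[{"reset": {"change"}}, {"account": {"login"}}],
                   hot_docs=[{"reset"}, {"account"}])
print(targets)
```

In this toy run only sentence 0 has the top matching degree in all three stages, so it is the only member of the intersection. Taking the intersection rather than the union trades recall for precision: a sentence must win the plain match, the expanded match and the hot-word match to be returned.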
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211077618.9A CN115329048A (en) | 2022-09-05 | 2022-09-05 | Statement retrieval method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211077618.9A CN115329048A (en) | 2022-09-05 | 2022-09-05 | Statement retrieval method and device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115329048A (en) | 2022-11-11 |
Family
ID=83930104
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211077618.9A Pending CN115329048A (en) | 2022-09-05 | 2022-09-05 | Statement retrieval method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115329048A (en) |
2022-09-05: CN application CN202211077618.9A, publication CN115329048A (en), status: active, Pending
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116108326A (en) * | 2023-04-12 | 2023-05-12 | 山东工程职业技术大学 | Mathematic tool software control method, device, equipment and storage medium |
CN116701615A (en) * | 2023-08-08 | 2023-09-05 | 建信金融科技有限责任公司 | Service document online management method and device, electronic equipment and readable storage medium |
CN116701615B (en) * | 2023-08-08 | 2023-11-03 | 建信金融科技有限责任公司 | Service document online management method and device, electronic equipment and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3637295B1 (en) | Risky address identification method and apparatus, and electronic device | |
CN109460455B (en) | Text detection method and device | |
CN109446524B (en) | A kind of voice quality detecting method and device | |
CN111324784B (en) | Character string processing method and device | |
CN109271489B (en) | Text detection method and device | |
CN109145110B (en) | Label query method and device | |
CN110019669B (en) | Text retrieval method and device | |
CN115329048A (en) | Statement retrieval method and device, electronic equipment and storage medium | |
CN114239591B (en) | Sensitive word recognition method and device | |
CN107368489B (en) | Information data processing method and device | |
CN108427686A (en) | Text data querying method and device | |
CN114911917A (en) | Asset meta-information searching method and device, computer equipment and readable storage medium | |
CN108228612B (en) | Method and device for extracting network event keywords and emotional tendency | |
Iskandarli | Applying clustering and topic modeling to automatic analysis of citizens’ comments in E-Government | |
CN115563268A (en) | Text abstract generation method and device, electronic equipment and storage medium | |
CN117763106B (en) | Document duplicate checking method and device, storage medium and electronic equipment | |
EP2461255A1 (en) | Document data processing device | |
CN111401047A (en) | Method and device for generating dispute focus of legal document and computer equipment | |
CN110955845A (en) | User interest identification method and device, and search result processing method and device | |
CN110968691B (en) | Judicial hotspot determination method and device | |
CN110210030B (en) | Statement analysis method and device | |
CN113704398A (en) | Keyword extraction method and device | |
CN112861974A (en) | Text classification method and device, electronic equipment and storage medium | |
CN113627148A (en) | Automatic association method and device for knowledge in knowledge base | |
CN111400577A (en) | Search recall method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||