CN115455142A - Text retrieval method, computer device and storage medium - Google Patents
- Publication number
- CN115455142A (application number CN202211054998.4A)
- Authority
- CN
- China
- Prior art keywords
- text
- candidate
- candidate text
- target
- pinyin
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3338—Query expansion
- G06F16/334—Query execution
- G06F16/3343—Query execution using phonetics
Abstract
The present application relates to a text retrieval method, a computer device, and a storage medium. By performing similar pinyin retrieval on the text to be retrieved to obtain candidate texts, candidate texts can be quickly screened out from a massive corpus, pronunciation recognition errors are compensated for, and the efficiency and accuracy of text retrieval are improved. The method comprises: acquiring a text to be retrieved; performing similar pinyin retrieval on the text to be retrieved to obtain at least one candidate text, and determining a target weight value for each candidate text; and, based on a preset text screening strategy, screening the candidate texts according to their target weight values to obtain a target text.
Description
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a text retrieval method, a computer device, and a storage medium.
Background
With the rapid development of artificial intelligence technology, human-computer interaction has gradually become part of everyday life: more and more smart home devices and in-vehicle software provide voice interaction, letting users control devices and query information through voice commands and making operation more convenient. During speech recognition, however, the influence of speaker accent, environmental noise, homophones, and the like can cause the content expressed by the user to be misrecognized, failing to meet the user's interaction needs. Therefore, a voice interaction system usually needs to access a fuzzy retrieval component that performs fuzzy retrieval on the speech recognition result to obtain the content the user expects, providing more accurate instruction input for downstream services and a better human-computer interaction experience.
Existing fuzzy retrieval schemes search directly on the sentence features and semantics of the text; with massive candidate data, the computation required is enormous and retrieval efficiency is low. Moreover, existing schemes cannot handle pronunciation recognition errors in voice interaction scenarios, so retrieval accuracy is also low.
Therefore, how to improve the efficiency and accuracy of text retrieval becomes an urgent problem to be solved.
Disclosure of Invention
The present application provides a text retrieval method, a computer device, and a storage medium, in which similar pinyin retrieval is performed on a text to be retrieved to obtain candidate texts. Candidate texts can thus be quickly screened out from massive texts, pronunciation recognition errors are compensated for, and the efficiency and accuracy of text retrieval are improved.
In a first aspect, the present application provides a text retrieval method, including:
acquiring a text to be retrieved;
performing similar pinyin retrieval on the text to be retrieved to obtain at least one candidate text, and determining a target weight value corresponding to each candidate text;
and based on a preset text screening strategy, performing text screening according to a target weight value corresponding to each candidate text to obtain a target text.
In a second aspect, the present application also provides a computer device comprising a memory and a processor;
the memory for storing a computer program;
the processor is configured to execute the computer program and implement the text retrieval method as described above when executing the computer program.
In a third aspect, the present application further provides a computer-readable storage medium storing a computer program, which when executed by a processor causes the processor to implement the text retrieval method as described above.
The application discloses a text retrieval method, a computer device, and a storage medium. At least one candidate text is obtained by acquiring a text to be retrieved and performing similar pinyin retrieval on it, so candidate texts can be quickly screened out from a large volume of text, pronunciation recognition errors are compensated for, and the efficiency and accuracy of text retrieval are improved. By determining a target weight value for each candidate text and screening the candidates according to those weight values under a preset text screening strategy, the target text is obtained; the candidates can thus be further filtered by weight, improving retrieval accuracy.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below cover some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of a text retrieval method provided in an embodiment of the present application;
FIG. 2 is a schematic flow chart diagram of sub-steps of a similar pinyin retrieval provided by an embodiment of the application;
FIG. 3 is a schematic flow chart diagram of a substep of determining target weight values provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a weighting strategy provided by an embodiment of the present application;
FIG. 5 is a schematic flow chart diagram of a text filtering sub-step provided in an embodiment of the present application;
fig. 6 is a schematic block diagram of a structure of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present application. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution order may be changed according to the actual situation.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
The embodiment of the application provides a text retrieval method, computer equipment and a storage medium. The text retrieval method can be applied to a server or a terminal, candidate texts are obtained by performing similar pinyin retrieval on the texts to be retrieved, the candidate texts can be rapidly screened out from massive texts, the problem of pronunciation recognition errors is solved, and the efficiency and the accuracy of text retrieval are improved.
The server may be an independent server or a server cluster. The terminal can be an electronic device such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant and a wearable device.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
As shown in fig. 1, the text retrieval method includes steps S10 to S30.
And S10, acquiring a text to be retrieved.
It should be noted that, in the embodiment of the present application, the text retrieval method may be applied in a human-computer interaction scenario, and may receive a voice signal input by a user on a human-computer interaction platform, perform voice recognition on the voice signal, and obtain a text to be retrieved; then, performing similar pinyin retrieval on the text to be retrieved to obtain at least one candidate text, and determining a target weight value corresponding to each candidate text; and finally, based on a preset text screening strategy, performing text screening according to the target weight value corresponding to each candidate text to obtain a target text, and displaying the target text on the human-computer interaction platform.
Illustratively, a voice signal input by a user on the man-machine interaction platform can be collected through the voice collecting device, and the voice signal is subjected to voice recognition to obtain a text to be retrieved. The voice acquisition device can comprise an electronic device for acquiring voice, such as a recorder, a recording pen, a microphone and the like.
In some embodiments, before performing speech recognition on the speech signal, noise reduction processing may be performed on the speech signal, and the speech signal after the noise reduction processing is determined as the speech signal to be recognized.
For example, the noise reduction of the voice signal may be implemented with a speech denoising algorithm, which may include, but is not limited to, spectral subtraction, Wiener filtering, minimum mean square error (MMSE) estimation, and wavelet transform algorithms. By denoising the speech signal, noise information is removed and the accuracy of subsequent speech recognition can be improved.
In some embodiments, before performing speech recognition on the speech signal, useful signal extraction may also be performed on the speech signal, and the extracted useful signal is determined as the speech signal to be recognized.
In the embodiment of the present application, since the speech signal may include a useless signal, in order to improve the accuracy of the subsequent speech recognition, the useful signal in the speech signal needs to be extracted.
For example, a speech signal to be recognized may be obtained by extracting the useful signal from the speech signal based on a voice activity detection model; the specific extraction process is not limited herein. It should be noted that in speech signal processing, Voice Activity Detection (VAD) detects whether speech is present, separating the speech segments from the non-speech segments of the signal. VAD can be used for echo cancellation, noise suppression, speaker recognition, speech recognition, and the like.
For example, the text to be retrieved can be obtained by performing speech recognition on the speech signal to be recognized with a speech recognition algorithm. The speech recognition algorithm may include, but is not limited to, convolutional neural networks, deep neural networks, recurrent neural networks, and hidden Markov models. For example, speech recognition may be performed with a convolutional neural network; the specific recognition process is not limited herein.
And S20, performing similar pinyin retrieval on the text to be retrieved to obtain at least one candidate text, and determining a target weight value corresponding to each candidate text.
It should be noted that after the text to be retrieved is obtained, similar pinyin retrieval needs to be performed on it to obtain at least one candidate text. It can be understood that similar pinyin retrieval compensates for pronunciation recognition errors caused by speaker accent, environmental noise, homophones, and the like, allows candidate texts to be quickly screened out from massive texts, and improves the efficiency and accuracy of text retrieval.
By carrying out similar pinyin retrieval on the text to be retrieved to obtain at least one candidate text, the candidate text can be quickly screened out from massive texts, the interference of the data quantity and irrelevant data of subsequent text screening is greatly reduced, and the efficiency and the accuracy of text retrieval are improved; meanwhile, the problem of pronunciation recognition errors is solved, the compatibility of recognition errors caused by inaccurate pronunciation of the user is improved, and the accuracy of text retrieval is improved.
Referring to fig. 2, fig. 2 is a schematic flow chart of sub-steps of a similar pinyin search according to an embodiment of the present application, which may specifically include the following steps S201 to S203.
Step S201, determining pinyin information corresponding to the text to be retrieved, wherein the pinyin information comprises at least one initial pinyin.
For example, the text to be retrieved may include at least one word, a pinyin corresponding to each word may be obtained, and the pinyin corresponding to each word is determined as an initial pinyin, so as to obtain at least one initial pinyin.
And S202, based on a preset pinyin mapping library, performing similar pinyin conversion on each initial pinyin to obtain a target pinyin corresponding to each initial pinyin.
In the embodiment of the application, in order to handle user accents, pronunciation errors, and the like, a pinyin mapping library can be established in advance: commonly confused flat-tongue and retroflex initials, front and back nasal finals, and similarly pronounced initials and finals are mapped and stored in the pinyin mapping library.
Illustratively, the pinyin mapping library may include pinyin pairs of similar pronunciations, as follows:
"sh" -> "s", "ch" -> "c", "zh" -> "z", "f" -> "h", "l" -> "n", "r" -> "l", "b" -> "p", "an" -> "ang", "en" -> "eng", "in" -> "ing", "ai" -> "ei", "eng" -> "ong".
for example, similar pinyin conversion may be performed on each initial pinyin based on a preset pinyin mapping library to obtain a target pinyin corresponding to each initial pinyin. For example, the initial pinyin "feng" may be converted to "hong", the initial pinyin "qin" may be converted to "qing", the initial pinyin "shui" may be converted to "sui", and so on.
Similar pinyin conversion is carried out on each initial pinyin based on the pinyin mapping library to obtain the target pinyin corresponding to each initial pinyin, so that the problem of pronunciation identification errors is solved.
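The conversion step above can be sketched as follows. The mapping pairs copy the examples given in the text; the full initial list and the function names are illustrative assumptions, not part of the original disclosure.

```python
# A sketch of the similar-pinyin conversion step. The mapping pairs copy the
# examples given in the text; the initial list and names are assumptions.
INITIAL_MAP = {"sh": "s", "ch": "c", "zh": "z", "f": "h",
               "l": "n", "r": "l", "b": "p"}
FINAL_MAP = {"an": "ang", "en": "eng", "in": "ing", "ai": "ei", "eng": "ong"}

# Standard pinyin initials, longest first so "zh"/"ch"/"sh" match before "z"/"c"/"s".
ALL_INITIALS = sorted(
    ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
     "j", "q", "x", "r", "z", "c", "s", "y", "w"],
    key=len, reverse=True)

def split_syllable(syllable: str):
    """Split a pinyin syllable into (initial, final); the initial may be empty."""
    for ini in ALL_INITIALS:
        if syllable.startswith(ini):
            return ini, syllable[len(ini):]
    return "", syllable

def to_target_pinyin(syllable: str) -> str:
    """Convert an initial pinyin to its target pinyin via the mapping library."""
    ini, fin = split_syllable(syllable)
    return INITIAL_MAP.get(ini, ini) + FINAL_MAP.get(fin, fin)
```

Under this sketch, `to_target_pinyin("feng")` yields "hong" and `to_target_pinyin("qin")` yields "qing", matching the examples above.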
Step S203, based on a preset named entity index library, performing text query according to a target index corresponding to each target pinyin to obtain a candidate text corresponding to each target pinyin.
It should be noted that, in the embodiment of the present application, the named entity index library may include a correspondence between pinyin and indexes and a text corresponding to each index.
In the embodiment of the application, a named entity index library can be established in advance; the establishment process of the named entity index library may include: preprocessing a preset number of sample texts, and coding each preprocessed sample text to obtain an index of each sample text; determining the initial pinyin corresponding to each sample text, and performing similar pinyin conversion on the initial pinyin of each sample text based on a preset pinyin mapping library to obtain a target pinyin corresponding to each sample text; and storing the target pinyin corresponding to each sample text in association with the index of the sample text.
Preprocessing may include, among other things, removing spaces and punctuation marks. The index may be an identification code of the sample text for identifying the sample text.
For example, the specific process of performing similar pinyin conversion on the initial pinyin of each sample text may refer to the process of performing similar pinyin conversion on each initial pinyin in the foregoing embodiment, which is not described herein again specifically.
Exemplary, array structures in the named entity index repository are as follows:
key -> ID -> [value1, value2, …, valueN], where "key" is a pinyin, "ID" is an index, and each "value" is a sample text associated with the index ID.
For example, based on the named entity index library, a text query may be performed according to the target index corresponding to each target pinyin to obtain the candidate text for that pinyin. For example, for the target pinyin "zou", the index corresponding to "zou" may be looked up in the named entity index library based on the correspondence between pinyin and indexes, say index 1; the text corresponding to index 1 is then queried and determined as the candidate text corresponding to the target pinyin "zou".
By carrying out text query according to the target index corresponding to each target pinyin based on the named entity index library, the text query according to the target index can be realized, and the efficiency of querying the candidate text corresponding to each target pinyin is improved.
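A minimal in-memory sketch of such a named entity index store follows; the class and method names are hypothetical, and the preprocessing mirrors the space and punctuation removal described above.

```python
import re
from collections import defaultdict

class NamedEntityIndex:
    """Illustrative sketch of the pinyin -> index -> text store described
    above. All names here are assumptions for illustration."""

    def __init__(self):
        self.pinyin_to_ids = defaultdict(set)  # key (pinyin) -> {ID, ...}
        self.id_to_text = {}                   # ID -> sample text

    @staticmethod
    def preprocess(text: str) -> str:
        # Remove spaces and punctuation, as in the preprocessing step above.
        return re.sub(r"[\s\W]+", "", text)

    def add(self, doc_id: int, text: str, target_pinyins):
        """Store a sample text under its index and its target pinyins."""
        self.id_to_text[doc_id] = self.preprocess(text)
        for py in target_pinyins:
            self.pinyin_to_ids[py].add(doc_id)

    def query(self, target_pinyin: str):
        """key -> IDs -> texts: return candidate texts for one target pinyin."""
        ids = sorted(self.pinyin_to_ids.get(target_pinyin, ()))
        return [self.id_to_text[i] for i in ids]
```

A lookup then follows the key -> ID -> value chain in one step, which is what makes the query fast compared with scanning the full corpus.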
In the embodiment of the application, after the similar pinyin retrieval is performed on the text to be retrieved to obtain at least one candidate text, the target weight value corresponding to each candidate text can be determined.
In some embodiments, determining the target weight value corresponding to each candidate text may include: and counting the occurrence frequency of each candidate text, and correspondingly determining the target weight value of each candidate text according to the frequency of each candidate text.
For example, the frequency of the candidate text may be determined as the target weight value of the candidate text. For example, for the candidate text a, if the frequency of occurrence of the candidate text a in all the candidate texts is 3, it may be determined that the target weight value of the candidate text a is 3. For another example, for candidate text B, if the frequency of occurrence of candidate text B in all candidate texts is 5, it may be determined that the target weight value of candidate text B is 5.
By counting the occurrence frequency of each candidate text and correspondingly determining the target weight value of each candidate text according to the frequency of each candidate text, the candidate text with higher frequency can be endowed with a higher weight value.
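The frequency-based weighting described above amounts to a simple count over all pinyin lookups; a sketch, with an illustrative function name:

```python
from collections import Counter

def frequency_weights(candidates):
    """First-stage target weight = occurrence frequency of each candidate
    text across all similar-pinyin lookups (sketch; name is illustrative)."""
    return dict(Counter(candidates))

# Mirrors the example above: A appears 3 times, B appears 5 times.
weights = frequency_weights(["A", "B", "A", "B", "A", "B", "B", "B"])
# weights["A"] == 3, weights["B"] == 5
```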
It should be noted that, in the embodiment of the present application, in addition to determining the target weight value according to the frequency of the candidate text, the target weight value may also be determined by combining the frequency and the feedback operation of the user on the candidate text. In the embodiment of the present application, how to determine the target weight value by combining the frequency and the feedback operation of the user will be described in detail.
Referring to fig. 3, fig. 3 is a schematic flowchart of the substeps of determining the target weight value according to the embodiment of the present application, which may specifically include the following steps S204 to S206.
And S204, counting the occurrence frequency of each candidate text, and correspondingly determining a first weight value of each candidate text according to the frequency of each candidate text.
For example, the first weight value of each candidate text may be determined according to the detailed description of the above embodiment, and the specific process is not described herein again.
Step S205, obtaining user feedback information, and performing weighting processing on each candidate text according to the user feedback information to obtain a second weight value of each candidate text.
Illustratively, the user feedback information includes information such as a feedback operation of the user on the candidate text. It should be noted that, in the embodiment of the present application, after the candidate text is obtained, the candidate text may be displayed on the human-computer interaction platform, so that a user performs a feedback operation on the displayed candidate text.
The user feedback information may include, but is not limited to, whether to click, whether to ask a question, the number of times of asking a question, the similarity of asking a question, and the like. In the embodiment of the application, the feedback behavior of the user can reflect whether the candidate text is the result desired by the user, and the weight value of the candidate text is set according to the feedback operation.
In some embodiments, weighting each candidate text according to the user feedback information to obtain a second weight value of each candidate text may include: inputting user feedback information into a behavior clustering model for classification to obtain a target behavior category corresponding to each candidate text; and correspondingly determining a second weight value corresponding to each candidate text according to the target behavior category corresponding to each candidate text based on the corresponding relation between the preset behavior category and the weight value.
It should be noted that, in the embodiment of the present application, multiple rounds of historical interactive feedback operations of users may be collected in advance, and a behavior clustering model may be established from them. The input data set of the behavior clustering model includes user feedback information, such as whether the user clicked, whether a follow-up question was asked, the number of follow-up questions, and the similarity of the follow-up questions. The input data set can be defined as I = [I_1, I_2, …, I_m], and the output behavior categories as C = [C_1, C_2, …, C_n].
The behavior clustering model is established as follows: S1, randomly select k data objects from the whole data set I as cluster centers; S2, for each data object I_i, compute the distance d(I_i, C_j) to each cluster center C_j and assign I_i to the closest cluster; S3, compute the mean of all objects in each cluster as the new cluster center; S4, repeat steps S2 and S3 until the cluster centers no longer change.
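Steps S1 to S4 above describe standard k-means clustering; a minimal self-contained sketch under that reading, using Euclidean distance and illustrative names:

```python
import random

def kmeans(data, k, iters=100):
    """Minimal k-means matching steps S1-S4 above. Each data point is a
    feature vector encoding user feedback (click, asked, count, similarity)."""
    centers = random.sample(data, k)                       # S1: random centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:                                     # S2: assign to nearest
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(x, centers[j])))
            clusters[j].append(x)
        new_centers = [                                    # S3: recompute means
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[j]
            for j, c in enumerate(clusters)
        ]
        if new_centers == centers:                         # S4: converged
            break
        centers = new_centers
    return centers
```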
Illustratively, the behavior categories may include no click operation, click operation but no question, click operation with question but dissimilar question, click operation with question but similar question, and so on.
In the embodiment of the present application, the weight values corresponding to the different behavior categories are P = [p_1, p_2, …, p_n]. As shown in fig. 4, fig. 4 is a schematic diagram of the weighting strategy provided in an embodiment of the present application. In fig. 4, it is determined whether a click operation was performed on the obtained search result: if there is no click operation, the weight is reduced; if there is a click operation but no follow-up question, the weight is increased; if there is a click operation with a dissimilar follow-up question, the weight is increased; if there is a click operation with a similar follow-up question, the weight is reduced.
An initial weight value may be set for each text in the search result, for example, the initial weight value is 0, but other numerical values may also be used. The magnitude of each weight reduction and weighting can be set according to actual conditions, and specific numerical values are not limited herein. In the embodiment of the application, the weight value of each behavior category may be determined according to a weighting policy, and the behavior categories and the corresponding weight values may be stored in an associated manner.
It can be understood that a click operation indicates that the corresponding text matches the user's expectation, so its weight is increased; the user not asking a follow-up question likewise indicates the text matched expectations, so the weight is increased; and the user asking a follow-up question dissimilar to the original query also indicates the text matched expectations, so the weight is increased.
For example, the second weight value corresponding to each candidate text may be correspondingly determined according to the target behavior category corresponding to each candidate text based on the corresponding relationship between the behavior categories and the weight values.
For example, when the target behavior category is a click operation but is not being asked, the corresponding second weight value may be 5. For another example, when the target behavior category is a question of a click operation and the questions are not similar, the corresponding second weight value may be 7.
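The category-to-weight mapping can be sketched as a lookup table. The +5 and +7 values come from the examples just given; the down-weighting magnitudes are assumptions, since the text leaves the concrete amounts to be set according to actual conditions.

```python
# Hypothetical weight adjustments for the four behavior categories of Fig. 4.
# +5 and +7 follow the examples in the text; -2 and -3 are assumed values.
BEHAVIOR_WEIGHTS = {
    "no_click": -2,                   # no click -> weight down
    "click_no_question": 5,           # click, no follow-up -> weight up
    "click_dissimilar_question": 7,   # click, dissimilar follow-up -> weight up
    "click_similar_question": -3,     # click, similar follow-up -> weight down
}

def second_weight(behavior_category: str, initial: int = 0) -> int:
    """Second weight value: initial weight adjusted by the behavior category."""
    return initial + BEHAVIOR_WEIGHTS[behavior_category]
```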
By acquiring the user feedback information and performing weighting processing on each candidate text according to the user feedback information to obtain the second weight value of each candidate text, the influence of the feedback operation of the user on the search can be fully considered, and the user experience and the acceptance of the user on the retrieval result are further improved.
Step S206, determining a target weight value corresponding to each candidate text according to the first weight value and the second weight value.
For example, a first weight value corresponding to the candidate text may be added to a second weight value to obtain a target weight value corresponding to the candidate text. In addition, the first weight value and a preset first proportion can be multiplied, the second weight value and a preset second proportion can be multiplied, and the two obtained multiplication values are added to serve as a target weight value corresponding to the candidate text. The first ratio and the second ratio can be set according to actual conditions, and specific numerical values are not limited herein.
By determining the target weight value corresponding to each candidate text according to the first weight value and the second weight value, the frequency of the candidate texts and the feedback operation of the user can be combined for weighting, and the accuracy of subsequent text retrieval is improved.
And S30, based on a preset text screening strategy, performing text screening according to the target weight value corresponding to each candidate text to obtain a target text.
It should be noted that, in order to further improve the accuracy of text retrieval, after the candidate texts are obtained, text screening is performed according to a target weight value corresponding to each candidate text based on a preset text screening policy to obtain a target text. The text screening mainly includes hyperplane distance screening and similarity screening, and in the embodiment of the present application, how to perform the hyperplane distance screening and the similarity screening will be described in detail.
Referring to fig. 5, fig. 5 is a schematic flowchart of a text filtering sub-step provided in an embodiment of the present application, which may specifically include the following steps S301 to S303.
Step S301, a first candidate text set is determined according to the candidate texts with the target weight values larger than a preset weight threshold value.
For example, candidate texts with target weight values larger than a preset weight threshold value can be selected to generate a first candidate text set. The first set of candidate text may include at least one candidate text. The preset weight threshold is set according to actual conditions, and specific values are not limited herein.
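A minimal sketch of this screening step, assuming target weight values are held in a dictionary keyed by candidate text:

```python
def first_candidate_set(candidates, weights, threshold):
    """Keep candidate texts whose target weight value exceeds the threshold.

    `weights` is assumed to map each candidate text to its target weight value.
    """
    return [c for c in candidates if weights[c] > threshold]
```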
And S302, carrying out hyperplane distance screening on the first candidate text set to obtain a second candidate text set.
In the embodiment of the present application, in order to improve the accuracy of text retrieval, the first candidate text set may be screened according to the hyperplane in a Support Vector Machine (SVM) model to obtain the second candidate text set. It should be noted that the hyperplane in the SVM model is used for binary classification.
In the SVM model, the hyperplane equation is an n-ary linear equation:

w_1 x_1 + w_2 x_2 + … + w_n x_n + b = 0

The corresponding vector form is:

w^T x + b = 0

By analogy with the point-to-plane distance formula in three-dimensional space, the distance from a point x to the hyperplane is:

d = |w^T x + b| / ||w||

where w_i represents the weight of the i-th dimension of the high-dimensional vector; w represents the set of weights, i.e., [w_1, w_2, …, w_n]; and b represents the offset.
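The point-to-hyperplane distance can be computed directly from this formula; the sketch below assumes plain Python lists for w and x:

```python
import math


def hyperplane_distance(w, x, b):
    """Distance from point x to the hyperplane w·x + b = 0, i.e. |w·x + b| / ||w||."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm
```

For example, for w = [3, 4], b = 0, the point (1, 1) lies at distance |3 + 4| / 5 = 1.4 from the hyperplane.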
In the embodiment of the application, the SVM model may be iteratively trained in advance with positive and negative sample data until convergence, so as to obtain a trained SVM model. The specific iterative training process is not limited herein. Training the SVM model serves to adjust and optimize the relevant parameters of the hyperplane equation in the SVM model.
In some embodiments, performing hyperplane distance filtering on the first candidate text set to obtain a second candidate text set may include: extracting the characteristics of each candidate text in the first candidate text set to obtain the characteristic vector of each candidate text in the first candidate text set; inputting the feature vector of each candidate text in the first candidate text set into a hyperplane distance model for hyperplane distance calculation to obtain the hyperplane distance of each candidate text in the first candidate text set; and determining the candidate texts with the hyperplane distance larger than a preset distance threshold as a second candidate text set.
For example, when feature extraction is performed on each candidate text in the first candidate text set, a feature extraction algorithm may be used. The feature extraction algorithm may include, but is not limited to, a forward maximum matching algorithm, a reverse maximum matching algorithm, a bidirectional maximum matching algorithm, and the like. For example, feature extraction is performed on the candidate text by using a forward maximum matching algorithm to obtain a forward maximum matching feature vector. For another example, a reverse maximum matching algorithm is adopted to perform feature extraction on the candidate text, so as to obtain a reverse maximum matching feature vector.
In addition, feature extraction can be performed on the candidate texts through a feature extraction model, so as to obtain a feature vector for each candidate text. The feature extraction model may include, but is not limited to, an n-gram model, a BERT (Bidirectional Encoder Representations from Transformers) model, a word2vec model, a GloVe model, an ELMo model, and the like. For example, the candidate text can be input into an n-gram model for feature extraction to obtain an n-gram feature vector.
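As an illustration of one simple feature-extraction choice — character n-gram counts, a common stand-in for the n-gram features mentioned above, though not necessarily the model used in the embodiment:

```python
from collections import Counter


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Character n-gram counts, a sparse bag-of-n-grams feature representation."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))
```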
For example, the feature vector of each candidate text in the first candidate text set may be input into a hyperplane distance model for hyperplane distance calculation, so as to obtain a hyperplane distance of each candidate text in the first candidate text set. The specific calculation process is not limited herein.
Note that the hyperplane distance refers to a distance from a feature vector of the candidate text to the hyperplane.
For example, after calculating the hyperplane distance of each candidate text in the first candidate text set, candidate texts with hyperplane distances greater than a preset distance threshold may be determined as the second candidate text set. The preset distance threshold may be set according to actual conditions, and the specific numerical value is not limited herein.
By performing hyperplane distance screening on the first candidate text set, coarse screening of candidate texts in the first candidate text set can be realized.
Step S303, carrying out similarity screening on the second candidate text set according to the text to be retrieved to obtain the target text.
In the embodiment of the application, after the first candidate text set is subjected to hyperplane distance screening to obtain the second candidate text set, the second candidate text set can also be subjected to similarity screening according to the feature vector and semantic information of the text, so that the accuracy of text retrieval is further improved.
In some embodiments, the performing similarity screening on the second candidate text set according to the text to be retrieved to obtain the target text may include: calculating the text similarity between each candidate text in the second candidate text set and the text to be retrieved; and determining the candidate texts with the text similarity larger than a preset similarity threshold as target texts.
The calculating the text similarity between each candidate text in the second candidate text set and the text to be retrieved may include: acquiring first feature information of each candidate text in a second candidate text set, wherein the first feature information comprises feature vectors and semantic information; acquiring second characteristic information of the text to be retrieved, wherein the second characteristic information comprises a characteristic vector and semantic information; and inputting the first characteristic information of each candidate text in the second candidate text set and the second characteristic information of the text to be retrieved into a similarity calculation model for similarity calculation to obtain the text similarity between each candidate text in the second candidate text set and the text to be retrieved.
For example, since feature extraction has already been performed on each candidate text in the first candidate text set, the feature vector of each candidate text in the second candidate text set may be obtained directly. For semantic information, the semantic information of each candidate text in the second candidate text set may be extracted by a semantic encoder; the specific extraction process is not limited herein. It should be noted that the semantic encoder may include an input gate, a forget gate, and an output gate, which implement a memory cell so as to effectively extract contextual semantic information.
Illustratively, when the second feature information of the text to be retrieved is obtained, feature extraction is performed on the text to be retrieved through a feature extraction algorithm or a feature extraction model. For a specific feature extraction process, reference may be made to the process of extracting features from each candidate text in the first candidate text set in the foregoing embodiment, which is not described herein again. When extracting semantic information, the semantic information of the text to be retrieved can be extracted through a semantic encoder.
In some embodiments, before inputting the first feature information of each candidate text in the second candidate text set and the second feature information of the text to be retrieved into the similarity calculation model for similarity calculation, the feature vector of each candidate text in the second candidate text set and the feature vector of the text to be retrieved may be spliced to obtain a spliced feature vector, and the semantic information of each candidate text in the second candidate text set and the semantic information of the text to be retrieved are spliced to obtain spliced semantic information; and then inputting the splicing feature vector and the splicing semantic information into a similarity calculation model for similarity calculation.
For example, the similarity calculation model may include, but is not limited to, a neural network model, a convolutional neural network, a restricted Boltzmann machine, a recurrent neural network, or the like. The similarity calculation model may include at least one fully connected layer, a self-attention layer, a normalization layer, and an output layer, and is used to calculate the text similarity between the candidate text and the text to be retrieved. It can be understood that the similarity calculation model outputs a predicted probability of similarity between the candidate text and the text to be retrieved, which can be used as the text similarity.
For example, the spliced feature vector and the spliced semantic information are input into the similarity calculation model; the spliced feature vector is processed by a first fully connected layer, and the vector output by the first fully connected layer is then processed by a second fully connected layer. Meanwhile, the spliced semantic information is processed through a self-attention layer, and the information output by the self-attention layer is processed by a third fully connected layer. The vector output by the second fully connected layer is then spliced with the vector output by the third fully connected layer and input into a fourth fully connected layer; the result output by the fourth fully connected layer is input into a normalization layer for normalization, and the output layer then produces the similarity between the candidate text and the text to be retrieved.
In the embodiment of the application, the text similarity between each candidate text in the second candidate text set and the text to be retrieved can be calculated through the similarity calculation model, or directly through a similarity algorithm. Illustratively, the similarity algorithm may include, but is not limited to, Euclidean distance, cosine similarity, the Jaccard similarity coefficient, and the Pearson correlation coefficient. For example, a cosine similarity algorithm may be adopted to calculate the cosine value between the first feature information of each candidate text in the second candidate text set and the second feature information of the text to be retrieved, and the obtained cosine value is used as the similarity. The specific calculation process is not limited herein.
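The cosine similarity variant mentioned above can be sketched as follows, assuming dense, equal-length feature vectors:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```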
For example, after the text similarity between each candidate text in the second candidate text set and the text to be retrieved is calculated, the candidate text with the text similarity greater than the preset similarity threshold may be determined as the target text. The preset similarity threshold may be set according to actual conditions, and the specific numerical value is not limited herein. For example, if the text similarity of the candidate text a is greater than a preset similarity threshold, the candidate text a may be determined as the target text. If the text similarity of the candidate text B is greater than a preset similarity threshold, the candidate text B can be determined as the target text.
Illustratively, after the target text is obtained, the target text can be output on the human-computer interaction platform. For example, the target text is displayed on the corresponding display interface.
And performing similarity screening on the second candidate text set according to the text to be retrieved, so that further fine screening can be performed on the basis of coarse screening, and the obtained target text is more accurate.
The text retrieval method provided by this embodiment obtains at least one candidate text by performing similar pinyin retrieval on the text to be retrieved, which enables rapid screening of candidate texts from a large volume of text, greatly reduces the amount of data for subsequent text screening and the interference of irrelevant data, and improves the efficiency and accuracy of text retrieval; at the same time, it alleviates the problem of pronunciation recognition errors and improves tolerance of recognition errors caused by a user's inaccurate pronunciation, further improving retrieval accuracy. Performing similar pinyin conversion on each initial pinyin based on the pinyin mapping library to obtain a target pinyin for each initial pinyin addresses the problem of incorrect pronunciation recognition. Performing text queries according to the target index corresponding to each target pinyin, based on the named entity index library, improves the efficiency of querying the candidate text corresponding to each target pinyin. Obtaining user feedback information and weighting each candidate text accordingly to obtain its second weight value allows the influence of the user's feedback operations on the search to be fully considered, further improving user experience and user acceptance of the retrieval results. Determining the target weight value for each candidate text from the first weight value and the second weight value allows weighting that combines candidate text frequency with user feedback, improving the accuracy of subsequent retrieval. Performing hyperplane distance screening on the first candidate text set achieves coarse screening of the candidate texts in that set; and performing similarity screening on the second candidate text set according to the text to be retrieved allows further fine screening on top of the coarse screening, making the obtained target text more accurate.
Referring to fig. 6, fig. 6 is a schematic block diagram of a computer device according to an embodiment of the present disclosure. The computer device may be a server or a terminal.
Referring to fig. 6, the computer device includes a processor and a memory connected by a system bus, wherein the memory may include a storage medium and an internal memory. The storage medium may be a volatile storage medium or a nonvolatile storage medium.
The processor is used for providing calculation and control capability and supporting the operation of the whole computer equipment.
The internal memory provides an environment for the execution of a computer program on a non-volatile storage medium, which when executed by the processor, causes the processor to perform any of the text retrieval methods.
It should be understood that the processor may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
acquiring a text to be retrieved; performing similar pinyin retrieval on the text to be retrieved to obtain at least one candidate text, and determining a target weight value corresponding to each candidate text; and based on a preset text screening strategy, performing text screening according to a target weight value corresponding to each candidate text to obtain a target text.
In one embodiment, the processor is configured to, when implementing similar pinyin retrieval on the text to be retrieved to obtain at least one candidate text, implement:
determining pinyin information corresponding to the text to be retrieved, wherein the pinyin information comprises at least one initial pinyin; based on a preset pinyin mapping library, performing similar pinyin conversion on each initial pinyin to obtain a target pinyin corresponding to each initial pinyin; and based on a preset named entity index library, performing text query according to the target index corresponding to each target pinyin to obtain a candidate text corresponding to each target pinyin.
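A minimal sketch of the similar pinyin conversion step, assuming a hypothetical fragment of the pinyin mapping library; the specific confusion pairs (zh/z, sh/s, l/n) are illustrative assumptions about commonly confused pronunciations, not the patent's actual library:

```python
# Hypothetical fragment of a pinyin mapping library: each initial pinyin maps
# to similar-sounding variants. The entries are illustrative assumptions.
SIMILAR_PINYIN = {
    "zhang": ["zang"],  # zh/z confusion
    "shi": ["si"],      # sh/s confusion
    "lan": ["nan"],     # l/n confusion
}


def similar_pinyins(initial_pinyin: str) -> list:
    """Return the initial pinyin together with its similar-sounding target pinyins."""
    return [initial_pinyin] + SIMILAR_PINYIN.get(initial_pinyin, [])
```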
In one embodiment, the processor, in implementing determining the target weight value corresponding to each of the candidate texts, is configured to implement:
counting the occurrence frequency of each candidate text, and correspondingly determining the target weight value of each candidate text according to the frequency of each candidate text; or counting the occurrence frequency of each candidate text, and correspondingly determining a first weight value of each candidate text according to the frequency of each candidate text; acquiring user feedback information, and performing weighting processing on each candidate text according to the user feedback information to obtain a second weight value of each candidate text; and determining a target weight value corresponding to each candidate text according to the first weight value and the second weight value.
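The frequency-counting step above can be sketched with a standard counter, where each candidate text's occurrence count serves as its first weight value:

```python
from collections import Counter


def first_weights(candidate_texts):
    """Count occurrences of each candidate text; the count is its first weight value."""
    return Counter(candidate_texts)
```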
In one embodiment, when the processor performs weighting processing on each candidate text according to the user feedback information to obtain a second weight value of each candidate text, the processor is configured to perform:
inputting the user feedback information into a behavior clustering model for classification to obtain a target behavior category corresponding to each candidate text; and correspondingly determining a second weight value corresponding to each candidate text according to the target behavior category corresponding to each candidate text based on the corresponding relation between the preset behavior category and the weight value.
In one embodiment, when implementing a preset text screening policy, and performing text screening according to a target weight value corresponding to each candidate text to obtain a target text, the processor is configured to implement:
determining a first candidate text set according to candidate texts with target weight values larger than a preset weight threshold; performing hyperplane distance screening on the first candidate text set to obtain a second candidate text set; and performing similarity screening on the second candidate text set according to the text to be retrieved to obtain the target text.
In one embodiment, the processor, when performing hyperplane distance filtering on the first candidate text set to obtain a second candidate text set, is configured to perform:
extracting features of each candidate text in the first candidate text set to obtain a feature vector of each candidate text in the first candidate text set; inputting the feature vector of each candidate text in the first candidate text set into a hyperplane distance model for hyperplane distance calculation to obtain the hyperplane distance of each candidate text in the first candidate text set; and determining the candidate texts with the hyperplane distance larger than a preset distance threshold as the second candidate text set.
In one embodiment, when the processor performs similarity screening on the second candidate text set according to the text to be retrieved to obtain the target text, the processor is configured to:
calculating the text similarity between each candidate text in the second candidate text set and the text to be retrieved; and determining the candidate text with the text similarity larger than a preset similarity threshold as the target text.
In one embodiment, the processor, in performing calculating the text similarity between each candidate text in the second candidate text set and the text to be retrieved, is configured to perform:
acquiring first feature information of each candidate text in the second candidate text set, wherein the first feature information comprises feature vectors and semantic information; acquiring second characteristic information of the text to be retrieved, wherein the second characteristic information comprises a characteristic vector and semantic information; and inputting the first characteristic information of each candidate text in the second candidate text set and the second characteristic information of the text to be retrieved into a similarity calculation model for similarity calculation to obtain the text similarity between each candidate text in the second candidate text set and the text to be retrieved.
The embodiment of the application further provides a computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, the computer program comprises program instructions, and the processor executes the program instructions to implement any text retrieval method provided by the embodiment of the application.
For example, the program is loaded by a processor and may perform the following steps:
acquiring a text to be retrieved; performing similar pinyin retrieval on the text to be retrieved to obtain at least one candidate text, and determining a target weight value corresponding to each candidate text; and based on a preset text screening strategy, performing text screening according to a target weight value corresponding to each candidate text to obtain a target text.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital Card (SD Card), a Flash memory Card (Flash Card), and the like provided on the computer device.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, each containing information on a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A text retrieval method, comprising:
acquiring a text to be retrieved;
performing similar pinyin retrieval on the text to be retrieved to obtain at least one candidate text, and determining a target weight value corresponding to each candidate text;
and based on a preset text screening strategy, performing text screening according to a target weight value corresponding to each candidate text to obtain a target text.
2. The text retrieval method of claim 1, wherein the performing similar pinyin retrieval on the text to be retrieved to obtain at least one candidate text comprises:
determining pinyin information corresponding to the text to be retrieved, wherein the pinyin information comprises at least one initial pinyin;
based on a preset pinyin mapping library, performing similar pinyin conversion on each initial pinyin to obtain a target pinyin corresponding to each initial pinyin;
and based on a preset named entity index library, performing text query according to the target index corresponding to each target pinyin to obtain a candidate text corresponding to each target pinyin.
3. The method of claim 1, wherein the determining the target weight value corresponding to each candidate text comprises:
counting the occurrence frequency of each candidate text, and correspondingly determining the target weight value of each candidate text according to the frequency of each candidate text; or
Counting the occurrence frequency of each candidate text, and correspondingly determining a first weight value of each candidate text according to the frequency of each candidate text;
acquiring user feedback information, and performing weighting processing on each candidate text according to the user feedback information to obtain a second weight value of each candidate text;
and determining a target weight value corresponding to each candidate text according to the first weight value and the second weight value.
4. The text retrieval method of claim 3, wherein the weighting each candidate text according to the user feedback information to obtain a second weight value of each candidate text comprises:
inputting the user feedback information into a behavior clustering model for classification to obtain a target behavior category corresponding to each candidate text;
and correspondingly determining a second weight value corresponding to each candidate text according to the target behavior category corresponding to each candidate text based on the corresponding relation between the preset behavior category and the weight value.
5. The method of claim 1, wherein the text screening is performed according to a target weight value corresponding to each candidate text based on a preset text screening policy to obtain a target text, and the method includes:
determining a first candidate text set according to candidate texts with target weight values larger than a preset weight threshold;
performing hyperplane distance screening on the first candidate text set to obtain a second candidate text set;
and performing similarity screening on the second candidate text set according to the text to be retrieved to obtain the target text.
6. The method of claim 5, wherein the performing hyperplane distance filtering on the first set of candidate texts to obtain a second set of candidate texts comprises:
performing feature extraction on each candidate text in the first candidate text set to obtain a feature vector of each candidate text in the first candidate text set;
inputting the feature vector of each candidate text in the first candidate text set into a hyperplane distance model for hyperplane distance calculation to obtain the hyperplane distance of each candidate text in the first candidate text set;
and determining the candidate texts with the hyperplane distance larger than a preset distance threshold as the second candidate text set.
7. The text retrieval method of claim 5, wherein the performing similarity screening on the second candidate text set according to the text to be retrieved to obtain the target text comprises:
calculating the text similarity between each candidate text in the second candidate text set and the text to be retrieved;
and determining the candidate text with the text similarity larger than a preset similarity threshold as the target text.
8. The text retrieval method of claim 7, wherein the calculating the text similarity between each candidate text in the second candidate text set and the text to be retrieved comprises:
acquiring first feature information of each candidate text in the second candidate text set, wherein the first feature information comprises feature vectors and semantic information;
acquiring second characteristic information of the text to be retrieved, wherein the second characteristic information comprises a characteristic vector and semantic information;
and inputting the first characteristic information of each candidate text in the second candidate text set and the second characteristic information of the text to be retrieved into a similarity calculation model for similarity calculation to obtain the text similarity between each candidate text in the second candidate text set and the text to be retrieved.
9. A computer device, wherein the computer device comprises a memory and a processor;
the memory for storing a computer program;
the processor for executing the computer program and implementing the text retrieval method of any one of claims 1 to 8 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement the text retrieval method according to any one of claims 1 to 8.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202211054998.4A (CN115455142A) | 2022-08-31 | 2022-08-31 | Text retrieval method, computer device and storage medium |
Publications (1)

| Publication Number | Publication Date |
| --- | --- |
| CN115455142A | 2022-12-09 |
Family
ID=84300495
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
| --- | --- | --- | --- |
| CN202211054998.4A (pending) | Text retrieval method, computer device and storage medium | 2022-08-31 | 2022-08-31 |
Country Status (1)

| Country | Link |
| --- | --- |
| CN | CN115455142A (en) |
Events
- 2022-08-31: Application CN202211054998.4A filed in CN; published as CN115455142A (en); status: active, pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116226426A (en) * | 2023-05-09 | 2023-06-06 | 深圳开鸿数字产业发展有限公司 | Three-dimensional model retrieval method based on shape, computer device and storage medium |
CN116226426B (en) * | 2023-05-09 | 2023-07-11 | 深圳开鸿数字产业发展有限公司 | Three-dimensional model retrieval method based on shape, computer device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021208287A1 (en) | Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium | |
CN110457432B (en) | Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium | |
RU2678716C1 (en) | Use of autoencoders for learning text classifiers in natural language | |
CN108846077B (en) | Semantic matching method, device, medium and electronic equipment for question and answer text | |
CN107729313B (en) | Deep neural network-based polyphone pronunciation distinguishing method and device | |
CN107229627B (en) | Text processing method and device and computing equipment | |
CN108227564B (en) | Information processing method, terminal and computer readable medium | |
CN105702251B (en) | Reinforce the speech-emotion recognition method of audio bag of words based on Top-k | |
CN108536807B (en) | Information processing method and device | |
Dua et al. | Discriminative training using noise robust integrated features and refined HMM modeling | |
CN109947971B (en) | Image retrieval method, image retrieval device, electronic equipment and storage medium | |
CN113569578B (en) | User intention recognition method and device and computer equipment | |
CN111126084B (en) | Data processing method, device, electronic equipment and storage medium | |
CN116932735A (en) | Text comparison method, device, medium and equipment | |
CN116127001A (en) | Sensitive word detection method, device, computer equipment and storage medium | |
CN115455142A (en) | Text retrieval method, computer device and storage medium | |
CN111241106A (en) | Approximate data processing method, device, medium and electronic equipment | |
CN118152570A (en) | Intelligent text classification method | |
WO2024093578A1 (en) | Voice recognition method and apparatus, and electronic device, storage medium and computer program product | |
CN112100360B (en) | Dialogue response method, device and system based on vector retrieval | |
CN117235137A (en) | Professional information query method and device based on vector database | |
WO2023116572A1 (en) | Word or sentence generation method and related device | |
CN116955559A (en) | Question-answer matching method and device, electronic equipment and storage medium | |
CN111506764B (en) | Audio data screening method, computer device and storage medium | |
CN112528646A (en) | Word vector generation method, terminal device and computer-readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||