Disclosure of Invention
In order to solve the problems in the prior art, embodiments of the present invention provide a query method, an apparatus, a computer device, and a storage medium based on vertical search, which support intention identification of multi-field search of a search engine in the vertical field, and further improve accuracy of search results and search experience.
In order to solve one or more technical problems, the invention adopts the technical scheme that:
in a first aspect, a vertical search based query method is provided, which includes the following steps:
performing regular matching on received initial query sentences, acquiring first sentences which meet matching rules in the initial query sentences, and determining first attribute categories corresponding to the first sentences;
preprocessing a second sentence which does not meet the matching rule in the initial query sentence to obtain a keyword corresponding to the second sentence, and classifying each keyword by using a pre-trained classification model to obtain a second attribute category of each keyword;
generating a target query statement according to the first statement, the first attribute category, the keyword and the second attribute category;
and calling a preset search engine interface, and matching a query result according to the target query statement.
In some embodiments, the preprocessing a second statement that does not satisfy a matching rule in the initial query statement, and the obtaining a keyword corresponding to the second statement includes:
performing word segmentation processing on a second sentence which does not meet the matching rule in the initial query sentence to obtain a word segmentation result;
and determining the keywords of the second sentence according to the word segmentation result and a preset rule.
In some embodiments, before performing the word segmentation processing on the second sentence which does not satisfy the matching rule in the initial query sentence, the method includes:
and denoising the second sentence to remove the noise characters in the second sentence.
In some embodiments, the generating a target query statement from the first statement, the first attribute category, the keyword, and the second attribute category comprises:
generating data pairs respectively based on the first sentence and the corresponding first attribute category, the keyword and the corresponding second attribute category;
and generating a target query statement according to the data pair and a preset index rule of a search engine.
In some embodiments, the method further comprises a training process of the classification model, comprising:
acquiring training data according to a service scene;
and training a preset classifier by using the training data to obtain a trained classification model.
In some embodiments, the pre-set classifier comprises a logistic regression classifier or a support vector machine classifier.
In a second aspect, a vertical search based query device is provided, the device comprising:
the matching module is used for performing regular matching on the received initial query statement, acquiring a first statement which meets a matching rule in the initial query statement, and determining a first attribute category corresponding to the first statement;
the acquisition module is used for preprocessing a second statement which does not meet the matching rule in the initial query statement to acquire a keyword corresponding to the second statement;
the classification module is used for classifying each keyword by using a pre-trained classification model to acquire a second attribute category of each keyword;
a generation module, configured to generate a target query statement according to the first statement, the first attribute category, the keyword, and the second attribute category;
and the query module is used for calling a preset search engine interface and matching a query result according to the target query statement.
In a third aspect, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the following steps are implemented:
performing regular matching on received initial query sentences, acquiring first sentences which meet matching rules in the initial query sentences, and determining first attribute categories corresponding to the first sentences;
preprocessing a second sentence which does not meet the matching rule in the initial query sentence to obtain a keyword corresponding to the second sentence, and classifying each keyword by using a pre-trained classification model to obtain a second attribute category of each keyword;
generating a target query statement according to the first statement, the first attribute category, the keyword and the second attribute category;
and calling a preset search engine interface, and matching a query result according to the target query statement.
In a fourth aspect, there is provided a computer readable storage medium having a computer program stored thereon, which when executed by a processor, performs the steps of:
performing regular matching on received initial query sentences, acquiring first sentences which meet matching rules in the initial query sentences, and determining first attribute categories corresponding to the first sentences;
preprocessing a second sentence which does not meet the matching rule in the initial query sentence to obtain a keyword corresponding to the second sentence, and classifying each keyword by using a pre-trained classification model to obtain a second attribute category of each keyword;
generating a target query statement according to the first statement, the first attribute category, the keyword and the second attribute category;
and calling a preset search engine interface, and matching a query result according to the target query statement.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
The query method, apparatus, computer device, and storage medium based on vertical search provided by the embodiments of the present invention perform regular-expression matching on a received initial query statement to obtain a first statement that satisfies a matching rule, and determine a first attribute category corresponding to the first statement; preprocess a second statement that does not satisfy the matching rule to obtain keywords corresponding to the second statement; classify each keyword with a pre-trained classification model to obtain a second attribute category of each keyword; generate a target query statement according to the first statement, the first attribute category, the keywords, and the second attribute categories; and call a preset search engine interface to match a query result according to the target query statement. Search intention identification of query statements is thereby implemented for a vertical search engine, which improves query efficiency and user experience.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
As described in the background, in a specific field such as vertical search, data of the vertical field (for example, structured business data, much of which is text) usually has to be built into a vertical search engine in order to provide query and search services over that business data. Once the search engine is built, its efficient indexing technology provides the business with a query function over the service data. A good vertical search engine must not only provide a data query function but also support retrieval across multiple dimensions from a single input. This requires the vertical search engine to intelligently identify the query terms entered by the user and the data attribute fields to which they belong, so that the search query statement can be further optimized.
To solve the above problems, the embodiment of the present invention provides a query method based on vertical search. The method includes a text recognition method for a vertical search engine: a multi-category text classification model over the field attributes available for user search is trained on service scene data of the vertical field after structured and unstructured cleaning, and short-text attribute recognition is performed on the keywords input by the user, so that the search engine can search different fields accordingly. Taking an enterprise information search engine as an example, the information is organized around enterprises and includes text information such as enterprise names and legal person names, character string information such as registration numbers and unified social credit codes, numerical information such as registered capital, and other information. To query a given enterprise, the vertical search engine supports text search on enterprise names or legal person information and also exact string matching on registration numbers and unified social credit codes.
Fig. 1 is an architecture diagram illustrating a search intention recognition system according to an exemplary embodiment. Referring to Fig. 1, the system comprises at least a memory, a system bus, a processor, and a network, wherein the memory may be composed of a plurality of storage media such as RAM; the specific properties of the memory are not limited herein.
Specifically, the above scheme can be realized by the following steps:
Step one: construct a multi-class text classification model for a vertical search engine based on service scene data. Specifically, in the embodiment of the present invention, this step includes the following processes:
(1) business scenario data preparation and extraction
Specifically, firstly, the service data of the vertical search engine to be constructed is analyzed in combination with the service scene requirements of the vertical search engine, data extraction is performed from a related database to obtain structured data, preparation is made for establishing index data of the vertical search engine, and training data is provided for model training.
(2) Determining search dimension targets
Determine the search fields of the vertical search engine, that is, the fields that are to be made available for user search queries when the vertical search engine is created. For example, a search engine system for querying property listing information needs to provide query functions over multiple fields, such as the residential community name and the listing agent name. All data under these fields is taken from the structured data obtained in the previous step, and the label of each field is attached to the corresponding field data to form multi-category labeled data.
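The labeling described above can be sketched as follows. This is only an illustrative sketch: the field names, label strings, and sample records are hypothetical, and removing duplicate values is an added assumption rather than something the embodiment requires.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical searchable fields of a property-listing index and the
# label attached to values of each field.
SEARCH_FIELDS = {
    "community_name": "community name",
    "agent_name": "agent name",
}


def build_labeled_data(
    records: Iterable[Dict[str, str]],
    search_fields: Dict[str, str] = SEARCH_FIELDS,
) -> List[Tuple[str, str]]:
    """Turn structured records into (text, label) pairs, one per field value."""
    labeled: List[Tuple[str, str]] = []
    seen = set()
    for record in records:
        for field, label in search_fields.items():
            value = (record.get(field) or "").strip()
            # Assumption: empty values and exact duplicates are dropped.
            if value and (value, label) not in seen:
                seen.add((value, label))
                labeled.append((value, label))
    return labeled


# Made-up sample rows standing in for data extracted from a database.
records = [
    {"community_name": "Sunshine Garden", "agent_name": "Zhang San", "price": "300"},
    {"community_name": "Sunshine Garden", "agent_name": "Li Si", "price": "280"},
]
labeled = build_labeled_data(records)
```

Fields that are not meant to be searchable (here, `price`) are simply ignored, so the label set is controlled entirely by the field table.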
(3) Feature extraction, model training, and related processes
FIG. 2 is a flow diagram illustrating training of a text model according to an exemplary embodiment. Referring to FIG. 2, features of the corresponding fields are selected according to the above data format. In a specific implementation, the relevant text content fields can be segmented by character or by word, features such as TF-IDF features are extracted from the resulting tokens, and the corresponding feature vectors (for example, TF-IDF feature vectors) are generated as model training data. The classification model can be trained with a classifier from the scikit-learn machine learning library, such as a logistic regression classifier or a support vector machine classifier, or with another classifier constructed for the purpose.
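To make the feature extraction concrete, the following sketch computes character-level TF-IDF vectors using only the Python standard library. It is an illustration under stated assumptions, not the embodiment's implementation: the function names are hypothetical, term frequency is taken as a relative frequency, and one common smoothed IDF variant is used; a library vectorizer may weight and normalize differently.

```python
import math
from collections import Counter
from typing import Dict, List


def char_tokens(text: str) -> List[str]:
    """Segment a text by character, dropping whitespace."""
    return [ch for ch in text if not ch.isspace()]


def tfidf_vectors(texts: List[str]) -> List[Dict[str, float]]:
    """Return one sparse TF-IDF vector (token -> weight) per input text."""
    docs = [Counter(char_tokens(t)) for t in texts]
    n_docs = len(docs)
    # Document frequency: in how many texts each character occurs.
    df: Counter = Counter()
    for doc in docs:
        df.update(doc.keys())
    # Assumed smoothed IDF: ln((1 + N) / (1 + df)) + 1.
    idf = {tok: math.log((1 + n_docs) / (1 + count)) + 1.0 for tok, count in df.items()}
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({tok: (cnt / total) * idf[tok] for tok, cnt in doc.items()})
    return vectors


vectors = tfidf_vectors(["ab", "ac"])
```

Characters shared by every text (here `a`) receive the lowest weight, while characters that distinguish a text (`b`, `c`) are weighted higher, which is the property the classifier relies on.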
Taking the construction of a classification model for enterprise names and person names as an example, the model is mainly used to distinguish queries for company names from queries for natural persons (such as legal persons, actual controllers, or senior executives), so that different query logic can be applied (for example, enterprise name search, legal person search, senior executive search, or combined search). The classification model is constructed as follows:
First, training data is extracted from a business database or a search engine. The data consists of an enterprise name list and a person name list, labeled 'enterprise name' and 'person name' respectively, in the following form:
enterprise name: [
'Beijing company of furniture',
'Shenzhen software Limited',
……,
'Hubei star mechanical plant'
]
person name: [
'Ren',
'Zhang',
'Chen',
……,
'Li'
]
Second, a data set is constructed, examples of which are as follows:
('Beijing company of furniture', 'enterprise name')
('Shenzhen software Limited', 'enterprise name')
……
('Hubei star mechanical plant', 'enterprise name')
('Ren', 'person name')
('Zhang', 'person name')
……
('Li', 'person name')
Then, word segmentation processing is carried out on the data set according to characters, and the processing result is as follows:
('Beijing city house with limited company', 'enterprise name')
('Shenzhen star software restricted company', 'enterprise name')
……
('Zhang xing', 'person name')
……
('Li xing', 'person name')
Then, the data set is randomly shuffled and split into a training set and a test set, for example at a ratio of 4:1. The scikit-learn machine learning library is used to extract TF-IDF text vectors and generate the TF-IDF matrix of the training set, and a classifier (such as naive Bayes, logistic regression, or a support vector machine) is selected for model training to obtain the trained classifier;
Finally, the prediction capability of the classifier is tested: the test set generated in the previous step is used to evaluate the model and thereby assess the practicability of the classifier.
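The split, training, and evaluation steps above can be sketched as follows. To stay self-contained, the sketch swaps in a small character-level multinomial naive Bayes classifier written with the standard library in place of a scikit-learn TF-IDF pipeline; the sample names, the class `CharNaiveBayes`, and the helper functions are all hypothetical.

```python
import math
import random
from collections import Counter, defaultdict
from typing import List, Tuple

Sample = Tuple[str, str]  # (text, label)

# Made-up training samples; real data would come from the business database.
DATA: List[Sample] = [
    ("Beijing Furniture Co Ltd", "enterprise name"),
    ("Shenzhen Software Co Ltd", "enterprise name"),
    ("Hubei Machinery Co Ltd", "enterprise name"),
    ("Shanghai Trading Co Ltd", "enterprise name"),
    ("Nanjing Textile Co Ltd", "enterprise name"),
    ("Zhang San", "person name"),
    ("Li Si", "person name"),
    ("Wang Wu", "person name"),
    ("Chen Liu", "person name"),
    ("Zhao Qi", "person name"),
]


def split_dataset(data: List[Sample], test_ratio: float = 0.2, seed: int = 0):
    """Shuffle the data and cut it into training and test sets (4:1 by default)."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


class CharNaiveBayes:
    """Multinomial naive Bayes over characters with add-one smoothing."""

    def fit(self, data: List[Sample]) -> "CharNaiveBayes":
        self.char_counts = defaultdict(Counter)
        self.label_counts: Counter = Counter()
        self.vocab = set()
        for text, label in data:
            self.label_counts[label] += 1
            self.char_counts[label].update(text)
            self.vocab.update(text)
        return self

    def predict(self, text: str) -> str:
        n_samples = sum(self.label_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            total = sum(self.char_counts[label].values())
            # Log prior plus smoothed log likelihood of each character.
            score = math.log(count / n_samples)
            for ch in text:
                score += math.log((self.char_counts[label][ch] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model: CharNaiveBayes, test_set: List[Sample]) -> float:
    """Fraction of test samples whose predicted label equals the true label."""
    hits = sum(model.predict(text) == label for text, label in test_set)
    return hits / len(test_set)


train_set, test_set = split_dataset(DATA)
model = CharNaiveBayes().fit(train_set)
test_accuracy = accuracy(model, test_set)
```

With so few samples the measured accuracy says little; the sketch only shows where the held-out test set enters the evaluation.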
Step two: identify the search intention of the received initial query statement and generate a target query statement.
Specifically, FIG. 3 is a flowchart illustrating a process of identifying attribute categories of keywords according to an exemplary embodiment. Referring to FIG. 3, in an embodiment of the present invention, regular-expression matching is first performed on the input initial query statement, and the parts matched by the regular expressions are used directly for keyword search. The parts that are not matched then undergo text cleaning: taking text input as an example, noise characters such as useless characters and punctuation are removed, Chinese word segmentation is performed, and the list of keywords contained in the initial query statement is extracted.
Next, the classification model obtained in the above steps is called for each keyword to classify it and obtain its attribute category. The combination of one or more keyword dimensions serves as the input for judging the user's search intention; according to that intention, the search terms are further processed (for example, by error correction and association), and (keyword, attribute) data pairs are output.
In the embodiment of the invention, the vertical search engine accepts character input in any form, so the input query string (that is, the initial query statement) needs to be preprocessed: different kinds of input are distinguished, and the attribute of the input string is determined and output.
Examples are as follows:
Step 201: After the initial query statement is received, regular-expression matching is performed on it to determine whether it conforms to a code format such as a registration number or a unified social credit code. If so, the string is marked with the corresponding code attribute and output; otherwise, processing proceeds to step 202.
For example:
(1) Input "91320000608950986L"; the output is "unified social credit code".
(2) Input "future technology"; processing proceeds to the next step.
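Step 201 can be sketched with Python's `re` module as below. The two patterns are simplified assumptions (an 18-character code whose third to eighth characters are digits, and a 15-digit registration number); production rules would follow the official code specifications, including the permitted character set and check character. The function name is hypothetical.

```python
import re
from typing import Optional

# Assumed, simplified code formats. Each rule maps an attribute
# category to a pattern that the whole input must match.
CODE_RULES = [
    ("unified social credit code", re.compile(r"[0-9A-Z]{2}\d{6}[0-9A-Z]{10}")),
    ("registration number", re.compile(r"\d{15}")),
]


def match_code_attribute(query: str) -> Optional[str]:
    """Return the code attribute of the query, or None if no rule matches."""
    text = query.strip().upper()
    for attribute, pattern in CODE_RULES:
        if pattern.fullmatch(text):
            return attribute
    # No rule matched: the caller passes the input on to the classifier.
    return None


code_attribute = match_code_attribute("91320000608950986L")
text_attribute = match_code_attribute("future technology")
```

Returning `None` rather than raising keeps the fall-through to step 202 explicit in the calling code.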
Step 202: The preprocessed initial query statement is input into the text classifier, which outputs the corresponding predicted attribute category.
For example:
Input "future technology"; the classifier outputs "enterprise name".
Input "Zhang San"; the classifier outputs "person name".
Step three: construct a target query statement, call a preset search engine interface, and match a query result according to the target query statement.
Specifically, based on the keyword attribute pair obtained in the previous step, a query statement (i.e., a target query statement) adapted to the data index of the underlying search engine is constructed, and a unified interface of the search engine is called to obtain a query data result.
As a preferred implementation manner, in the embodiment of the present invention, a search intention identification system and apparatus for enterprise information search may also be constructed in advance on the basis of the search intention identification module, so as to support query input of multiple attributes during enterprise information search and to retrieve different attribute information according to the attribute category returned by the search intention identification module.
Example two
FIG. 4 is a flow diagram illustrating a vertical search based query method according to an exemplary embodiment, and referring to FIG. 4, the method includes the steps of:
S1: performing regular matching on the received initial query statement, acquiring a first statement which meets the matching rule in the initial query statement, and determining a first attribute category corresponding to the first statement.
S2: preprocessing a second sentence which does not meet the matching rule in the initial query sentence to obtain a keyword corresponding to the second sentence, and classifying each keyword by using a pre-trained classification model to obtain a second attribute category of each keyword.
Specifically, in the embodiment of the present invention, any form of character input is accepted, that is, the initial query statement is not limited, so that the query input character string needs to be preprocessed, different inputs are determined, and the attribute of the input character string is determined and output.
Specifically, in order to improve the precision and the query efficiency of the search query, in the embodiment of the present invention, the search intention of the user is identified according to the received initial query sentence, and in the specific implementation, the keywords included in the second sentence may be extracted first, and then each keyword is classified by using a pre-trained classification model, so as to obtain the attribute category of each keyword.
S3: and generating a target query statement according to the first statement, the first attribute category, the keyword and the second attribute category.
Specifically, based on the first sentence, the first attribute type, the keyword and the corresponding second attribute type obtained in the above steps, a query sentence adapted to the data index of the underlying search engine is constructed.
S4: and calling a preset search engine interface, and matching a query result according to the target query statement.
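Steps S1 to S4 can be tied together as in the following sketch. It rests on several assumptions that are not part of the embodiment: the query is split on whitespace instead of being segmented by a Chinese word segmenter, a single simplified code pattern stands in for the matching rules, the target query is represented as a plain list of (text, attribute) pairs, and the classifier and search engine interface are injected as callables so that stubs can replace them.

```python
import re
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (text, attribute category)

# Assumed, simplified format of an 18-character code.
CODE_PATTERN = re.compile(r"[0-9A-Z]{18}")


def run_query(
    initial_query: str,
    classify: Callable[[str], str],
    search: Callable[[List[Pair]], List[Dict]],
) -> List[Dict]:
    first_pairs: List[Pair] = []
    remainder: List[str] = []
    # S1: regular-expression matching on each whitespace-separated part.
    for part in initial_query.split():
        if CODE_PATTERN.fullmatch(part.upper()):
            first_pairs.append((part.upper(), "unified social credit code"))
        else:
            remainder.append(part)
    # S2: each unmatched part is treated as one keyword and classified.
    keyword_pairs = [(keyword, classify(keyword)) for keyword in remainder]
    # S3: the target query is represented here as the combined pair list.
    target_query = first_pairs + keyword_pairs
    # S4: hand the target query to the search engine interface.
    return search(target_query)


def stub_classify(keyword: str) -> str:
    """Stand-in for the trained classification model."""
    return "enterprise name" if keyword.endswith("Ltd") else "person name"


def stub_search(target_query: List[Pair]) -> List[Dict]:
    """Stand-in for the search engine interface; echoes the query."""
    return [{"matched_on": target_query}]


results = run_query("91320000608950986L Zhang", stub_classify, stub_search)
```

Injecting the classifier and the search interface keeps the control flow of S1 to S4 testable without a trained model or a running search engine.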
As a preferred implementation manner, in an embodiment of the present invention, the preprocessing the second statement that does not satisfy the matching rule in the initial query statement, and acquiring the keyword corresponding to the second statement includes:
performing word segmentation processing on a second sentence which does not meet the matching rule in the initial query sentence to obtain a word segmentation result;
and determining the keywords of the second sentence according to the word segmentation result and a preset rule.
Specifically, in the embodiment of the present invention, a keyword matching rule is predefined, and the segmentation result is matched according to the keyword matching rule, so that the segmentation result meeting the requirement is obtained as the keyword.
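The embodiment does not spell out the keyword matching rule, so the following sketch assumes one possible rule purely for illustration: a token is kept as a keyword when it is at least two characters long, is not in a stop-word list, and has not already been selected. The stop-word list and function name are hypothetical.

```python
from typing import Iterable, List

# Hypothetical stop words that carry no search intent.
STOPWORDS = {"the", "of", "a", "search", "find"}


def select_keywords(tokens: Iterable[str], min_length: int = 2) -> List[str]:
    """Keep tokens that satisfy the assumed keyword rule, in input order."""
    keywords: List[str] = []
    for token in tokens:
        cleaned = token.strip()
        if len(cleaned) < min_length:
            continue
        if cleaned.lower() in STOPWORDS:
            continue
        if cleaned in keywords:
            continue
        keywords.append(cleaned)
    return keywords


keywords = select_keywords(["find", "Future", "Technology", "of", "Zhang", "Future"])
```

The token list would normally come from a word segmentation step; here it is written out by hand.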
As a preferred implementation manner, in the embodiment of the present invention, before performing word segmentation processing on a second statement that does not satisfy a matching rule in the initial query statement, the method includes:
and denoising the second sentence to remove the noise characters in the second sentence.
Specifically, in order to improve the query efficiency and the query accuracy, in the embodiment of the present invention, denoising processing may be further performed on a second sentence that does not satisfy the matching rule in the initial query sentence, so as to remove noise characters in the second sentence, for example, remove useless characters and punctuations.
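A minimal denoising sketch is given below. What counts as a noise character is an assumption here: full-width characters are normalized, anything that is neither a word character nor whitespace is replaced by a space, and runs of whitespace are collapsed. The function name is hypothetical.

```python
import re
import unicodedata


def denoise(text: str) -> str:
    """Remove punctuation and other assumed noise characters from a query."""
    # Normalize full-width letters, digits, and punctuation to half-width.
    text = unicodedata.normalize("NFKC", text)
    # Replace every character that is not a word character or whitespace.
    # For str patterns, \w is Unicode-aware, so CJK characters are kept.
    text = re.sub(r"[^\w\s]", " ", text)
    # The underscore counts as a word character, so handle it separately.
    text = text.replace("_", " ")
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


cleaned = denoise("  future,, technology!!  ")
```

Replacing noise with a space instead of deleting it avoids accidentally joining two words that were separated only by punctuation.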
As a preferred implementation manner, in an embodiment of the present invention, the generating a target query statement according to the first statement, the first attribute category, the keyword, and the second attribute category includes:
generating data pairs respectively based on the first sentence and the corresponding first attribute category, the keyword and the corresponding second attribute category;
and generating a target query statement according to the data pair and a preset index rule of a search engine.
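As an illustration of combining the data pairs with an index rule, the sketch below maps each attribute category to an index field and a clause type and emits a dictionary loosely modeled on the Elasticsearch Query DSL. The embodiment does not name a search engine, so the DSL shape, the field names, and the rule table are all assumptions.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (text, attribute category)

# Hypothetical index rules: attribute category -> (index field, clause type).
# "term" stands for exact matching, "match" for full-text matching.
INDEX_RULES = {
    "unified social credit code": ("credit_code", "term"),
    "registration number": ("reg_no", "term"),
    "enterprise name": ("company_name", "match"),
    "person name": ("legal_person", "match"),
}


def build_target_query(pairs: List[Pair]) -> Dict:
    """Build one boolean query in which every data pair must match."""
    clauses = []
    for text, attribute in pairs:
        rule = INDEX_RULES.get(attribute)
        if rule is None:
            # Assumption: pairs with an unknown attribute are skipped.
            continue
        field, clause_type = rule
        clauses.append({clause_type: {field: text}})
    return {"query": {"bool": {"must": clauses}}}


target_query = build_target_query(
    [
        ("91320000608950986L", "unified social credit code"),
        ("Zhang San", "person name"),
    ]
)
```

Keeping the rules in a table means that supporting a new searchable field only requires a new table entry, not a change to the query-building code.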
As a preferred implementation manner, in an embodiment of the present invention, the method further includes a training process of the classification model, including:
acquiring training data according to a service scene;
and training a preset classifier by using the training data to obtain a trained classification model.
As a preferred implementation manner, in an embodiment of the present invention, the preset classifier includes a logistic regression classifier or a support vector machine classifier.
Fig. 5 is a schematic structural diagram illustrating a vertical search based query device according to an exemplary embodiment, where the device includes:
the matching module is used for performing regular matching on the received initial query statement, acquiring a first statement which meets a matching rule in the initial query statement, and determining a first attribute category corresponding to the first statement;
the acquisition module is used for preprocessing a second statement which does not meet the matching rule in the initial query statement to acquire a keyword corresponding to the second statement;
the classification module is used for classifying each keyword by using a pre-trained classification model to acquire a second attribute category of each keyword;
a generation module, configured to generate a target query statement according to the first statement, the first attribute category, the keyword, and the second attribute category;
and the query module is used for calling a preset search engine interface and matching a query result according to the target query statement.
As a preferred implementation manner, in an embodiment of the present invention, the obtaining module includes:
the word segmentation unit is used for performing word segmentation processing on a second sentence which does not meet the matching rule in the initial query sentence to obtain a word segmentation result;
and the matching unit is used for determining the keywords of the second sentence according to the word segmentation result and a preset rule.
As a preferred implementation manner, in an embodiment of the present invention, the apparatus further includes:
and the denoising module is used for denoising the second statement and removing the noise characters in the second statement.
As a preferred implementation manner, in an embodiment of the present invention, the generating module is specifically configured to:
generating data pairs respectively based on the first sentence and the corresponding first attribute category, the keyword and the corresponding second attribute category;
and generating a target query statement according to the data pair and a preset index rule of a search engine.
As a preferred implementation manner, in an embodiment of the present invention, the apparatus further includes:
the training module is used for acquiring training data according to the service scene; and training a preset classifier by using the training data to obtain a trained classification model.
As a preferred implementation manner, in an embodiment of the present invention, the preset classifier includes a logistic regression classifier or a support vector machine classifier.
Fig. 6 is a schematic diagram illustrating an internal configuration of a computer device according to an exemplary embodiment. As shown in Fig. 6, the computer device includes a processor, a memory, and a network interface connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the vertical search based query method.
Those skilled in the art will appreciate that the configuration shown in fig. 6 is a block diagram of only a portion of the configuration associated with aspects of the present invention and is not intended to limit the computing devices to which aspects of the present invention may be applied, and that a particular computing device may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
As a preferred implementation manner, in an embodiment of the present invention, the computer device includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor implements the following steps when executing the computer program:
performing regular matching on received initial query sentences, acquiring first sentences which meet matching rules in the initial query sentences, and determining first attribute categories corresponding to the first sentences;
preprocessing a second sentence which does not meet the matching rule in the initial query sentence, obtaining keywords corresponding to the second sentence, and classifying each keyword by using a pre-trained classification model to obtain a second attribute category of each keyword;
generating a target query statement according to the first statement, the first attribute category, the keyword and the second attribute category;
and calling a preset search engine interface, and matching a query result according to the target query statement.
As a preferred implementation manner, in the embodiment of the present invention, when the processor executes the computer program, the following steps are further implemented:
performing word segmentation processing on a second sentence which does not meet the matching rule in the initial query sentence to obtain a word segmentation result;
and determining the keywords of the second sentence according to the word segmentation result and a preset rule.
As a preferred implementation manner, in the embodiment of the present invention, when the processor executes the computer program, the following steps are further implemented:
and denoising the second sentence to remove the noise characters in the second sentence.
As a preferred implementation manner, in the embodiment of the present invention, when the processor executes the computer program, the following steps are further implemented:
generating data pairs respectively based on the first sentence and the corresponding first attribute category, the keyword and the corresponding second attribute category;
and generating a target query statement according to the data pair and a preset index rule of a search engine.
As a preferred implementation manner, in the embodiment of the present invention, when the processor executes the computer program, the following steps are further implemented:
acquiring training data according to a service scene;
and training a preset classifier by using the training data to obtain a trained classification model.
In an embodiment of the present invention, a computer-readable storage medium is further provided, on which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the following steps:
performing regular matching on received initial query sentences, acquiring first sentences which meet matching rules in the initial query sentences, and determining first attribute categories corresponding to the first sentences;
preprocessing a second sentence which does not meet the matching rule in the initial query sentence to obtain a keyword corresponding to the second sentence, and classifying each keyword by using a pre-trained classification model to obtain a second attribute category of each keyword;
generating a target query statement according to the first statement, the first attribute category, the keyword and the second attribute category;
and calling a preset search engine interface, and matching a query result according to the target query statement.
As a preferred implementation manner, in the embodiment of the present invention, when executed by the processor, the computer program further implements the following steps:
performing word segmentation processing on a second sentence which does not meet the matching rule in the initial query sentence to obtain a word segmentation result;
and determining the keywords of the second sentence according to the word segmentation result and a preset rule.
As a preferred implementation manner, in the embodiment of the present invention, when executed by the processor, the computer program further implements the following steps:
and denoising the second sentence to remove the noise characters in the second sentence.
As a preferred implementation manner, in the embodiment of the present invention, when executed by the processor, the computer program further implements the following steps:
generating data pairs respectively based on the first sentence and the corresponding first attribute category, the keyword and the corresponding second attribute category;
and generating a target query statement according to the data pair and a preset index rule of a search engine.
As a preferred implementation manner, in the embodiment of the present invention, when executed by the processor, the computer program further implements the following steps:
acquiring training data according to a service scene;
and training a preset classifier by using the training data to obtain a trained classification model.
In summary, the technical solution provided by the embodiment of the present invention has the following beneficial effects:
The query method, apparatus, computer device, and storage medium based on vertical search provided by the embodiments of the present invention perform regular-expression matching on a received initial query statement to obtain a first statement that satisfies a matching rule, and determine a first attribute category corresponding to the first statement; preprocess a second statement that does not satisfy the matching rule to obtain keywords corresponding to the second statement; classify each keyword with a pre-trained classification model to obtain a second attribute category of each keyword; generate a target query statement according to the first statement, the first attribute category, the keywords, and the second attribute categories; and call a preset search engine interface to match a query result according to the target query statement. Search intention identification of query statements is thereby implemented for a vertical search engine, which improves query efficiency and user experience.
It should be noted that: in the query device based on vertical search provided in the foregoing embodiment, when triggering query service, only the division of the functional modules is illustrated, and in practical application, the function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the vertical search based query device provided by the above embodiment and the vertical search based query method embodiment belong to the same concept, that is, the device is based on the vertical search based query method, and the specific implementation process thereof is detailed in the method embodiment and is not described herein again.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.