CN107885879A - Semantic analysis method, apparatus, electronic device and computer-readable storage medium - Google Patents
Semantic analysis method, apparatus, electronic device and computer-readable storage medium
- Publication number
- CN107885879A CN107885879A CN201711230879.9A CN201711230879A CN107885879A CN 107885879 A CN107885879 A CN 107885879A CN 201711230879 A CN201711230879 A CN 201711230879A CN 107885879 A CN107885879 A CN 107885879A
- Authority
- CN
- China
- Prior art keywords
- word
- words
- preset
- data
- candidate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiments of the present disclosure disclose a semantic analysis method, an apparatus, an electronic device, and a computer-readable storage medium. The semantic analysis method includes: obtaining a candidate word set; calculating, for each word in the candidate word set, a probability value that the word is a preset word; and confirming words whose probability values meet a preset condition as target words. The disclosure can assist in accurately identifying user intent, improve the retrieval hit rate, effectively improve the service quality of merchants or service providers, and enhance user experience.
Description
Technical Field
The present disclosure relates to the field of information processing technologies, and in particular, to a semantic analysis method, an apparatus, an electronic device, and a computer-readable storage medium.
Background
With the development of internet technology, more and more merchants and service providers serve users through internet platforms, seeking to improve service quality, enhance user experience, and win more user orders, thereby making better use of existing resources and creating more value. However, when users currently use the search services provided by merchants or service providers, the hit rate of the search results often fails to meet their needs, which weakens the user experience.
Disclosure of Invention
The embodiments of the present disclosure provide a semantic analysis method, a semantic analysis apparatus, an electronic device, and a computer-readable storage medium.
In a first aspect, a semantic analysis method is provided in the embodiments of the present disclosure.
Specifically, the semantic analysis method includes:
acquiring a candidate word set;
calculating the probability value of the word in the candidate word set as a preset word;
and confirming the words with the probability values meeting the preset conditions as target words.
With reference to the first aspect, in a first implementation manner of the first aspect, the obtaining a candidate word set includes:
acquiring an input character string;
segmenting the input character string to obtain candidate words;
and generating a candidate word set based on the obtained candidate words.
With reference to the first aspect, in a first implementation manner of the first aspect, the calculating a probability value of a word in the candidate word set as a preset word includes:
determining characteristic data;
acquiring training word data;
training based on the feature data and the training word data to obtain a weight value of the feature data;
and calculating the probability value of the candidate word as a preset word based on the weight value of the characteristic data.
With reference to the first aspect, in a first implementation manner of the first aspect, the feature data includes one or more of: the number of times the word w appears in the current input character string, the number of times the word w appears in input character strings within a preset historical time period, the adjacent words of the word w, the parts of speech of the adjacent words, and whether the word w is a preset name.
With reference to the first aspect, in a first implementation manner of the first aspect, the training word data includes positive sample words and negative sample words.
With reference to the first aspect, in a first implementation manner of the first aspect, the acquiring training word data includes:
performing preset operation on the words to obtain preset operation data;
calculating the matching degree between the words and preset operation data;
determining the words with the matching degree higher than or equal to a preset matching degree threshold value as positive sample words, and determining the words with the matching degree lower than the preset matching degree threshold value as negative sample words.
With reference to the first aspect, in a first implementation manner of the first aspect, the training to obtain the weight value of the feature data based on the feature data and the training word data includes:
training based on the feature data and the training word data to obtain a feature weight prediction model;
and predicting the corresponding weight of the feature data based on the feature weight prediction model.
With reference to the first aspect, in a first implementation manner of the first aspect, the probability value p(w) that the candidate word w is a preset word is calculated based on the weight values of the feature data by using the following formula:

p(w) = 1 / (1 + exp(-Σ_i λ_i f_i))

where f_i represents the i-th feature in the feature data and λ_i denotes the weight value corresponding to the i-th feature f_i.
With reference to the first aspect, in a first implementation manner of the first aspect, the determining, as a target word, a word whose probability value meets a preset condition includes:
determining the words with the probability values larger than a preset probability threshold value as target words.
With reference to the first aspect and the first implementation manner of the first aspect, in a second implementation manner of the first aspect, the method further includes: and executing preset operation on the target words.
In a second aspect, a semantic analysis apparatus is provided in the embodiments of the present disclosure.
Specifically, the semantic analysis device includes:
an obtaining module configured to obtain a candidate word set;
the computing module is configured to compute a probability value of a word in the candidate word set as a preset word;
a confirming module configured to confirm the words with the probability values meeting preset conditions as target words.
With reference to the second aspect, in a first implementation manner of the second aspect, the obtaining module includes:
a first obtaining submodule configured to obtain an input character string;
the segmentation submodule is configured to segment the input character string to obtain candidate words;
a generation submodule configured to generate a set of candidate words based on the obtained candidate words.
With reference to the second aspect, in a first implementation manner of the second aspect, the computing module includes:
a determination submodule configured to determine feature data;
a second obtaining submodule configured to obtain training word data;
the training submodule is configured to train on the basis of the feature data and the training word data to obtain a weight value of the feature data;
and the calculating submodule is configured to calculate the probability value of the candidate word being a preset word based on the weight value of the characteristic data.
With reference to the second aspect, in a first implementation manner of the second aspect, the feature data includes one or more of: the number of times the word w appears in the current input character string, the number of times the word w appears in input character strings within a preset historical time period, the adjacent words of the word w, the parts of speech of the adjacent words, and whether the word w is a preset name.
With reference to the second aspect, in a first implementation of the second aspect, the training word data includes positive sample words and negative sample words.
With reference to the second aspect, in a first implementation manner of the second aspect, the second obtaining sub-module includes:
the execution unit is configured to execute preset operation on the words to obtain preset operation data;
the calculation unit is configured to calculate the matching degree between the words and preset operation data;
and the determining unit is configured to determine the words with the matching degrees higher than or equal to a preset matching degree threshold value as the positive sample words, and determine the words with the matching degrees lower than the preset matching degree threshold value as the negative sample words.
With reference to the second aspect, in a first implementation manner of the second aspect, the training submodule includes:
the training unit is configured to perform training based on the feature data and the training word data to obtain a feature weight prediction model;
a prediction unit configured to predict a weight corresponding to the feature data based on the feature weight prediction model.
With reference to the second aspect, in a first implementation manner of the second aspect, the calculating submodule is configured to calculate the probability value p(w) of the candidate word w being a preset word, based on the weight values of the feature data, by using the following formula:

p(w) = 1 / (1 + exp(-Σ_i λ_i f_i))

where f_i represents the i-th feature in the feature data and λ_i denotes the weight value corresponding to the i-th feature f_i.
With reference to the second aspect, in a first implementation manner of the second aspect, the confirming module is configured to determine words whose probability values are greater than a preset probability threshold as target words.
With reference to the second aspect and the first implementation manner of the second aspect, in a second implementation manner of the second aspect, the apparatus further includes: and the execution module is configured to execute preset operation on the target words.
In a third aspect, the disclosed embodiments provide an electronic device, including a memory for storing one or more computer instructions that support a semantic analysis apparatus in executing the semantic analysis method of the first aspect, and a processor configured to execute the computer instructions stored in the memory. The electronic device may further include a communication interface for communicating with other devices or a communication network.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium for storing the computer instructions used by a semantic analysis apparatus, including the computer instructions involved in performing the semantic analysis method of the first aspect.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
according to the technical scheme, whether the words input by the user are preset words such as important words or not is determined through analysis, and a corresponding retrieval strategy is formulated to carry out retrieval or other preset operations, so that accurate user intention identification can be facilitated, the service quality of merchants or service providers is effectively improved, the user experience is enhanced, more users are won, and more values are created for the merchants or the service providers.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
Other features, objects, and advantages of the present disclosure will become more apparent from the following detailed description of non-limiting embodiments when taken in conjunction with the accompanying drawings. In the drawings:
FIG. 1 illustrates a flow diagram of a semantic analysis method according to an embodiment of the present disclosure;
FIG. 2 shows a flow chart of step S101 according to the embodiment shown in FIG. 1;
FIG. 3 shows a flowchart of step S102 according to the embodiment shown in FIG. 1;
FIG. 4 shows a flowchart of step S302 according to the embodiment shown in FIG. 3;
FIG. 5 shows a flowchart of step S303 according to the embodiment shown in FIG. 3;
FIG. 6 shows a block diagram of a semantic analysis apparatus according to an embodiment of the present disclosure;
FIG. 7 is a block diagram of the obtaining module 601 according to the embodiment shown in FIG. 6;
FIG. 8 is a block diagram of the calculating module 602 according to the embodiment shown in FIG. 6;
FIG. 9 is a block diagram of the second obtaining submodule 802 according to the embodiment shown in FIG. 8;
FIG. 10 is a block diagram of the training submodule 803 according to the embodiment shown in FIG. 8;
FIG. 11 shows a block diagram of an electronic device according to an embodiment of the present disclosure;
FIG. 12 is a schematic block diagram of a computer system suitable for use in implementing a semantic analysis method according to one embodiment of the present disclosure.
Detailed Description
Hereinafter, exemplary embodiments of the present disclosure will be described in detail with reference to the accompanying drawings so that those skilled in the art can easily implement them. Also, for the sake of clarity, parts not relevant to the description of the exemplary embodiments are omitted in the drawings.
In the present disclosure, it is to be understood that terms such as "including" or "having," etc., are intended to indicate the presence of the disclosed features, numbers, steps, behaviors, components, parts, or combinations thereof, and are not intended to preclude the possibility that one or more other features, numbers, steps, behaviors, components, parts, or combinations thereof may be present or added.
It should be further noted that the embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
According to the technical solution provided by the embodiments of the present disclosure, whether a word input by the user is a preset word (such as an important word) is determined through analysis, and a corresponding retrieval strategy is then formulated for retrieval or other preset operations. This facilitates accurate identification of user intent, effectively improves the service quality of merchants or service providers, enhances user experience, wins more users, and creates more value for the merchants or service providers.
The technical solution of the present disclosure can be used for word-based operations such as retrieval, search, and matching. For convenience of description, retrieval is taken as an example below to describe the technical solution in detail.
Fig. 1 shows a flow diagram of a semantic analysis method according to an embodiment of the present disclosure. As shown in fig. 1, the semantic analysis method includes the following steps S101 to S103:
in step S101, a candidate word set is obtained;
in step S102, calculating a probability value of a word in the candidate word set as a preset word;
in step S103, the word whose probability value meets the preset condition is determined as the target word.
At present, when a user uses a retrieval service provided by a merchant or service provider, the merchant or service provider generally uses the character string input by the user directly as the retrieval object. The retrieval results therefore naturally contain a lot of noise, and the content the user actually wants cannot be retrieved accurately; that is, the hit rate of the retrieval results cannot meet the user's needs, which reduces the quality of service and weakens the user experience.
This embodiment provides a semantic analysis method that determines, through analysis, whether a word input by the user is a preset word such as an important word, so as to assist in subsequently formulating a corresponding retrieval strategy for retrieval or other preset operations. Specifically, a candidate word set is first obtained; then, for each word in the candidate word set, a probability value that the word is a preset word is calculated; finally, words whose probability values meet a preset condition are determined as target words, based on which retrieval or other preset operations may subsequently be performed. This solution can improve the hit rate of retrieval results, improve the service quality of merchants or service providers, and enhance user experience.
In an optional implementation manner of this embodiment, as shown in fig. 2, the step S101, that is, the step of obtaining the candidate word set, includes steps S201 to S203:
in step S201, an input character string is acquired;
in step S202, the input character string is segmented to obtain candidate words;
in step S203, a candidate word set is generated based on the obtained candidate words.
Generally, it cannot be predicted in advance which word or words in a user's query are most effective for retrieval, so the more effective words must be extracted and judged from the character string the user inputs. Specifically, in this embodiment, after a user inputs a character string, the input character string is segmented to obtain candidate words, and these candidate words constitute a candidate word set, which serves as the basis for subsequently analyzing the retrieval effectiveness or importance of individual words.
There are many methods for segmenting a character string into words; those skilled in the art can select one according to the needs of the practical application, and the present disclosure does not specifically limit the segmentation method used.
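For illustration, a minimal sketch of steps S201 to S203 in Python follows. It assumes the open-source jieba tokenizer purely as a stand-in for whichever segmentation method is chosen; the disclosure does not prescribe one.

```python
# Sketch of candidate word set construction (steps S201-S203).
# jieba is only one possible segmenter; the disclosure leaves the choice open.
import jieba


def build_candidate_set(input_string):
    tokens = jieba.lcut(input_string)        # S202: segment the input string
    return {t for t in tokens if t.strip()}  # S203: deduplicate into a set


candidates = build_candidate_set("附近好吃的川菜馆")  # S201: user input string
```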
In an optional implementation manner of this embodiment, as shown in fig. 3, the step S102, namely, the step of calculating the probability value of a word in the candidate word set as a preset word, includes steps S301 to S304:
in step S301, feature data is determined;
in step S302, training word data is acquired;
in step S303, training based on the feature data and the training word data to obtain a weight value of the feature data;
in step S304, a probability value of the candidate word being a preset word is calculated based on the weight value of the feature data.
In this implementation manner, a probability value that a candidate word in the candidate word set is a preset word is calculated by using a model training method, for example, a probability value that a certain candidate word is an important word with higher importance for retrieval is calculated. Specifically, feature data to be used is determined, training word data is obtained, then a weight value of the feature data is obtained based on the feature data and the training word data, and finally a probability value of a candidate word being a preset word is calculated based on the weight value of the feature data.
The feature data may include one or more of: the number of times the word w appears in the character string currently input by the user; the number of times the word w appears in the character strings input by the user, or by all users of an internet platform, within a preset historical time period, such as the past month or year; the adjacent words of the word w; the parts of speech of the word w and of its adjacent words; and whether the word w is a preset name, for example a merchant/service provider name or a product/service name. The part of speech may be, for example, a noun, verb, adjective, or adverb.
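As a sketch, the per-word feature vector might be assembled as follows. The feature inventory and the encodings here (for instance, one-hot "is a noun" flags) are illustrative assumptions, since the disclosure leaves the exact features open.

```python
from collections import Counter


def extract_features(w, query_tokens, history_counts, pos_of, preset_names):
    """Build the feature vector [f_1, ..., f_n] for word w (step S301)."""
    i = query_tokens.index(w)
    left = query_tokens[i - 1] if i > 0 else None
    right = query_tokens[i + 1] if i < len(query_tokens) - 1 else None
    return [
        Counter(query_tokens)[w],                 # occurrences in the current input
        history_counts.get(w, 0),                 # occurrences in historical inputs
        1.0 if pos_of.get(w) == "noun" else 0.0,  # part of speech of w itself
        1.0 if left and pos_of.get(left) == "noun" else 0.0,    # left neighbour POS
        1.0 if right and pos_of.get(right) == "noun" else 0.0,  # right neighbour POS
        1.0 if w in preset_names else 0.0,        # w is a merchant/product name
    ]
```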
Wherein the training word data includes positive sample words and negative sample words.
In an optional implementation manner of this embodiment, as shown in fig. 4, the step S302, that is, the step of acquiring training word data, includes steps S401 to S403:
in step S401, performing a preset operation on the word to obtain preset operation data;
in step S402, calculating a matching degree between the word and preset operation data;
in step S403, words whose matching degree is higher than or equal to a preset matching degree threshold are determined as positive sample words, and words whose matching degree is lower than the preset matching degree threshold are determined as negative sample words.
To improve the prediction accuracy of the trained model, appropriate training data must be selected. In this embodiment, the model training data is selected based on the matching degree between a word and the result of a preset operation. Specifically, a preset operation such as a search is first performed on a large number of words to obtain preset operation data, i.e., search results; the matching degree between each word and its search results is then calculated. Words whose matching degree is higher than or equal to a preset matching-degree threshold can be used as positive sample words, while words whose matching degree is below the threshold can be used as negative sample words. The rationale is that when a user inputs a character string to search and then clicks an entry in the results, that entry can, to a large extent, be taken to meet the user's search need.
The matching degree threshold value can be set according to the requirements of practical application, and the specific value of the matching degree threshold value is not specifically limited by the disclosure.
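A sketch of this labelling procedure (steps S401 to S403) follows. `run_search` and `matching_degree` are assumed callables standing in for the preset operation and the unspecified matching-degree measure (for instance, click-through agreement); the default threshold is illustrative.

```python
def label_training_words(words, run_search, matching_degree, threshold=0.5):
    """Split words into positive/negative sample words by matching degree."""
    positives, negatives = [], []
    for w in words:
        results = run_search(w)              # S401: perform the preset operation
        score = matching_degree(w, results)  # S402: word vs. operation data
        if score >= threshold:               # S403: compare with the threshold
            positives.append(w)
        else:
            negatives.append(w)
    return positives, negatives
```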
In an optional implementation manner of this embodiment, as shown in fig. 5, the step S303, namely the step of obtaining the weight value of the feature data based on the feature data and training word data training, includes steps S501 to S502:
in step S501, training is performed based on the feature data and the training word data to obtain a feature weight prediction model;
in step S502, the weight corresponding to the feature data is predicted based on the feature weight prediction model.
As mentioned above, the present disclosure considers many types of feature data, each of which can be used to characterize the importance of a word. For example, the more times a word w appears in the character string currently being input, the more important the word tends to be. In some cases, the less frequently the word w appears in the character strings input by the user, or by all users of an internet platform, within a preset historical period such as the past month or year, the more distinctive the word w is, and hence the greater its importance. In some cases, nouns are more important than verbs, adjectives, and adverbs. In some scenarios, the adjacent words of the word w, and whether the word w is a preset name, also affect its importance. These features, however, contribute differently to word importance; that is, when they are used to characterize the importance of a word, the different features should not be weighted equally but should carry different weights.
Therefore, in this embodiment, the optimal weight assignment for these features is predicted or estimated with a trained model. For example, the probability value may be estimated with a logistic regression model, which is simple, efficient, and widely used in machine learning: a feature weight prediction model is trained with logistic regression on the feature data and the training word data, and this model then yields a set of feature weight values corresponding to the feature data. Usually an optimal set of weight values can be obtained through an optimization algorithm.
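A sketch of steps S501 and S502 using scikit-learn's logistic regression follows; the disclosure names logistic regression but no particular library, and `feature_rows`/`labels` are hypothetical placeholders assumed to come from the feature extraction and labelling steps above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: one feature vector per training word
# (e.g. from extract_features above) and a 0/1 label per word
# (1 = positive sample word, 0 = negative sample word).
X = np.asarray(feature_rows, dtype=float)
y = np.asarray(labels)

model = LogisticRegression()  # S501: train the feature weight prediction model
model.fit(X, y)
lambdas = model.coef_[0]      # S502: one learned weight lambda_i per feature f_i
```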
In an optional implementation manner of this embodiment, step S304 may be implemented to calculate the probability value p(w) that the candidate word w is a preset word, based on the weight values of the feature data, by using the following formula:

p(w) = 1 / (1 + exp(-Σ_i λ_i f_i))

where f_i represents the i-th feature in the feature data and λ_i denotes the weight value corresponding to the i-th feature f_i.
In an optional implementation manner of this embodiment, step S103, i.e., determining words whose probability values meet the preset condition as target words, may be implemented as determining words whose probability values are greater than a preset probability threshold as target words.
In this embodiment, a candidate word w whose probability value p(w) of being a preset word exceeds the preset probability threshold can be regarded as a relatively important word that is relatively effective for preset operations such as retrieval. Such a word can therefore be used as a target word in subsequent preset operations such as retrieval, improving the retrieval hit rate.
The probability threshold value can be set according to the needs of practical application, and the specific value is not specifically limited by the disclosure.
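Putting steps S304 and S103 together, a minimal sketch follows. It implements exactly the formula above; `feature_of` is a hypothetical mapping from each candidate word to its feature vector, and the 0.8 default threshold is illustrative, since the disclosure leaves the value configurable.

```python
import math


def preset_word_probability(features, lambdas):
    """p(w) = 1 / (1 + exp(-sum_i lambda_i * f_i)), i.e. step S304."""
    z = sum(l * f for l, f in zip(lambdas, features))
    return 1.0 / (1.0 + math.exp(-z))


def select_target_words(candidates, feature_of, lambdas, threshold=0.8):
    """Step S103: keep the candidates whose probability clears the threshold."""
    return [w for w in candidates
            if preset_word_probability(feature_of[w], lambdas) > threshold]
```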
In an optional implementation manner of this embodiment, the method further includes a step of performing a preset operation on the target word, where the preset operation includes: one or more of retrieving, searching, matching.
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods.
Fig. 6 shows a block diagram of a semantic analysis apparatus according to an embodiment of the present disclosure, which may be implemented as part or all of an electronic device by software, hardware, or a combination of the two. As shown in fig. 6, the semantic analysis device includes:
an obtaining module 601 configured to obtain a candidate word set;
a calculating module 602 configured to calculate a probability value of a word in the candidate word set as a preset word;
a confirming module 603 configured to confirm the word with the probability value meeting a preset condition as a target word.
At present, when a user uses a retrieval service provided by a merchant or service provider, the merchant or service provider generally uses the character string input by the user directly as the retrieval object. The retrieval results therefore naturally contain a lot of noise, and the content the user actually wants cannot be retrieved accurately; that is, the hit rate of the retrieval results cannot meet the user's needs, which reduces the quality of service and weakens the user experience.
This embodiment provides a semantic analysis apparatus that determines, through analysis, whether a word input by the user is a preset word such as an important word, so as to assist in subsequently formulating a corresponding retrieval strategy for retrieval or other preset operations. Specifically, the obtaining module 601 first obtains a candidate word set; the calculating module 602 then calculates, for each word in the candidate word set, a probability value that the word is a preset word; finally, the confirming module 603 determines words whose probability values meet a preset condition as target words, based on which the apparatus may subsequently perform retrieval or other preset operations. This solution can improve the hit rate of retrieval results, improve the service quality of merchants or service providers, and enhance user experience.
In an optional implementation manner of this embodiment, as shown in fig. 7, the obtaining module 601 includes:
a first obtaining sub-module 701 configured to obtain an input character string;
a segmentation submodule 702 configured to segment the input character string to obtain candidate words;
a generating sub-module 703 configured to generate a set of candidate words based on the obtained candidate words.
Generally, it cannot be predicted in advance which word or words in a user's query are most effective for retrieval, so the more effective words must be extracted and judged from the character string the user inputs. Specifically, in this embodiment, after the first obtaining submodule 701 obtains the character string input by the user, the segmentation submodule 702 segments it into candidate words, and the generation submodule 703 combines these candidate words into a candidate word set, which serves as the basis for subsequently analyzing the retrieval effectiveness or importance of individual words.
There are many ways to segment a character string into words; those skilled in the art can select one according to the needs of the practical application, and the present disclosure does not specifically limit the segmentation method used.
In an optional implementation manner of this embodiment, as shown in fig. 8, the calculating module 602 includes:
a determination sub-module 801 configured to determine feature data;
a second obtaining sub-module 802 configured to obtain training word data;
a training submodule 803 configured to train to obtain a weight value of the feature data based on the feature data and the training word data;
the calculating submodule 804 is configured to calculate a probability value of the candidate word being a preset word based on the weight value of the feature data.
In this implementation, the calculating module 602 calculates a probability value of a candidate word in the candidate word set being a preset word by using a model training method, for example, calculates a probability value of a candidate word being an important word with higher importance for retrieval. Specifically, the determining submodule 801 determines feature data to be used, the second obtaining submodule 802 obtains training word data, the training submodule 803 obtains a weight value of the feature data based on the feature data and the training word data, and the calculating submodule 804 calculates a probability value of a candidate word being a preset word based on the weight value of the feature data.
The feature data may include one or more of: the number of times the word w appears in the character string currently input by the user; the number of times the word w appears in the character strings input by the user, or by all users of an internet platform, within a preset historical time period, such as the past month or year; the adjacent words of the word w; the parts of speech of the word w and of its adjacent words; and whether the word w is a preset name, for example a merchant/service provider name or a product/service name. The part of speech may be, for example, a noun, verb, adjective, or adverb.
Wherein the training word data includes positive sample words and negative sample words.
In an optional implementation manner of this embodiment, as shown in fig. 9, the second obtaining sub-module 802 includes:
an execution unit 901 configured to execute a preset operation on the word, so as to obtain preset operation data;
a calculating unit 902 configured to calculate a matching degree between the word and preset operation data;
a determining unit 903 configured to determine a word with a matching degree higher than or equal to a preset matching degree threshold as a positive sample word, and determine a word with a matching degree lower than the preset matching degree threshold as a negative sample word.
To improve the prediction accuracy of the trained model, appropriate training data must be selected. In this embodiment, the second obtaining submodule 802 selects the model training data based on the matching degree between words and the results of a preset operation. Specifically, the execution unit 901 first performs a preset operation such as a search on a large number of words to obtain preset operation data, i.e., search results; the calculating unit 902 then calculates the matching degree between each word and its search results. The determining unit 903 regards words whose matching degree is higher than or equal to a preset matching-degree threshold as positive sample words and, conversely, words whose matching degree is below the threshold as negative sample words. The rationale is that when a user inputs a character string to search and then clicks an entry in the results, that entry can, to a large extent, be taken to meet the user's search need.
The matching degree threshold value can be set according to the requirements of practical application, and the specific value of the matching degree threshold value is not specifically limited by the disclosure.
In an optional implementation manner of this embodiment, as shown in fig. 10, the training sub-module 803 includes:
a training unit 1001 configured to perform training based on the feature data and training word data to obtain a feature weight prediction model;
a prediction unit 1002 configured to predict a weight corresponding to the feature data based on the feature weight prediction model.
As mentioned above, the present disclosure considers many types of feature data, each of which can be used to characterize the importance of a word. For example, the more times a word w appears in the character string currently being input, the more important the word tends to be. In some cases, the less frequently the word w appears in the character strings input by the user, or by all users of an internet platform, within a preset historical period such as the past month or year, the more distinctive the word w is, and hence the greater its importance. In some cases, nouns are more important than verbs, adjectives, and adverbs. In some scenarios, the adjacent words of the word w, and whether the word w is a preset name, also affect its importance. These features, however, contribute differently to word importance; that is, when they are used to characterize the importance of a word, the different features should not be weighted equally but should carry different weights.
Therefore, in this embodiment, the training submodule 803 predicts or estimates the optimal weight assignment for these features with a trained model. For example, the probability value may be estimated with a logistic regression model, which is simple, efficient, and widely used in machine learning: the training unit 1001 trains a feature weight prediction model with logistic regression on the feature data and the training word data, and the prediction unit 1002 uses this model to obtain a set of feature weight values corresponding to the feature data. Usually an optimal set of weight values can be obtained through an optimization algorithm.
In an optional implementation manner of this embodiment, the calculating submodule 804 may be configured to calculate the probability value p(w) that the candidate word w is a preset word, based on the weight values of the feature data, by using the following formula:

p(w) = 1 / (1 + exp(-Σ_i λ_i f_i))

where f_i represents the i-th feature in the feature data and λ_i denotes the weight value corresponding to the i-th feature f_i.
In an optional implementation manner of this embodiment, the confirming module 603 may be configured to determine words with the probability value greater than a preset probability threshold as target words.
In this embodiment, a candidate word w whose probability value p(w) of being a preset word exceeds the preset probability threshold can be regarded as a relatively important word that is relatively effective for preset operations such as retrieval. Such a word can therefore be used as a target word in subsequent preset operations such as retrieval, improving the retrieval hit rate.
The probability threshold value can be set according to the needs of practical application, and the specific value is not specifically limited by the disclosure.
In an optional implementation manner of this embodiment, the apparatus further includes:
an execution module configured to execute preset operations on the target words, wherein the preset operations include: one or more of retrieving, searching, matching.
The present disclosure also discloses an electronic device, fig. 11 shows a block diagram of an electronic device according to an embodiment of the present disclosure, and as shown in fig. 11, the electronic device 1100 includes a memory 1101 and a processor 1102; wherein,
the memory 1101 is configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor 1102 to implement:
acquiring a candidate word set;
calculating the probability value of the word in the candidate word set as a preset word;
and confirming the words with the probability values meeting the preset conditions as target words.
The one or more computer instructions are further executable by the processor 1102 to implement:
the obtaining of the candidate word set includes:
acquiring an input character string;
segmenting the input character string to obtain candidate words;
and generating a candidate word set based on the obtained candidate words.
The calculating the probability value of the word in the candidate word set as a preset word comprises:
determining characteristic data;
acquiring training word data;
training based on the feature data and the training word data to obtain a weight value of the feature data;
and calculating the probability value of the candidate word as a preset word based on the weight value of the characteristic data.
The feature data includes one or more of: the number of times the word w appears in the current input character string, the number of times the word w appears in input character strings within a preset historical time period, the adjacent words of the word w, the parts of speech of the adjacent words, and whether the word w is a preset name.
The training word data includes positive sample words and negative sample words.
The obtaining training word data includes:
performing preset operation on the words to obtain preset operation data;
calculating the matching degree between the words and preset operation data;
determining the words with the matching degree higher than or equal to a preset matching degree threshold value as positive sample words, and determining the words with the matching degree lower than the preset matching degree threshold value as negative sample words.
The training based on the feature data and the training word data to obtain the weight value of the feature data comprises:
training based on the feature data and the training word data to obtain a feature weight prediction model;
and predicting the corresponding weight of the feature data based on the feature weight prediction model.
The probability value p(w) of the candidate word w being a preset word is calculated based on the weight values of the feature data by using the following formula:

p(w) = 1 / (1 + exp(-Σ_i λ_i f_i))

where f_i represents the i-th feature in the feature data and λ_i denotes the weight value corresponding to the i-th feature f_i.
The determining the words with the probability values meeting the preset conditions as target words comprises the following steps:
determining the words with the probability values larger than a preset probability threshold value as target words.
Further comprising:
and executing preset operation on the target words.
FIG. 12 is a schematic diagram of a computer system suitable for use in implementing a semantic analysis method according to an embodiment of the present disclosure.
As shown in FIG. 12, the computer system 1200 includes a central processing unit (CPU) 1201, which can execute the various processes of the embodiments shown in FIGS. 1 to 5 above according to a program stored in a read-only memory (ROM) 1202 or a program loaded from a storage section 1208 into a random access memory (RAM) 1203. The RAM 1203 also stores the various programs and data necessary for the operation of the system 1200. The CPU 1201, ROM 1202, and RAM 1203 are connected to one another by a bus 1204. An input/output (I/O) interface 1205 is also connected to the bus 1204.
The following components are connected to the I/O interface 1205: an input section 1206 including a keyboard, a mouse, and the like; an output section 1207 including a display device such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker; a storage section 1208 including a hard disk and the like; and a communication section 1209 including a network interface card such as a LAN card or a modem. The communication section 1209 performs communication processing via a network such as the internet. A drive 1210 is also connected to the I/O interface 1205 as needed. A removable medium 1211, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1210 as necessary, so that a computer program read from it is installed into the storage section 1208 as needed.
In particular, according to embodiments of the present disclosure, the methods described above with reference to FIGS. 1 to 5 may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a computer-readable medium, the computer program containing program code for performing the semantic analysis method of FIGS. 1 to 5. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1209, and/or installed from the removable medium 1211.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present disclosure may be implemented by software or hardware. The units or modules described may also be provided in a processor, and the names of the units or modules do not in some cases constitute a limitation of the units or modules themselves.
As another aspect, the present disclosure also provides a computer-readable storage medium, which may be the computer-readable storage medium included in the apparatus in the above-described embodiment; or it may be a separate computer readable storage medium not incorporated into the device. The computer readable storage medium stores one or more programs for use by one or more processors in performing the methods described in the present disclosure.
The foregoing description covers only the preferred embodiments of the present disclosure and explains the technical principles employed. Those skilled in the art will appreciate that the scope of the invention in the present disclosure is not limited to technical solutions formed by the specific combination of the features above; it also covers other technical solutions formed by any combination of those features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with features of similar function disclosed in (but not limited to) the present disclosure.
The present disclosure discloses:
A1. A semantic analysis method, the method comprising: acquiring a candidate word set; calculating a probability value that a word in the candidate word set is a preset word; and confirming words whose probability values meet a preset condition as target words.
A2. The method of A1, wherein the obtaining a candidate word set comprises: acquiring an input character string; segmenting the input character string to obtain candidate words; and generating a candidate word set based on the obtained candidate words.
A3. The method of A1, wherein the calculating a probability value that a word in the candidate word set is a preset word comprises: determining feature data; acquiring training word data; training based on the feature data and the training word data to obtain weight values of the feature data; and calculating the probability value of a candidate word being a preset word based on the weight values of the feature data.
A4. The method of A3, wherein the feature data includes one or more of: the number of times the word w appears in the current input character string, the number of times the word w appears in input character strings within a preset historical time period, the adjacent words of the word w, the parts of speech of the adjacent words, and whether the word w is a preset name.
A5. The method of A3, wherein the training word data comprises positive sample words and negative sample words.
A6. The method of A5, wherein the acquiring training word data comprises: performing a preset operation on the words to obtain preset operation data; calculating the matching degree between the words and the preset operation data; and determining words whose matching degree is higher than or equal to a preset matching-degree threshold as positive sample words, and words whose matching degree is lower than the threshold as negative sample words.
A7. The method of A3, wherein the training based on the feature data and training word data to obtain weight values of the feature data comprises: training based on the feature data and the training word data to obtain a feature weight prediction model; and predicting the weights corresponding to the feature data based on the feature weight prediction model.
A8. The method of A3, wherein the probability value p(w) that the candidate word w is a preset word is calculated based on the weight values of the feature data by using the formula p(w) = 1 / (1 + exp(-Σ_i λ_i f_i)), where f_i represents the i-th feature in the feature data and λ_i denotes the weight value corresponding to the i-th feature f_i.
A9. The method of A1, wherein the confirming words whose probability values meet a preset condition as target words comprises: determining words whose probability values are greater than a preset probability threshold as target words.
A10. The method of A1, further comprising: performing a preset operation on the target words.
The present disclosure discloses:
B11. A semantic analysis apparatus, the apparatus comprising: an obtaining module configured to obtain a candidate word set; a calculating module configured to calculate a probability value that a word in the candidate word set is a preset word; and a confirming module configured to confirm words whose probability values meet a preset condition as target words.
B12. The apparatus of B11, wherein the obtaining module comprises: a first obtaining submodule configured to obtain an input character string; a segmentation submodule configured to segment the input character string to obtain candidate words; and a generation submodule configured to generate a candidate word set based on the obtained candidate words.
B13. The apparatus of B11, wherein the calculating module comprises: a determination submodule configured to determine feature data; a second obtaining submodule configured to obtain training word data; a training submodule configured to train, based on the feature data and the training word data, to obtain weight values of the feature data; and a calculating submodule configured to calculate the probability value of a candidate word being a preset word based on the weight values of the feature data.
B14. The apparatus of B13, wherein the feature data includes one or more of: the number of times the word w appears in the current input character string, the number of times the word w appears in input character strings within a preset historical time period, the adjacent words of the word w, the parts of speech of the adjacent words, and whether the word w is a preset name.
B15. The apparatus of B13, wherein the training word data comprises positive sample words and negative sample words.
B16. The apparatus of B15, wherein the second obtaining submodule comprises: an execution unit configured to perform a preset operation on the words to obtain preset operation data; a calculating unit configured to calculate the matching degree between the words and the preset operation data; and a determining unit configured to determine words whose matching degree is higher than or equal to a preset matching-degree threshold as positive sample words, and words whose matching degree is lower than the threshold as negative sample words.
B17. The apparatus of B13, wherein the training submodule comprises: a training unit configured to train, based on the feature data and the training word data, to obtain a feature weight prediction model; and a prediction unit configured to predict the weights corresponding to the feature data based on the feature weight prediction model.
B18. The apparatus of B13, wherein the calculating submodule is configured to calculate the probability value p(w) that the candidate word w is a preset word, based on the weight values of the feature data, by using the formula p(w) = 1 / (1 + exp(-Σ_i λ_i f_i)), where f_i represents the i-th feature in the feature data and λ_i denotes the weight value corresponding to the i-th feature f_i.
B19. The apparatus of B11, wherein the confirming module is configured to determine words whose probability values are greater than a preset probability threshold as target words.
B20. The apparatus of B11, further comprising: an execution module configured to perform a preset operation on the target words.
The present disclosure discloses C21, an electronic device comprising a memory and a processor; wherein the memory is to store one or more computer instructions, wherein the one or more computer instructions are to be executed by the processor to implement the method of any one of A1-A10.
The present disclosure also discloses D22, a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the method as recited in any of a1-a 10.
Claims (10)
1. A method of semantic analysis, the method comprising:
acquiring a candidate word set;
calculating the probability value of the word in the candidate word set as a preset word;
and confirming the words with the probability values meeting the preset conditions as target words.
2. The method of claim 1, wherein obtaining the set of candidate words comprises:
acquiring an input character string;
segmenting the input character string to obtain candidate words;
and generating a candidate word set based on the obtained candidate words.
3. The method of claim 1, wherein calculating the probability value that a word in the candidate word set is a preset word comprises:
determining feature data;
acquiring training word data;
training based on the feature data and the training word data to obtain weight values of the feature data;
and calculating the probability value that the candidate word is a preset word based on the weight values of the feature data.
4. The method of claim 3, wherein the feature data comprises one or more of: the number of times a word w appears in the current input character string, the number of times the word w appears in input character strings within a preset historical time period, words adjacent to the word w, the parts of speech of the adjacent words, and whether the word w is a preset name.
5. The method of claim 3, wherein the training word data comprises positive sample words and negative sample words.
6. The method of claim 3, wherein training based on the feature data and the training word data to obtain the weight values of the feature data comprises:
training based on the feature data and the training word data to obtain a feature weight prediction model;
and predicting the weights corresponding to the feature data based on the feature weight prediction model.
7. The method according to claim 3, wherein the probability value p(w) that a candidate word w is a preset word is calculated based on the weight values of the feature data using the following formula:

p(w) = 1 / (1 + exp(-Σ_i λ_i·f_i))

wherein f_i represents the ith feature in the feature data, and λ_i represents the weight value corresponding to the ith feature f_i.
8. A semantic analysis apparatus, characterized in that the apparatus comprises:
an obtaining module configured to obtain a candidate word set;
a computing module configured to compute a probability value that a word in the candidate word set is a preset word;
and a confirming module configured to confirm words whose probability values meet a preset condition as target words.
9. An electronic device comprising a memory and a processor; wherein,
the memory is configured to store one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the method of any one of claims 1-7.
10. A computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the method of any one of claims 1-7.
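Claim 2 leaves the segmentation method open. As one concrete possibility (an assumption, not part of the claims), the sketch below obtains a candidate word set with the open-source jieba segmenter; any tokenizer that produces candidate words would fit the same shape.

```python
import jieba  # third-party Chinese word segmenter: pip install jieba

def acquire_candidate_word_set(input_string: str) -> set:
    """Segment the input character string and collect the candidate words (claim 2)."""
    return set(jieba.lcut(input_string))

# Example: candidate words drawn from a user's input string.
print(acquire_candidate_word_set("帮我给张三发送一条消息"))
```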
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---
CN201711230879.9A | 2017-11-29 | 2017-11-29 | Semantic analysis, device, electronic equipment and computer-readable recording medium
Publications (1)
Publication Number | Publication Date |
---|---
CN107885879A (en) | 2018-04-06
Family ID: 61776158
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---
CN201711230879.9A (CN107885879A, pending) | Semantic analysis, device, electronic equipment and computer-readable recording medium | 2017-11-29 | 2017-11-29
Country Status (1)
Country | Link |
---|---
CN (1) | CN107885879A (en)
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
US8326861B1 (en) * | 2010-06-23 | 2012-12-04 | Google Inc. | Personalized term importance evaluation in queries
CN104866496A (en) * | 2014-02-22 | 2015-08-26 | Tencent Technology (Shenzhen) Co., Ltd. | Method and device for determining morpheme significance analysis model
CN104376065A (en) * | 2014-11-05 | 2015-02-25 | Baidu Online Network Technology (Beijing) Co., Ltd. | Determination method and device for importance degree of search word
CN104615723A (en) * | 2015-02-06 | 2015-05-13 | Baidu Online Network Technology (Beijing) Co., Ltd. | Determining method and device of search term weight value
Non-Patent Citations (1)
Title
---
JIAWEI HAN et al.: "Data Mining: Concepts and Techniques", 31 August 2001, China Machine Press *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN109947902A (en) * | 2019-03-06 | 2019-06-28 | Tencent Technology (Shenzhen) Co., Ltd. | Data query method, apparatus and readable medium
CN109947902B (en) * | 2019-03-06 | 2021-03-26 | Tencent Technology (Shenzhen) Co., Ltd. | Data query method and device and readable medium
CN111079439A (en) * | 2019-12-11 | 2020-04-28 | Rajax Network Technology (Shanghai) Co., Ltd. | Abnormal information identification method and device, electronic equipment and computer storage medium
Similar Documents
Publication | Title
---|---
CN105678587B (en) | Recommendation feature determination method, information recommendation method and device
US20200293924A1 (en) | GBDT model feature interpretation method and apparatus
US20170228652A1 (en) | Method and apparatus for evaluating predictive model
CN102262647B (en) | Information processing apparatus, information processing method and program
CN107220845A (en) | User repurchase probability prediction/user quality determination method, device and electronic equipment
CN109685537B (en) | User behavior analysis method, device, medium and electronic equipment
CN109903095A (en) | Data processing method, device, electronic equipment and computer readable storage medium
CN111598338B (en) | Method, apparatus, medium, and electronic device for updating prediction model
CN109685574A (en) | Data determination method and device, electronic equipment and computer readable storage medium
CN110209782B (en) | Question-answering model and answer sentence generation method and device, medium and electronic equipment
CN112148986B (en) | Top-N service re-recommendation method and system based on crowdsourcing
CN113407854A (en) | Application recommendation method, device and equipment and computer readable storage medium
CN109636530A (en) | Product determination method, product determination device, electronic equipment and computer-readable storage medium
CN117688155A (en) | Service problem replying method and device, storage medium and electronic equipment
CN107992570A (en) | Character string mining method, device, electronic equipment and computer-readable recording medium
CN107885879A (en) | Semantic analysis, device, electronic equipment and computer-readable recording medium
CN107844584A (en) | Usage mining method, apparatus, electronic equipment and computer-readable recording medium
CN109344347B (en) | Display control method, display control device, electronic equipment and computer-readable storage medium
CN109460474B (en) | User preference trend mining method
CN110880117A (en) | False service identification method, device, equipment and storage medium
US8257091B2 (en) | Matching learning objects with a user profile using top-level concept complexity
CN108090785B (en) | Method and device for determining user behavior decline tendency and electronic equipment
CN116664306A (en) | Intelligent recommendation method and device for wind control rules, electronic equipment and medium
CN110659954A (en) | Cheating identification method and device, electronic equipment and readable storage medium
CN114218259B (en) | Multi-dimensional scientific information search method and system based on big data SaaS
Legal Events
Code | Title | Description
---|---|---
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180406