Disclosure of Invention
To solve the problems of the prior art, the invention provides a method and a device for extracting tag information. The technical solutions are as follows:
In a first aspect, the present invention provides a method for extracting tag information, the method including:
segmenting words of text information to obtain a candidate phrase set, wherein the candidate phrase set comprises at least one candidate phrase, and each candidate phrase comprises at least one keyword;
for each candidate phrase in the candidate phrase set, determining the score of the sentence in which the candidate phrase is located, determining the score of the position of the sentence in which the candidate phrase is located, determining the first score of each keyword included in the candidate phrase, and determining the score of the candidate phrase according to the score of the sentence in which the candidate phrase is located, the score of the position of the sentence in which the candidate phrase is located and the first score of each keyword included in the candidate phrase;
selecting a preset number of candidate phrases with highest scores from the candidate phrase set based on the score of each candidate phrase;
and forming the label information of the text information by the preset number of candidate phrases.
In one possible implementation, the determining the first score of each keyword included in the candidate phrase includes:
determining, for each keyword, a first occurrence number and a second occurrence number, wherein the first occurrence number is the number of occurrences of the keyword in the text information, and the second occurrence number is the total number of occurrences of all keywords included in the text information;
determining the word frequency (term frequency) of the keyword according to the first occurrence number and the second occurrence number;
determining a first quantity and a second quantity, wherein the first quantity is the quantity of sample text information included in a sample text information base, and the second quantity is the quantity of sample text information in the sample text information base that includes the keyword;
determining the reverse file frequency (inverse document frequency) of the keyword according to the first quantity and the second quantity;
and determining a first score of the keyword according to the word frequency and the reverse file frequency.
In one possible implementation manner, the determining the score of the position of the sentence in which the candidate phrase is located includes:
determining a first position, in the text information, of the paragraph containing the sentence in which the candidate phrase is located, and a second position of the candidate phrase in the paragraph;
and determining the score of the position of the sentence in which the candidate phrase is located according to the first position and the second position.
In one possible implementation manner, the determining the score of the candidate phrase according to the score of the sentence in which the candidate phrase is located, the score of the position of the sentence in which the candidate phrase is located, and the first score of each keyword included in the candidate phrase includes:
determining a first weight corresponding to a sentence in which the candidate phrase is located, a second weight corresponding to the position of the sentence in which the candidate phrase is located, and a third weight corresponding to each keyword included in the candidate phrase;
for each keyword included in the candidate phrase, multiplying the score of the sentence in which the candidate phrase is located by the first weight to obtain a first numerical value, multiplying the score of the position of the sentence in which the candidate phrase is located by the second weight to obtain a second numerical value, multiplying the first score of the keyword by a third weight corresponding to the keyword to obtain a third numerical value, and adding the first numerical value, the second numerical value and the third numerical value to obtain a second score of the keyword;
and determining the score of the candidate phrase according to the second score of each keyword.
In a possible implementation manner, the adding the first numerical value, the second numerical value, and the third numerical value to obtain a second score of the keyword includes:
determining the contribution degree of the keyword, and determining a fourth weight corresponding to the keyword according to the contribution degree of the keyword;
adding the first numerical value, the second numerical value and the third numerical value to obtain a fourth numerical value;
and multiplying the fourth numerical value and the fourth weight to obtain a second score of the keyword.
In one possible implementation manner, the forming the preset number of candidate phrases into the tag information of the text information includes:
selecting candidate phrases of a concept type from the preset number of candidate phrases to form concept label information; and/or
selecting candidate phrases of an event type from the preset number of candidate phrases to form event tag information.
In one possible implementation, the method further includes:
moving candidate phrases ending with keywords with preset parts of speech in the concept label information to the event label information;
and moving the candidate phrases which do not contain the keywords of the preset part of speech in the event label information to the concept label information.
In a possible implementation manner, the segmenting the text information to obtain a candidate phrase set includes:
performing sentence segmentation on the text information to obtain at least one candidate sentence, and forming the at least one candidate sentence into a candidate sentence set;
segmenting each candidate sentence in the candidate sentence set to obtain at least one keyword, and forming the at least one keyword into a keyword set;
generating at least one candidate phrase from the keywords in the keyword set based on a syntax tree algorithm;
and forming the at least one candidate phrase into the candidate phrase set.
In a possible implementation manner, before performing word segmentation on each candidate sentence in the candidate sentence set to obtain at least one keyword, the method further includes:
determining a sentence component of each candidate sentence in the set of candidate sentences;
and deleting the candidate sentences of which the sentence components are preset components in the candidate sentence set according to the sentence components of each candidate sentence.
In a second aspect, the present invention provides an apparatus for extracting tag information, the apparatus comprising:
the word segmentation module is used for segmenting words of the text information to obtain a candidate phrase set, wherein the candidate phrase set comprises at least one candidate phrase, and each candidate phrase comprises at least one keyword;
the scoring module is used for determining, for each candidate phrase in the candidate phrase set, the score of the sentence in which the candidate phrase is located, the score of the position of the sentence in which the candidate phrase is located, and the first score of each keyword included in the candidate phrase, and for determining the score of the candidate phrase according to the score of the sentence in which the candidate phrase is located, the score of the position of the sentence in which the candidate phrase is located and the first score of each keyword included in the candidate phrase;
a selection module, configured to select a preset number of candidate phrases with highest scores from the candidate phrase set based on the score of each candidate phrase;
and the composition module is used for composing the preset number of candidate phrases into label information of the text information.
In a possible implementation manner, the scoring module is further configured to: determine, for each keyword, a first occurrence number and a second occurrence number, where the first occurrence number is the number of occurrences of the keyword in the text information, and the second occurrence number is the total number of occurrences of all keywords included in the text information; determine the word frequency of the keyword according to the first occurrence number and the second occurrence number; determine a first quantity and a second quantity, where the first quantity is the quantity of sample text information included in a sample text information base, and the second quantity is the quantity of sample text information in the sample text information base that includes the keyword; determine the reverse file frequency of the keyword according to the first quantity and the second quantity; and determine a first score of the keyword according to the word frequency and the reverse file frequency.
In one possible implementation manner, the scoring module is further configured to: determine a first position, in the text information, of the paragraph containing the sentence in which the candidate phrase is located, and a second position of the candidate phrase in the paragraph; and determine the score of the position of the sentence in which the candidate phrase is located according to the first position and the second position.
In a possible implementation manner, the scoring module is further configured to determine a first weight corresponding to a sentence in which the candidate phrase is located, a second weight corresponding to a position of the sentence in which the candidate phrase is located, and a third weight corresponding to each keyword included in the candidate phrase; for each keyword included in the candidate phrase, multiplying the score of the sentence in which the candidate phrase is located by the first weight to obtain a first numerical value, multiplying the score of the position of the sentence in which the candidate phrase is located by the second weight to obtain a second numerical value, multiplying the first score of the keyword by a third weight corresponding to the keyword to obtain a third numerical value, and adding the first numerical value, the second numerical value and the third numerical value to obtain a second score of the keyword; and determining the score of the candidate phrase according to the second score of each keyword.
In a possible implementation manner, the scoring module is further configured to determine a contribution degree of the keyword, and determine a fourth weight corresponding to the keyword according to the contribution degree of the keyword; adding the first numerical value, the second numerical value and the third numerical value to obtain a fourth numerical value; and multiplying the fourth numerical value and the fourth weight to obtain a second score of the keyword.
In a possible implementation manner, the composition module is further configured to select candidate phrases of a concept type from the preset number of candidate phrases to compose concept tag information; and/or to select candidate phrases of an event type from the preset number of candidate phrases to compose event tag information.
In one possible implementation, the apparatus further includes:
a moving module, configured to move a candidate phrase ending with a keyword of a preset part of speech in the concept tag information to the event tag information; and/or
the moving module is further configured to move a candidate phrase in the event tag information that does not include a keyword of the preset part of speech to the concept tag information.
In a possible implementation manner, the word segmentation module is further configured to perform sentence segmentation on the text information to obtain at least one candidate sentence, and form the at least one candidate sentence into a candidate sentence set; segmenting each candidate sentence in the candidate sentence set to obtain at least one keyword, and forming the at least one keyword into a keyword set; generating at least one candidate phrase from the keywords in the keyword set based on a syntax tree algorithm; and forming the at least one candidate phrase into the candidate phrase set.
In one possible implementation, the word segmentation module is further configured to determine a sentence component of each candidate sentence in the candidate sentence set; and deleting the candidate sentences of which the sentence components are preset components in the candidate sentence set according to the sentence components of each candidate sentence.
In the embodiment of the invention, word segmentation is performed on the text information to obtain a candidate phrase set that comprises at least one candidate phrase, and label information is extracted based on the candidate phrases in the candidate phrase set, so that multi-element label information can be extracted. For each candidate phrase in the candidate phrase set, the score of the sentence in which the candidate phrase is located, the score of the position of that sentence, and the first score of each keyword included in the candidate phrase are determined, and the score of the candidate phrase is determined from these three scores. Because sentence scoring, position scoring and keyword scoring are combined, the candidate phrases are scored more accurately, which in turn improves the accuracy of the extracted tag information.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
The embodiment of the invention provides a method for extracting tag information, and referring to fig. 1, the method comprises the following steps:
Step 101: performing word segmentation on text information to obtain a candidate phrase set, wherein the candidate phrase set comprises at least one candidate phrase, and each candidate phrase comprises at least one keyword.
Step 102: for each candidate phrase in the candidate phrase set, determining the score of the sentence in which the candidate phrase is located, determining the score of the position of the sentence in which the candidate phrase is located, determining the first score of each keyword included in the candidate phrase, and determining the score of the candidate phrase according to the score of the sentence in which the candidate phrase is located, the score of the position of the sentence in which the candidate phrase is located and the first score of each keyword included in the candidate phrase.
Step 103: based on the score of each candidate phrase, a preset number of candidate phrases with the highest scores are selected from the set of candidate phrases.
Step 104: forming the preset number of candidate phrases into the label information of the text information.
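Steps 103 and 104 can be illustrated with a minimal Python sketch. It assumes that Step 102 has already produced a score for each candidate phrase; the function name `select_tag_phrases` and the sample scores are hypothetical and are not taken from the embodiment.

```python
import heapq


def select_tag_phrases(phrase_scores, preset_number):
    """Return the `preset_number` highest-scoring candidate phrases.

    phrase_scores: dict mapping candidate phrase -> score (output of Step 102).
    """
    # heapq.nlargest returns the items in descending order of the key.
    top = heapq.nlargest(preset_number, phrase_scores.items(), key=lambda kv: kv[1])
    # Step 104: the selected phrases together form the label information.
    return [phrase for phrase, _ in top]


# Hypothetical scores, for illustration only.
scores = {
    "tag extraction": 0.82,
    "candidate phrase": 0.67,
    "server": 0.15,
    "syntax tree": 0.54,
}
tags = select_tag_phrases(scores, 2)
```

If the candidate phrase set holds fewer phrases than the preset number, this sketch simply returns all of them.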
In one possible implementation, determining a first score for each keyword included in the candidate phrase includes:
for each keyword, determining a first occurrence number and a second occurrence number, wherein the first occurrence number is the number of occurrences of the keyword in the text information, and the second occurrence number is the total number of occurrences of all keywords included in the text information;
determining the word frequency of the keyword according to the first occurrence number and the second occurrence number;
determining a first quantity and a second quantity, wherein the first quantity is the quantity of sample text information included in a sample text information base, and the second quantity is the quantity of text information including the keyword in the sample text information base;
determining the reverse file frequency of the keyword according to the first quantity and the second quantity;
and determining a first score of the keyword according to the word frequency and the reverse file frequency.
In one possible implementation, determining a score of a position of a sentence in which the candidate phrase is located includes:
determining a first position of a paragraph in which the sentence of the candidate phrase is located in the text information and a second position of the candidate phrase in the paragraph;
and determining the score of the position of the sentence in which the candidate phrase is positioned according to the first position and the second position.
In one possible implementation manner, determining the score of the candidate phrase according to the score of the sentence in which the candidate phrase is located, the score of the position of the sentence in which the candidate phrase is located, and the first score of each keyword included in the candidate phrase includes:
determining a first weight corresponding to a sentence in which the candidate phrase is located, a second weight corresponding to the position of the sentence in which the candidate phrase is located, and a third weight corresponding to each keyword included in the candidate phrase;
for each keyword included in the candidate phrase, multiplying the score of the sentence in which the candidate phrase is located by the first weight to obtain a first numerical value, multiplying the score of the position of the sentence in which the candidate phrase is located by the second weight to obtain a second numerical value, multiplying the first score of the keyword by a third weight corresponding to the keyword to obtain a third numerical value, and adding the first numerical value, the second numerical value and the third numerical value to obtain a second score of the keyword;
and determining the score of the candidate phrase according to the second score of each keyword.
In one possible implementation manner, adding the first value, the second value, and the third value to obtain a second score of the keyword, includes:
determining the contribution degree of the keyword, and determining a fourth weight corresponding to the keyword according to the contribution degree of the keyword;
adding the first numerical value, the second numerical value and the third numerical value to obtain a fourth numerical value;
and multiplying the fourth numerical value and the fourth weight to obtain a second score of the keyword.
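The weighted combination described above can be sketched as follows. The arithmetic for the second score follows the text; how the second scores of the keywords are aggregated into the score of the candidate phrase is not specified in this passage, so the sum used in `phrase_score` is an assumption, and all function names are hypothetical.

```python
def keyword_second_score(sentence_score, position_score, first_score,
                         first_weight, second_weight, third_weight,
                         fourth_weight=1.0):
    """Second score of one keyword, following the description above."""
    first_value = sentence_score * first_weight
    second_value = position_score * second_weight
    third_value = first_score * third_weight
    fourth_value = first_value + second_value + third_value
    # The fourth weight reflects the contribution degree of the keyword;
    # with the default of 1.0 the contribution degree is ignored.
    return fourth_value * fourth_weight


def phrase_score(second_scores):
    """Score of a candidate phrase from the second scores of its keywords.

    Assumption: the scores are summed. The text only states that the
    phrase score is determined according to the second scores.
    """
    return sum(second_scores)


# Hypothetical inputs, for illustration only.
s1 = keyword_second_score(0.5, 0.27, 0.2, 0.4, 0.3, 0.3, fourth_weight=2.0)
s2 = keyword_second_score(0.5, 0.27, 0.1, 0.4, 0.3, 0.3)
total = phrase_score([s1, s2])
```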
In one possible implementation manner, the forming the preset number of candidate phrases into the tag information of the text information includes:
selecting candidate phrases of a concept type from the preset number of candidate phrases to form concept label information; and/or
selecting candidate phrases of the event type from the preset number of candidate phrases to form event tag information.
In one possible implementation, the method further includes:
moving candidate phrases ending with keywords with preset parts of speech in the concept label information to the event label information;
and moving the candidate phrases which do not contain the keywords of the preset part of speech in the event label information to the concept label information.
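The two moves between the concept label information and the event label information can be sketched as below. The sketch assumes that each candidate phrase is a list of (keyword, part-of-speech) pairs and that the preset part of speech is a verb tagged "v"; both the representation and the tag name are assumptions made for illustration.

```python
def rebalance_tags(concept_tags, event_tags, preset_pos="v"):
    """Move phrases between concept and event label information.

    Each phrase is a non-empty list of (keyword, part_of_speech) pairs.
    """
    def ends_with_preset(phrase):
        return phrase[-1][1] == preset_pos

    def contains_preset(phrase):
        return any(pos == preset_pos for _, pos in phrase)

    # Both rules are evaluated against the original lists, so a phrase
    # that has just been moved is not examined a second time.
    to_event = [p for p in concept_tags if ends_with_preset(p)]
    to_concept = [p for p in event_tags if not contains_preset(p)]

    new_concept = [p for p in concept_tags if not ends_with_preset(p)] + to_concept
    new_event = [p for p in event_tags if contains_preset(p)] + to_event
    return new_concept, new_event


# Hypothetical tagged phrases, for illustration only.
concept = [[("market", "n"), ("rise", "v")], [("stock", "n"), ("market", "n")]]
event = [[("company", "n"), ("release", "v"), ("report", "n")],
         [("annual", "a"), ("report", "n")]]
new_concept, new_event = rebalance_tags(concept, event)
```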
In one possible implementation, segmenting the text information to obtain a candidate phrase set includes:
carrying out sentence breaking on the text information to obtain at least one candidate sentence, and forming the at least one candidate sentence into a candidate sentence set;
segmenting each candidate sentence in the candidate sentence set to obtain at least one keyword, and forming the at least one keyword into a keyword set;
generating at least one candidate phrase from the keywords in the keyword set based on a syntax tree algorithm;
and forming the at least one candidate phrase into the candidate phrase set.
In a possible implementation manner, before performing word segmentation on each candidate sentence in the candidate sentence set to obtain at least one keyword, the method further includes:
determining a sentence component of each candidate sentence in the set of candidate sentences;
and deleting the candidate sentences of which the sentence components are preset components in the candidate sentence set according to the sentence components of each candidate sentence.
In the embodiment of the invention, word segmentation is performed on the text information to obtain a candidate phrase set that comprises at least one candidate phrase, and label information is extracted based on the candidate phrases in the candidate phrase set, so that multi-element label information can be extracted. For each candidate phrase in the candidate phrase set, the score of the sentence in which the candidate phrase is located, the score of the position of that sentence, and the first score of each keyword included in the candidate phrase are determined, and the score of the candidate phrase is determined from these three scores. Because sentence scoring, position scoring and keyword scoring are combined, the candidate phrases are scored more accurately, which in turn improves the accuracy of the extracted tag information.
The embodiment of the invention provides a method for extracting tag information, which is applied to a server and is shown in fig. 2, and the method comprises the following steps:
Step 201: the server performs word segmentation on the text information to obtain a candidate phrase set, wherein the candidate phrase set comprises at least one candidate phrase, and each candidate phrase comprises at least one keyword.
In order to improve reading efficiency of a user, before the user acquires text information from a server through a terminal, the server extracts label information from the text information, wherein the label information is used for indicating the theme of the text information. When a user acquires text information from the server through the terminal, the server sends label information of the text information to the terminal. The terminal receives the label information of the text information sent by the server and displays the label information of the text information, so that the user can quickly know the gist of the text information according to the label information. The text information can be any text information including character information; for example, the text information may be electronic news information, social network information, commodity review information, web page information, mail information, and the like; in the embodiment of the present invention, the text information is not particularly limited.
When the server performs word segmentation on the text information to obtain a candidate phrase set, the server can perform word segmentation on the text information to obtain a keyword set and generate the candidate phrase set based on the keyword set. Accordingly, this step can be implemented through the following steps (1) to (4):
(1): The server performs sentence segmentation on the text information to obtain at least one candidate sentence, and forms the at least one candidate sentence into a candidate sentence set.
For understanding an article, the main clause contributes far more than subordinate clauses do. Moreover, when candidate phrases are subsequently extracted through the syntax tree algorithm, the extraction speed is strongly affected by sentence length. To speed up the extraction of candidate phrases, the server therefore filters out such clauses. Correspondingly, after the server forms the at least one candidate sentence into the candidate sentence set, the method further includes:
The server determines a sentence component of each candidate sentence in the candidate sentence set and, according to the sentence component of each candidate sentence, deletes from the candidate sentence set the candidate sentences whose sentence component is a preset component.
The preset component may be an idiom, a fixed expression, or the like. In the embodiment of the present invention, the preset component is not particularly limited.
(2): the server carries out word segmentation on each candidate sentence in the candidate sentence set to obtain at least one keyword, and the at least one keyword forms a keyword set.
Keywords such as "of", "bar" and "o" contribute little to the text information. Therefore, to reduce the amount of computation and improve accuracy, the server may remove such keywords in this step. Accordingly, after the server forms the at least one keyword into the keyword set, the method further includes:
The server marks the part of speech of each keyword in the keyword set, searches the keyword set for keywords of a first preset part of speech according to the part of speech of each keyword, and removes the keywords of the first preset part of speech from the keyword set. The first preset part of speech may be an auxiliary word, a preposition, a modal particle or a numeral. In the embodiment of the present invention, the first preset part of speech is not specifically limited.
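The part-of-speech filtering can be sketched as follows, assuming that a part-of-speech tagger has already labelled each keyword. The tag names in `FIRST_PRESET_POS` are hypothetical placeholders; a real tagger would use its own tag set.

```python
# Hypothetical tag names for auxiliary words, prepositions,
# modal particles and numerals.
FIRST_PRESET_POS = {"aux", "prep", "modal", "num"}


def remove_preset_pos(tagged_keywords, preset_pos=FIRST_PRESET_POS):
    """Drop keywords whose part of speech is a first preset part of speech.

    tagged_keywords: list of (keyword, part_of_speech) pairs.
    """
    return [(word, pos) for word, pos in tagged_keywords if pos not in preset_pos]


# Hypothetical tagged keyword set, for illustration only.
tagged = [("server", "n"), ("of", "aux"), ("extracts", "v"),
          ("three", "num"), ("tags", "n")]
kept = remove_preset_pos(tagged)
```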
(3): The server generates at least one candidate phrase from the keywords in the keyword set based on a syntax tree algorithm.
The server inputs each keyword in the keyword set into a syntax tree model, wherein the syntax tree model comprises a syntax tree algorithm; and generating a keyword tree from the keywords in the keyword set through the syntax tree algorithm. The keyword tree comprises a plurality of nodes and relations among the plurality of nodes; one node in the keyword tree is a keyword. The server generates at least one candidate phrase based on the keyword tree.
The keyword tree includes keywords and the relationships between them, and only keywords that are related to one another can constitute a candidate phrase. Correspondingly, the step of generating, by the server, at least one candidate phrase based on the keyword tree may be:
for each leaf node in the keyword tree, the server selects keywords of the leaf node and keywords of a parent node of the leaf node from the keyword tree to form a candidate phrase.
In this step, the server may also generate a candidate phrase from the keywords of all ancestor nodes of the leaf node. Correspondingly, the step of generating at least one candidate phrase by the server based on the keyword tree may be:
for each leaf node in the keyword tree, the server selects the keyword of the leaf node from the keyword tree, acquires the keyword of the parent node of the leaf node, then acquires the keyword of the parent node of that parent node, and so on until the keyword of the root node is acquired; and forms the acquired keywords into a candidate phrase.
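Both ways of reading candidate phrases off the keyword tree can be sketched as below. The sketch assumes the keyword tree is given as a child-to-parent mapping with unique keywords, and that keywords are emitted leaf first; the construction of the tree by the syntax tree model is not shown, and all names are hypothetical.

```python
def leaf_nodes(parent_of):
    """Keywords that are not the parent of any other keyword."""
    parents = {p for p in parent_of.values() if p is not None}
    return [node for node in parent_of if node not in parents]


def leaf_parent_phrases(parent_of):
    """First variant: each leaf keyword combined with its parent keyword."""
    phrases = []
    for leaf in leaf_nodes(parent_of):
        parent = parent_of[leaf]
        phrases.append((leaf,) if parent is None else (leaf, parent))
    return phrases


def leaf_to_root_phrases(parent_of):
    """Second variant: walk from each leaf up to the root."""
    phrases = []
    for leaf in leaf_nodes(parent_of):
        path, node = [], leaf
        while node is not None:
            path.append(node)
            node = parent_of[node]
        phrases.append(tuple(path))
    return phrases


# Hypothetical keyword tree: the root maps to None.
tree = {"extract": None, "server": "extract",
        "information": "extract", "tag": "information"}
short_phrases = leaf_parent_phrases(tree)
long_phrases = leaf_to_root_phrases(tree)
```

The second variant yields longer phrases for deep leaves, since every ancestor keyword up to the root is included.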
In the embodiment of the invention, the candidate phrases are extracted through the syntax tree algorithm, so that the extracted candidate phrases are multi-element and the keywords included in each candidate phrase are semantically coherent, which improves the accuracy of the candidate label information generated from the candidate phrases.
(4): the server composes the at least one candidate phrase into a set of candidate phrases.
After the server obtains the set of candidate phrases, a score is determined for each candidate phrase in the set of candidate phrases by the following step 202.
Step 202: for each candidate phrase in the candidate phrase set, the server determines the score of the sentence in which the candidate phrase is located, the score of the position of the sentence in which the candidate phrase is located and the first score of each keyword included in the candidate phrase, and determines the score of the candidate phrase according to the score of the sentence in which the candidate phrase is located, the score of the position of the sentence in which the candidate phrase is located and the first score of each keyword included in the candidate phrase.
For each candidate phrase in the set of candidate phrases, the server may determine a score for the candidate phrase by the following steps (1) to (4).
(1): the server determines a score for the sentence in which the candidate phrase is located.
The server acquires the sentence in which the candidate phrase is located and determines the score of that sentence by the BM25 algorithm.
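A minimal sketch of BM25 sentence scoring is given below. The embodiment does not state what the sentence is scored against, so the sketch assumes the query is a list of keywords of the text information and the corpus is the candidate sentence set. The parameter values k1 = 1.5 and b = 0.75 and the smoothed IDF term are common choices, not values taken from the embodiment.

```python
import math
from collections import Counter


def bm25_score(query_terms, sentence, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of one tokenized sentence for a list of query terms.

    sentence: list of tokens; corpus: non-empty list of token lists.
    """
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    term_freq = Counter(sentence)
    score = 0.0
    for term in query_terms:
        # Number of sentences in the corpus that contain the term.
        doc_freq = sum(1 for doc in corpus if term in doc)
        # Smoothed IDF; the "+ 1.0" keeps the value positive.
        idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
        freq = term_freq[term]
        # Length normalization: longer sentences are penalized.
        denom = freq + k1 * (1.0 - b + b * len(sentence) / avg_len)
        score += idf * freq * (k1 + 1.0) / denom
    return score


# Hypothetical tokenized candidate sentences, for illustration only.
corpus = [["server", "extracts", "tag", "information"],
          ["user", "reads", "text"],
          ["terminal", "displays", "tag"]]
query = ["tag", "information"]
sentence_scores = [bm25_score(query, s, corpus) for s in corpus]
```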
(2): the server determines a score for the position of the sentence in which the candidate phrase is located.
Sentences at different positions have different scores. For example, sentences in the abstract better reflect the gist of the text information and therefore correspond to higher scores, whereas sentences in the body correspond to lower scores. Therefore, before this step, the correspondence between position and score is stored in the server. The position may be the abstract or the body. Correspondingly, this step may be:
the server determines the position, in the text information, of the sentence in which the candidate phrase is located, and acquires the score of that position from the correspondence between position and score.
In this step, the server may alternatively calculate the score of the position of the sentence in which the candidate phrase is located from that position. Accordingly, this step may be:
the server determines a first position, in the text information, of the paragraph containing the sentence in which the candidate phrase is located, and a second position of the candidate phrase in the paragraph; and determines the score of the position of the sentence in which the candidate phrase is located according to the first position and the second position.
The first position may be the title, the abstract or the body text. The second position may be the first sentence of the first paragraph, a non-first sentence of the first paragraph, the first sentence of a non-first paragraph, or a non-first sentence of a non-first paragraph. Correspondingly, the step of determining, by the server, the score of the position of the sentence in which the candidate phrase is located according to the first position and the second position may be:
the server acquires a first sub-score corresponding to the first position from the corresponding relation between the first position and the sub-score according to the first position; and according to the second position, acquiring a second sub-score corresponding to the second position from the corresponding relation between the second position and the sub-score, and multiplying the first sub-score and the second sub-score to obtain the score of the position of the sentence where the candidate phrase is located.
For example, when the first position is the title, the first sub-score is 0.9; when the first position is the abstract, the first sub-score is 0.6; and when the first position is the body text, the first sub-score is 0.3. When the second position is the first sentence of the first paragraph, the second sub-score is 0.3; when the second position is a non-first sentence of the first paragraph, the second sub-score is 0.3 × 0.4 = 0.12; when the second position is the first sentence of a non-first paragraph, the second sub-score is 0.1; and when the second position is a non-first sentence of a non-first paragraph, the second sub-score is 0.1 × 0.4 = 0.04.
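The table lookup and multiplication can be sketched as follows, using the example sub-scores given above. The dictionary keys and the function name are hypothetical labels chosen for this sketch.

```python
# First sub-score, keyed by the first position.
FIRST_SUB_SCORE = {"title": 0.9, "abstract": 0.6, "body": 0.3}

# Second sub-score, keyed by (paragraph, sentence) within the second position.
SECOND_SUB_SCORE = {
    ("first_paragraph", "first_sentence"): 0.3,
    ("first_paragraph", "other_sentence"): 0.3 * 0.4,
    ("other_paragraph", "first_sentence"): 0.1,
    ("other_paragraph", "other_sentence"): 0.1 * 0.4,
}


def position_score(first_position, paragraph, sentence):
    """Multiply the first sub-score by the second sub-score."""
    return FIRST_SUB_SCORE[first_position] * SECOND_SUB_SCORE[(paragraph, sentence)]


# Abstract, non-first sentence of the first paragraph: 0.6 * 0.12.
example = position_score("abstract", "first_paragraph", "other_sentence")
```

Because the values are floating-point numbers, comparisons against expected results are best made with a small tolerance rather than exact equality.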
(3): the server determines a first score for each keyword that the candidate phrase includes.
For each keyword included in the candidate phrase, the server determines a first score for the keyword by the following steps (3-1) to (3-5), including:
(3-1): the server determines a first occurrence number and a second occurrence number, wherein the first occurrence number is the occurrence number of the keyword in the text information, and the second occurrence number is the total occurrence number of each keyword included in the text information.
(3-2): the server determines the word frequency of the keyword according to the first occurrence number and the second occurrence number.
The server determines the ratio of the first occurrence number to the second occurrence number as the word frequency of the keyword.
(3-3): the server determines a first quantity and a second quantity, wherein the first quantity is the quantity of the sample text information included in the sample text information base, and the second quantity is the quantity of the text information including the keyword in the sample text information base.
The server stores a sample text information base in advance, and the sample text information base comprises at least one sample text information. In this step, the server counts the number of sample text messages included in the sample text message library, which is referred to as a first number for convenience of description. The server counts the number of text messages including the keyword in the sample text information base, and the number is called a second number.
(3-4): the server determines the inverse document frequency of the keyword according to the first quantity and the second quantity.
The server determines the inverse document frequency of the keyword according to the first quantity and the second quantity through the following formula I:
Formula I: $idf = \log\left(\frac{D}{J}\right)$, where idf is the inverse document frequency of the keyword, D is the first quantity, and J is the second quantity.
Since the sample text information base may contain no text information that includes the keyword, the second quantity may be zero. Therefore, this step may instead be:
the server determines the inverse document frequency of the keyword according to the first quantity and the second quantity through the following formula II:
Formula II: $idf = \log\left(\frac{D}{J + 1}\right)$, where idf is the inverse document frequency of the keyword, D is the first quantity, and J is the second quantity.
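The difference between the two formulas can be illustrated with a short sketch. It assumes that formula I is the conventional log(D / J) and that formula II adds 1 to the denominator so that a second quantity of zero is handled; the natural logarithm and the function names are also assumptions, since the text does not fix them.

```python
import math


def idf_formula_one(d, j):
    """Assumed form of formula I: log(D / J). Undefined when J is zero."""
    if j == 0:
        raise ValueError("second quantity J is zero; formula I is undefined")
    return math.log(d / j)


def idf_formula_two(d, j):
    """Assumed form of formula II: log(D / (J + 1)). Defined for J = 0."""
    return math.log(d / (j + 1))
```

With D = 100 and J = 0, `idf_formula_one` raises an error, whereas `idf_formula_two` returns log(100).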
(3-5): the server determines a first score of the keyword according to the word frequency and the inverse document frequency of the keyword.
The server determines the first score of the keyword through a first preset algorithm according to the word frequency and the inverse document frequency of the keyword.
The first preset algorithm may be set and changed as needed, and in the embodiment of the present invention, the first preset algorithm is not specifically limited. For example, the first preset algorithm may be multiplication, addition, subtraction, division, weighted multiplication, or weighted division.
When the first preset algorithm is multiplication, this step may be:
the server multiplies the word frequency of the keyword by the inverse document frequency of the keyword to obtain the first score of the keyword.
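Steps (3-1) to (3-5) with multiplication as the first preset algorithm can be sketched as below. The sketch assumes the smoothed form log(D / (J + 1)) with a natural logarithm for the inverse document frequency; the function name `first_scores` and the representation of each sample text as a set of keywords are hypothetical.

```python
import math
from collections import Counter


def first_scores(doc_keywords, sample_docs):
    """Return {keyword: word frequency * inverse document frequency}.

    doc_keywords: list of keywords segmented from the text information.
    sample_docs: list of sets, one set of keywords per sample text.
    """
    counts = Counter(doc_keywords)      # first occurrence number per keyword
    total = sum(counts.values())        # second occurrence number
    d = len(sample_docs)                # first quantity
    scores = {}
    for keyword, n in counts.items():
        tf = n / total
        j = sum(1 for doc in sample_docs if keyword in doc)  # second quantity
        idf = math.log(d / (j + 1))     # assumed smoothed form
        scores[keyword] = tf * idf
    return scores
```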
(4): the server determines the score of the candidate phrase according to the score of the sentence in which the candidate phrase is located, the score of the position of the sentence in which the candidate phrase is located, and the first score of each keyword included in the candidate phrase.
In this step, the server may determine the score of the candidate phrase directly according to the score of the sentence in which the candidate phrase is located, the score of the position of the sentence in which the candidate phrase is located, and the first score of each keyword included in the candidate phrase; this is the first implementation manner described below. The server may also set a first weight, a second weight, and a third weight for the sentence in which the candidate phrase is located, the position of the sentence in which the candidate phrase is located, and each keyword included in the candidate phrase, respectively, and determine the score of the candidate phrase based on the three scores and the first weight, the second weight, and the third weight; this is the second implementation manner described below.
For the first implementation, this step may be implemented by the following steps (4-1) to (4-2), including:
(4-1): for each keyword included in the candidate phrase, the server determines a second score of the keyword through a second preset algorithm according to the score of the sentence in which the candidate phrase is located, the score of the position of the sentence in which the candidate phrase is located and the first score of the keyword.
The second preset algorithm may be set and changed as needed, and in the embodiment of the present invention, the second preset algorithm is not specifically limited. For example, the second predetermined algorithm may be multiplication, addition, subtraction, division, weighted multiplication, or weighted division.
When the second preset algorithm is addition, this step may be:
the server adds the score of the sentence in which the candidate phrase is located, the score of the position of the sentence in which the candidate phrase is located, and the first score of the keyword to obtain the second score of the keyword.
It should be noted that a keyword consisting of a single word contributes little to the understanding of the article. Therefore, when the server determines the second score of a single-word keyword, the server reduces the weight of the keyword. A word that appears in quotation marks multiple times in the text information contributes greatly to the understanding of the article. Therefore, when the server determines the second score of a keyword that appears in quotation marks multiple times, the server increases the weight of the keyword. Correspondingly, this step may be:
for each keyword included in the candidate phrase, the server determines the contribution degree of the keyword; determines a fourth weight corresponding to the keyword according to the contribution degree; obtains a fifth numerical value through the second preset algorithm according to the score of the sentence in which the candidate phrase is located, the score of the position of the sentence in which the candidate phrase is located, and the first score of the keyword; and multiplies the fifth numerical value by the fourth weight to obtain the second score of the keyword.
The server determines the contribution degree of the keyword according to the word number of the keyword and/or the importance of the keyword, and the specific process may be as follows:
the server takes the word number of the keyword as the contribution degree of the keyword, where the greater the word number, the higher the contribution degree; or,
the server takes the importance of the keyword as the contribution degree of the keyword, where the higher the importance, the higher the contribution degree; or,
the server weights the word number and the importance of the keyword to obtain the contribution degree of the keyword.
The importance of the keyword may be represented by the number of occurrences of the keyword and/or by whether the keyword is marked, for example by highlighting.
The server stores a correspondence between contribution degrees and weights; correspondingly, the step of determining, by the server, the fourth weight corresponding to the keyword according to the contribution degree may be:
the server acquires, according to the contribution degree, the fourth weight corresponding to the keyword from the correspondence between contribution degrees and weights.
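A minimal sketch of step (4-1) with addition as the second preset algorithm and a fourth weight applied afterwards. The weight values 0.5, 1.0 and 1.5, the threshold of two quoted occurrences, and both function names are hypothetical; the text only states that such a correspondence is stored in the server.

```python
def fourth_weight(word_count, quoted_occurrences):
    """Hypothetical correspondence between contribution degree and weight."""
    if quoted_occurrences >= 2:
        return 1.5   # quoted multiple times: increase the weight
    if word_count <= 1:
        return 0.5   # single-word keyword: reduce the weight
    return 1.0


def second_score(sentence_score, position_score, first_score, weight=1.0):
    """Add the three scores (fifth numerical value), then apply the weight."""
    fifth_value = sentence_score + position_score + first_score
    return fifth_value * weight
```

For example, `second_score(0.4, 0.27, 0.1, fourth_weight(1, 0))` halves the sum 0.77 because the keyword is a single word that is never quoted.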
(4-2): the server determines the score of the candidate phrase according to the second score of each keyword included in the candidate phrase.
The server determines the score of the candidate phrase through a third preset algorithm according to the second score of each keyword included in the candidate phrase.
The third preset algorithm may be set and changed as needed, and in the embodiment of the present invention, the third preset algorithm is not specifically limited. For example, the third preset algorithm may be multiplication, addition, subtraction, division, weighted multiplication, weighted division, or maximum value.
When the third preset algorithm is addition, the step may be:
the server adds the second scores of all keywords included in the candidate phrase to obtain the score of the candidate phrase.
When the third preset algorithm is to find the maximum value, the step may be:
the server selects the maximum score from the second scores of each keyword included in the candidate phrase, and takes the maximum score as the score of the candidate phrase.
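The two example choices for the third preset algorithm, addition and maximum value, can be sketched as follows. The function name `phrase_score` and the `method` parameter are illustrative assumptions.

```python
def phrase_score(second_scores, method="sum"):
    """Combine the second scores of a phrase's keywords into one score."""
    if not second_scores:
        raise ValueError("a candidate phrase includes at least one keyword")
    if method == "sum":
        return sum(second_scores)
    if method == "max":
        return max(second_scores)
    raise ValueError(f"unsupported method: {method}")
```

Addition favours longer phrases, since every keyword adds to the total, whereas the maximum value makes the score independent of phrase length.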
For the second implementation, this step may be implemented by the following steps (4-a) to (4-c), including:
(4-a): the server determines a first weight corresponding to a sentence in which the candidate phrase is located, a second weight corresponding to the position of the sentence in which the candidate phrase is located, and a third weight corresponding to each keyword included in the candidate phrase.
The server prestores the first weight corresponding to the sentence in which the candidate phrase is located and the second weight corresponding to the position of the sentence in which the candidate phrase is located; in this step, the server acquires the stored first weight and second weight.
The server stores a correspondence between keywords and third weights; correspondingly, the step of acquiring, by the server, the third weight corresponding to each keyword included in the candidate phrase may be:
the server acquires, according to each keyword included in the candidate phrase, the third weight corresponding to that keyword from the correspondence between keywords and third weights.
It should be noted that the third weights corresponding to different keywords may be the same or different. For example, the first weight a1 of the sentence in which the candidate phrase is located is 0.1, the second weight a2 of the position of the sentence in which the candidate phrase is located is 0.55, and the third weight a3 of each keyword included in the candidate phrase is 0.35.
(4-b): for each keyword included in the candidate phrase, the server multiplies the score of the sentence in which the candidate phrase is located by the first weight to obtain a first numerical value, multiplies the score of the position of the sentence in which the candidate phrase is located by the second weight to obtain a second numerical value, multiplies the first score of the keyword by the third weight corresponding to the keyword to obtain a third numerical value, and adds the first numerical value, the second numerical value and the third numerical value to obtain a second score of the keyword.
Likewise, a keyword consisting of a single word contributes little to the understanding of the article. Therefore, when the server determines the second score of a single-word keyword, the server reduces the weight of the keyword. A word that appears in quotation marks multiple times in the text information contributes greatly to the understanding of the article. Therefore, when the server determines the second score of a keyword that appears in quotation marks multiple times, the server increases the weight of the keyword. Correspondingly, the step of adding, by the server, the first numerical value, the second numerical value, and the third numerical value to obtain the second score of the keyword may be:
the server determines the contribution degree of the keyword; determines a fourth weight corresponding to the keyword according to the contribution degree; adds the first numerical value, the second numerical value, and the third numerical value to obtain a fourth numerical value; and multiplies the fourth numerical value by the fourth weight to obtain the second score of the keyword.
(4-c): the server determines the score of the candidate phrase according to the second score of each keyword.
This step is the same as step (4-2), and will not be described herein again.
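The weighted combination of step (4-b), including the optional fourth weight, can be sketched as follows. The default weights are the example values a1 = 0.1, a2 = 0.55 and a3 = 0.35 from the text, with a2 taken as the weight of the sentence position; the function name and parameter names are hypothetical.

```python
def weighted_second_score(sentence_score, position_score, first_score,
                          a1=0.1, a2=0.55, a3=0.35, fourth_weight=1.0):
    """Weighted sum of the three scores, scaled by an optional fourth weight."""
    first_value = sentence_score * a1
    second_value = position_score * a2
    third_value = first_score * a3
    fourth_value = first_value + second_value + third_value
    return fourth_value * fourth_weight
```

Because the three example weights sum to 1, a keyword whose three input scores are all 1.0 receives a second score of 1.0 before the fourth weight is applied.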
Step 203: the server selects a preset number of candidate phrases with the highest scores from the candidate phrase set based on the score of each candidate phrase.
The server sorts the candidate phrases by score from high to low, and outputs the preset number of top-ranked candidate phrases.
The preset number may be set and changed as needed, and in the embodiment of the present invention, the preset number is not specifically limited. For example, the preset number may be 8 or 10, etc.
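Step 203 amounts to a top-N selection, which can be sketched with the standard library. The function name `select_top_phrases` and the dictionary representation of phrase scores are assumptions of the sketch.

```python
import heapq


def select_top_phrases(phrase_scores, preset_number=8):
    """Return the preset_number phrases with the highest scores, best first.

    phrase_scores: dict mapping each candidate phrase to its score.
    """
    return heapq.nlargest(preset_number, phrase_scores, key=phrase_scores.get)
```

If the candidate phrase set holds fewer phrases than the preset number, all of them are returned.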
In the embodiment of the present invention, the candidate phrases may include keywords that carry no meaning of their own, such as auxiliary words, prepositions, modal particles, and numerals. Therefore, after the server selects the preset number of candidate phrases, the server filters out the keywords of a second preset part of speech from the preset number of candidate phrases.
The second preset part of speech and the first preset part of speech may be the same or different; this is not specifically limited in the embodiment of the present invention. For example, the keywords of the second preset part of speech may be auxiliary words, prepositions, modal particles, or numerals.
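A minimal sketch of this filtering, assuming each candidate phrase is a list of (word, part-of-speech) pairs. The tag letters "u", "p", "y" and "m" are hypothetical labels for auxiliary words, prepositions, modal particles and numerals; the actual labels depend on the word segmentation tool in use.

```python
# Hypothetical part-of-speech labels for the second preset part of speech.
SECOND_PRESET_POS = frozenset({"u", "p", "y", "m"})


def filter_keywords(phrase, excluded_pos=SECOND_PRESET_POS):
    """Drop keywords whose part of speech is in the excluded set."""
    return [(word, pos) for word, pos in phrase if pos not in excluded_pos]
```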
In the embodiment of the present invention, the server may set concept tag information and event tag information; the concept tag information includes the most central concept phrases of the text information, and the event tag information includes the core events in the text information. After step 203 is performed, the server generates the concept tag information through the following step 204 and the event tag information through the following step 205.
Step 204: the server selects candidate phrases of the concept type from the preset number of candidate phrases to form the concept tag information.
Wherein the candidate phrase of the concept type refers to a candidate phrase containing a noun.
Step 205: the server selects candidate phrases of the event type from a preset number of candidate phrases to form event tag information.
Wherein the candidate phrase of the event type refers to a candidate phrase containing a verb.
It should be noted that step 204 and step 205 do not have a time sequence, and step 204 may be executed first, and then step 205 may be executed; step 205 may be performed first, and then step 204 may be performed.
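Steps 204 and 205 can be sketched as two independent selections over the same phrases. The sketch assumes each phrase is a list of (word, part-of-speech) pairs and uses the hypothetical labels "n" for nouns and "v" for verbs.

```python
def split_tags(phrases, noun="n", verb="v"):
    """Return (concept_tags, event_tags) from the selected candidate phrases."""
    concept = [p for p in phrases if any(pos == noun for _, pos in p)]
    event = [p for p in phrases if any(pos == verb for _, pos in p)]
    return concept, event
```

Under this literal reading, a phrase containing both a noun and a verb appears in both lists, which is one reason a later correction pass can be useful.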
The server may make errors when classifying candidate phrases into the concept tag information and the event tag information. Therefore, the server may also perform phrase correction, and the specific process may be as follows:
the server moves the candidate phrases in the concept tag information that end with a keyword of a third preset part of speech to the event tag information; and/or moves the candidate phrases in the event tag information that do not contain a keyword of the third preset part of speech to the concept tag information.
The third preset part of speech may be set and changed as needed, and in the embodiment of the present invention, the third preset part of speech is not specifically limited. For example, the third preset part of speech may be a verb.
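The correction pass can be sketched as below, with the third preset part of speech taken to be a verb as in the example. Phrases are again assumed to be lists of (word, part-of-speech) pairs, and the label "v" and the function name are hypothetical.

```python
def correct_tags(concept, event, pos="v"):
    """Move misclassified phrases between concept and event tag information."""
    def ends_with_pos(phrase):
        return bool(phrase) and phrase[-1][1] == pos

    def contains_pos(phrase):
        return any(tag == pos for _, tag in phrase)

    new_concept = [p for p in concept if not ends_with_pos(p)]
    new_event = [p for p in event if contains_pos(p)]
    for p in concept:                      # concept phrases ending in a verb
        if ends_with_pos(p) and p not in new_event:
            new_event.append(p)
    for p in event:                        # event phrases without any verb
        if not contains_pos(p) and p not in new_concept:
            new_concept.append(p)
    return new_concept, new_event
```

The membership checks avoid duplicating a phrase that was already present in the destination list.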
After the server extracts the tag information, the server stores the correspondence between the identifier of the text information and the tag information of the text information. The terminal can acquire the tag information from the server; the specific process may be as follows:
the terminal sends an acquisition request to the server, where the acquisition request carries the identifier of the text information whose tag information is to be acquired. The server receives the acquisition request sent by the terminal, acquires the tag information of the text information from the correspondence between identifiers and tag information according to the identifier of the text information, and sends the tag information of the text information to the terminal. The terminal receives the tag information of the text information sent by the server and displays it, so that the user can quickly understand the subject matter of the text information based on its tag information. The identifier of the text information may be a name, a URL, a storage path, a number, or the like of the text information.
In the existing method, the tag information extracted by LDA (Latent Dirichlet Allocation) is unary tag information, whereas in the embodiment of the present invention the tag information is extracted based on a syntax tree and is multi-element tag information, and both concept tag information and event tag information are extracted.
In the embodiment of the invention, the text information is segmented to obtain a candidate phrase set, the candidate phrase set comprises at least one candidate phrase, and label information is extracted based on the candidate phrases in the candidate phrase set, so that multi-element label information can be extracted. And for each candidate phrase in the candidate phrase set, determining the score of the sentence in which the candidate phrase is positioned, determining the score of the position of the sentence in which the candidate phrase is positioned, determining the first score of each keyword included in the candidate phrase, and determining the score of the candidate phrase according to the score of the sentence in which the candidate phrase is positioned, the score of the position of the sentence in which the candidate phrase is positioned and the first score of each keyword included in the candidate phrase. Due to the combination of sentence scoring, position scoring and keyword scoring, the scoring accuracy of the candidate phrases is improved, and the accuracy of extracting the tag information is improved.
An embodiment of the present invention provides an apparatus for extracting tag information. The apparatus is applied to a server and is configured to perform the steps performed by the server in the above method for extracting tag information. Referring to fig. 3, the apparatus includes:
a word segmentation module 301, configured to segment words of text information to obtain a candidate phrase set, where the candidate phrase set includes at least one candidate phrase, and each candidate phrase includes at least one keyword;
a scoring module 302, configured to determine, for each candidate phrase in the candidate phrase set, a score of a sentence in which the candidate phrase is located, a score of a position of the sentence in which the candidate phrase is located, and a first score of each keyword included in the candidate phrase, and determine a score of the candidate phrase according to the score of the sentence in which the candidate phrase is located, the score of the position of the sentence in which the candidate phrase is located, and the first score of each keyword included in the candidate phrase;
a selecting module 303, configured to select, based on the score of each candidate phrase, a preset number of candidate phrases with the highest scores from the candidate phrase set;
a composing module 304, configured to compose the preset number of candidate phrases into tag information of the text information.
In a possible implementation manner, the scoring module 302 is further configured to determine, for each keyword, a first occurrence number and a second occurrence number, where the first occurrence number is the occurrence number of the keyword in the text information, and the second occurrence number is the total occurrence number of all keywords included in the text information; determine the word frequency of the keyword according to the first occurrence number and the second occurrence number; determine a first quantity and a second quantity, where the first quantity is the quantity of sample text information included in a sample text information base, and the second quantity is the quantity of text information including the keyword in the sample text information base; determine the inverse document frequency of the keyword according to the first quantity and the second quantity; and determine a first score of the keyword according to the word frequency and the inverse document frequency.
In a possible implementation manner, the scoring module 302 is further configured to determine a first position, in the text information, of the sentence in which the candidate phrase is located, and a second position of the candidate phrase in the paragraph; and determine the score of the position of the sentence in which the candidate phrase is located according to the first position and the second position.
In a possible implementation manner, the scoring module 302 is further configured to determine a first weight corresponding to a sentence in which the candidate phrase is located, a second weight corresponding to a position of the sentence in which the candidate phrase is located, and a third weight corresponding to each keyword included in the candidate phrase; for each keyword included in the candidate phrase, multiplying the score of the sentence in which the candidate phrase is located by the first weight to obtain a first numerical value, multiplying the score of the position of the sentence in which the candidate phrase is located by the second weight to obtain a second numerical value, multiplying the first score of the keyword by a third weight corresponding to the keyword to obtain a third numerical value, and adding the first numerical value, the second numerical value and the third numerical value to obtain a second score of the keyword; and determining the score of the candidate phrase according to the second score of each keyword.
In a possible implementation manner, the scoring module 302 is further configured to determine a contribution degree of the keyword, and determine a fourth weight corresponding to the keyword according to the contribution degree of the keyword; adding the first numerical value, the second numerical value and the third numerical value to obtain a fourth numerical value; and multiplying the fourth numerical value and the fourth weight to obtain a second score of the keyword.
In a possible implementation manner, the composition module 304 is further configured to select candidate phrases of concept types from the preset number of candidate phrases to compose concept tag information; and selecting candidate phrases of the event type from the preset number of candidate phrases to form event tag information.
In one possible implementation, the apparatus further includes:
a moving module, configured to move a candidate phrase ending with a keyword of a preset part of speech in the concept tag information to the event tag information; and/or,
the moving module is further configured to move a candidate phrase, which does not include the keyword of the preset part of speech, in the event tag information to the concept tag information.
In a possible implementation manner, the word segmentation module 301 is further configured to perform sentence segmentation on the text information to obtain at least one candidate sentence, and form the at least one candidate sentence into a candidate sentence set; segmenting each candidate sentence in the candidate sentence set to obtain at least one keyword, and forming the at least one keyword into a keyword set; generating at least one candidate phrase from the keywords in the keyword set based on a syntax tree algorithm; and forming the at least one candidate phrase into the candidate phrase set.
In one possible implementation, the word segmentation module 301 is further configured to determine a sentence component of each candidate sentence in the candidate sentence set; and deleting the candidate sentences of which the sentence components are preset components in the candidate sentence set according to the sentence components of each candidate sentence.
In the embodiment of the invention, the text information is segmented to obtain a candidate phrase set, the candidate phrase set comprises at least one candidate phrase, and label information is extracted based on the candidate phrases in the candidate phrase set, so that multi-element label information can be extracted. And for each candidate phrase in the candidate phrase set, determining the score of the sentence in which the candidate phrase is positioned, determining the score of the position of the sentence in which the candidate phrase is positioned, determining the first score of each keyword included in the candidate phrase, and determining the score of the candidate phrase according to the score of the sentence in which the candidate phrase is positioned, the score of the position of the sentence in which the candidate phrase is positioned and the first score of each keyword included in the candidate phrase. Due to the combination of sentence scoring, position scoring and keyword scoring, the scoring accuracy of the candidate phrases is improved, and the accuracy of extracting the tag information is improved.
It should be noted that: in the above embodiment, when extracting the tag information, the apparatus for extracting the tag information is illustrated by only dividing the functional modules, and in practical applications, the function distribution may be completed by different functional modules according to needs, that is, the internal structure of the apparatus is divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus for extracting tag information and the method for extracting tag information provided by the above embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments and are not described herein again.
Fig. 4 illustrates a server for extracting tag information according to an example embodiment. Referring to fig. 4, server 400 includes a processing component 422, which further includes one or more processors, and memory resources, represented by memory 432, for storing instructions, such as applications, that are executable by processing component 422. The application programs stored in memory 432 may include one or more modules that each correspond to a set of instructions. Further, the processing component 422 is configured to execute instructions to perform the functions performed by the server in the above-described method of extracting tag information.
The server 400 may also include a power component 426 configured to perform power management of the server 400, a wired or wireless network interface 450 configured to connect the server 400 to a network, and an input/output (I/O) interface 458. The server 400 may operate based on an operating system stored in the memory 432, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
An embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium may be a computer-readable storage medium contained in the memory in the foregoing embodiment; or it may be a computer-readable storage medium that exists separately and is not assembled into a server. The computer readable storage medium stores one or more programs for use by one or more processors in performing a method of extracting tag information.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.