CN108052503B - Confidence coefficient calculation method and device

Info

Publication number: CN108052503B
Application number: CN201711428423.3A
Authority: CN
Inventors: 刘兵
Original assignee: Beijing QIYI Century Science and Technology Co Ltd
Current assignee: Beijing QIYI Century Science and Technology Co Ltd
Priority date: 2017-12-26
Filing date: 2017-12-26
Publication date: 2021-04-27
Anticipated expiration: 2037-12-26
Also published as: CN108052503A

Abstract

The invention provides a confidence calculation method and device. A text to be analyzed is acquired, at least one statistical dimension is determined, a statistical dimension value of the text is calculated for each statistical dimension, and the confidence that the text is a title entity is calculated from these statistical dimension values and a confidence calculation formula. The method thereby meets the need, left unaddressed in the prior art, for a way to calculate the confidence that a text to be analyzed is a title entity.

Description

Confidence coefficient calculation method and device

Technical Field

The invention relates to the technical field of multimedia, in particular to a confidence coefficient calculation method and device.

Background

Named entity recognition identifies named entities with specific meanings in text, such as person names, place names, and organization names. It is a fundamental task in natural language processing and plays a vital role in information retrieval, question answering systems, semantic search, knowledge base construction, and related fields.

A machine learning model may be used to identify a named entity, where an entity dictionary, which is a dictionary of words, is used in the identification process.

When a text to be analyzed, such as "mother", is a drama name entity, the confidence that the text is a drama name entity can be added to the entity dictionary for each such text. When this confidence is greater than a preset value, the text can be regarded as a drama name entity, which improves the recognition accuracy for drama name entities, for example the recognition of "mother" as a drama name entity. Here, a drama name entity is text that denotes the name of a film or television work.

Therefore, a method capable of calculating the confidence of the text to be analyzed as the entity of the drama name is needed.

Disclosure of Invention

In view of the above, the present invention provides a confidence calculation method and apparatus, so as to meet the need for a method of calculating the confidence that a text to be analyzed is a title entity.

In order to solve the technical problems, the invention adopts the following technical scheme:

a method of confidence calculation, comprising:

acquiring a text to be analyzed;

determining at least one statistical dimension;

calculating a statistical dimension value corresponding to each statistical dimension of the text to be analyzed;

and calculating the confidence coefficient of the text to be analyzed as the entity of the title according to the statistical dimension value corresponding to each statistical dimension and a confidence coefficient calculation formula.

Preferably, the statistical dimensions include:

the number of times of the text to be analyzed appearing in the non-video text set, the number of times of the text to be analyzed appearing in the video text set, the click entropy of the text to be analyzed which is taken as search content within a preset time, the Boolean value of the text to be analyzed, and the character length value.

Preferably, when the statistical dimension is a boolean value of the text to be analyzed, calculating a statistical dimension value corresponding to each statistical dimension of the text to be analyzed includes:

searching by taking the text to be analyzed as a search word;

determining an entity name Boolean value of the text to be analyzed according to the result of whether preset words exist in the search result;

obtaining a first word segmentation Boolean value of the text to be analyzed according to the result of whether the text to be analyzed can be subjected to word segmentation;

obtaining a second word segmentation Boolean value of the text to be analyzed according to whether each word segmentation result in the word segmentation results of the text to be analyzed is a word result;

the statistical dimension value corresponding to the boolean value of the text to be analyzed comprises the entity name boolean value, the first segmentation boolean value and the second segmentation boolean value.

Preferably, the process of obtaining the confidence calculation formula includes:

acquiring a plurality of texts to be trained; each text to be trained comprises a confidence coefficient that the text to be trained serves as the entity of the title;

training an initial confidence coefficient calculation formula according to the plurality of texts to be trained to obtain the confidence coefficient calculation formula;

wherein the initial confidence calculation formula is generated based on a logistic regression algorithm.

Preferably, training an initial confidence coefficient calculation formula according to the plurality of texts to be trained to obtain the confidence coefficient calculation formula, includes:

determining a weight value of each statistical dimension in an initial confidence coefficient calculation formula according to the texts to be trained and the initial confidence coefficient calculation formula;

and generating the confidence coefficient calculation formula according to the weight value of each statistical dimension in the determined initial confidence coefficient calculation formula.

A confidence calculation apparatus, comprising:

the first acquisition module is used for acquiring a text to be analyzed;

a dimension determination module for determining at least one statistical dimension;

the first calculation module is used for calculating a statistical dimension value corresponding to each statistical dimension of the text to be analyzed;

and the second calculation module is used for calculating the confidence coefficient of the text to be analyzed as the title entity according to the statistical dimension value corresponding to each statistical dimension and the confidence coefficient calculation formula.

Preferably, the statistical dimensions include:

Preferably, when the statistical dimension is a boolean value of the text to be analyzed, the first calculation module includes:

the search submodule is used for searching by taking the text to be analyzed as a search word;

the first determining submodule is used for determining an entity name Boolean value of the text to be analyzed according to the result of whether preset words exist in the search result;

the second determining submodule is used for obtaining a first word segmentation Boolean value of the text to be analyzed according to the result that whether the text to be analyzed can be subjected to word segmentation or not;

the third determining submodule is used for obtaining a second word segmentation Boolean value of the text to be analyzed according to whether each word segmentation result in the word segmentation results of the text to be analyzed is a word result;

Preferably, the apparatus further comprises:

the second acquisition module is used for acquiring a plurality of texts to be trained; each text to be trained comprises a confidence coefficient that the text to be trained serves as the entity of the title;

the training module is used for training an initial confidence coefficient calculation formula according to the plurality of texts to be trained to obtain the confidence coefficient calculation formula;

Preferably, the training module comprises:

the weight determining submodule is used for determining the weight value of each statistical dimension in the initial confidence coefficient calculation formula according to the texts to be trained and the initial confidence coefficient calculation formula;

and the generation submodule is used for generating the confidence coefficient calculation formula according to the weight value of each statistical dimension in the determined initial confidence coefficient calculation formula.

Compared with the prior art, the invention has the following beneficial effects:

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a flow chart of a method for calculating confidence according to the present invention;

FIG. 2 is a flow chart of another method of confidence level calculation provided by the present invention;

FIG. 3 is a flow chart of a method for calculating confidence level according to another embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a confidence calculation apparatus according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Through accumulated experience, the inventors have observed the following characteristics:

a) the more frequently a character string occurs in non-video-industry text corpora, the lower its confidence as a drama name entity, because its ambiguity is greater, for example: "rouge", "decrypt", etc.;

b) the more often a string successfully matches film and television title templates in video-industry text, the higher its confidence as a drama name entity; in example templates such as "*", "episode *", "high definition version", "* cast", and "movie version", "*" is a wildcard;

c) the larger the click entropy of the search engine results (a measure of whether click behavior is concentrated or scattered), the lower the confidence as a drama name entity, because for an unambiguous drama name string (e.g., "Langya list") users are more likely to click a single album result, whereas for an ambiguous drama name string (e.g., "hackers") user clicks are more scattered;

d) the longer the string, the higher its confidence as a drama name entity, for example: the confidence of "youth that we will elapse for the end" is higher and the confidence of "youth" is lower.

Therefore, on the basis of the characteristics, the technical scheme of the invention is provided.

Specifically, an embodiment of the present invention provides a method for calculating a confidence level, and with reference to fig. 1, the method includes:

S11, acquiring a text to be analyzed;

The text to be analyzed is the text that needs to be identified, for example "mother" or "shaoshanshuai".

S12, determining at least one statistical dimension;

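Specifically, on the basis of this embodiment, the statistical dimensions may include: the number of times the text to be analyzed appears in the non-video text set, the number of times it appears in the video text set, the click entropy of the text when used as search content within a preset time, the Boolean value of the text, and the character length value.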

These several statistical dimensions are presented in turn below.

1. The number of times the text to be analyzed appears in the non-video text set;

The non-video text set, denoted C1, is built from a large number of literary works, news texts, and the like, avoiding video-industry text as far as possible, so that the frequency with which a drama name string occurs in a non-drama-name sense can be counted in text from outside the entertainment industry. For example, "rouge" is an ambiguous TV series name. In a collection of non-video-industry corpora (e.g., novels, People's Daily), an occurrence of "rouge" is very unlikely to denote a TV series name; that is, "rouge" most probably appears in these corpora as a non-drama name.

After the non-video text set is obtained, the number of occurrences of the text to be analyzed in it is counted; this count is set as the first number of times, denoted Freq1(e).
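As a rough illustration of the Freq1(e) count, the following Python sketch sums substring occurrences over a small in-memory corpus. The function name `freq1`, the list-of-strings representation of C1, and the sample sentences are assumptions made for illustration; the source does not specify how occurrences are matched or how the corpus is stored.

```python
from typing import Sequence


def freq1(text: str, non_video_corpus: Sequence[str]) -> int:
    """Count non-overlapping occurrences of `text` across all documents in C1."""
    if not text:
        # str.count("") would count every position, so treat empty input as zero.
        return 0
    return sum(document.count(text) for document in non_video_corpus)


# Toy stand-in for the non-video text set C1 (invented sentences).
c1 = [
    "she bought rouge and powder at the market",
    "the rouge was a deep red",
    "a quiet morning in the village",
]
print(freq1("rouge", c1))
```

A production system would more likely count over tokenized text or an inverted index, but the quantity computed is the same.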

2. The number of times that the text to be analyzed appears in the video text set;

A video company holds a huge amount of video-industry text, including video titles, synopses, video comments, search logs, and so on. The collection of all video titles and search logs is used as the video text set, denoted C2. When a drama name appears in C2 it usually shares common contextual features, and a set of common drama name entity context feature templates is compiled manually, for example: "episode x", "season x", "high definition version", "cast list", "movie version", "new version", etc.

Specifically, the number of times the text to be analyzed appears in the video text set is counted according to a preset word combination logic and a preset word combination order, and is set as the second number of times.

The preset word combination logic means that the text to be analyzed and a drama name entity context feature template appear together. The preset word combination order means that the order in which the text to be analyzed and the drama name entity context feature template appear in the video text set conforms to a preset order.

It should be noted that the position of the text to be analyzed relative to the template may differ from one drama name entity context feature template to another, or may be the same.

For example, in one template the text to be analyzed appears in the middle, between the fixed parts of the template, while with the template "1986 edition" the text to be analyzed appears after the template text.

The occurrences that satisfy both conditions are counted: the text to be analyzed and a drama name entity context feature template appear together, and the text is positioned relative to the template as the preset order requires (for example, ahead of the template text). This count is set as the second number of times, denoted Freq2(e).
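The template-based count can be sketched with regular expressions, where each template fixes both co-occurrence and order. This is only one possible reading: the `TEMPLATES` list, the `{e}` placeholder convention, and the sample lines are hypothetical and are not the templates used by the inventors.

```python
import re
from typing import Sequence

# Hypothetical templates: "{e}" marks where the text to be analyzed must appear,
# so each pattern encodes both "appears together" and "appears in this order".
TEMPLATES = [
    r"《{e}》",
    r"{e} episode \d+",
    r"{e} season \d+",
    r"{e} HD version",
    r"\d{4} edition {e}",
]


def freq2(text: str, video_corpus: Sequence[str], templates: Sequence[str] = TEMPLATES) -> int:
    """Count template matches for `text` across all lines of the video text set C2."""
    if not text:
        return 0
    escaped = re.escape(text)  # treat the text literally inside each pattern
    patterns = [re.compile(t.replace("{e}", escaped)) for t in templates]
    return sum(len(p.findall(line)) for line in video_corpus for p in patterns)


# Toy stand-in for the video text set C2 (invented lines).
c2 = [
    "《rouge》 full version",
    "rouge episode 12",
    "1986 edition rouge",
    "buy rouge online",
]
print(freq2("rouge", c2))
```

The last line of `c2` contains the text but matches no template, so it does not contribute to the count.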

3. The text to be analyzed is used as the click entropy of the search content within the preset time;

Entropy is a concept from information theory. Here it measures how search click behavior is distributed and is called click entropy. Users' click behavior in search results differs from one search term to another. For example, when "Langya list" is searched in a video search engine, most users click the result for the drama "Langya list" ranked first or in the top few positions, while only a few users click lower-ranked results such as "Langya list segment" and "Langya list catkin"; that is, user clicks tend to be "concentrated". If the search term is "hacker", the clicked results are not concentrated on one result but scattered over several or more results: some users click the TV series "hacker", and some click short videos related to "hacker", for example: "China minimal hacker only 12 years old invades the school system without doing homework". That is, user clicks tend to be "scattered". The present invention uses click entropy to quantify this degree of "concentration" or "dispersion", defined by the following equation:

$$H(e) = -\sum_{d \in D(e)} p(d)\,\log p(d)$$

where D(e) denotes all search result documents for the text to be analyzed e, d denotes one of those result documents, and p(d) denotes the probability that a user clicks result document d.

The texts to be analyzed and the user click behavior are readily obtained from the logs of the video search engine, and the click entropy H(e) of e can be calculated from the definition above.

It should be noted that the click entropy in this embodiment is the click entropy within a preset time, where the preset time may be 7 days, 15 days, or 30 days.
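A minimal sketch of the click entropy calculation follows, assuming the search log for one query within the chosen time window has already been reduced to a list of clicked document identifiers. The function name, the log representation, and the choice of base-2 logarithm are assumptions; the source does not state a logarithm base.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def click_entropy(clicked_docs: Sequence[Hashable]) -> float:
    """Shannon entropy of the empirical click distribution over result documents."""
    counts = Counter(clicked_docs)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p(d) is estimated as (clicks on d) / (all clicks for this query).
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Invented click logs: one query with concentrated clicks, one with scattered clicks.
concentrated = ["series"] * 9 + ["clip"]
scattered = ["series", "clip_a", "clip_b", "news"]
print(click_entropy(concentrated), click_entropy(scattered))
```

Concentrated clicks give a value near zero, while clicks spread evenly over four results give the maximum for four documents, matching the "concentrated" versus "scattered" intuition in the text.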

4. The Boolean value and the character length value of the text to be analyzed.

The calculated character length value is denoted Len(e).

This embodiment gives several examples of statistical dimensions; statistics can be gathered along the dimensions listed here and the confidence calculated from them.

S13, calculating the confidence that the text to be analyzed is a drama name entity according to the statistical dimension value corresponding to each statistical dimension and the confidence calculation formula.

The confidence represents the probability that the text to be analyzed is a drama name entity. The higher the probability, the more likely the text to be analyzed is a drama name entity, and the lower its ambiguity as a drama name entity.

In this embodiment, a text to be analyzed is acquired, at least one statistical dimension is determined, a statistical dimension value of the text is calculated for each statistical dimension, and the confidence that the text is a title entity is calculated from these values and a confidence calculation formula. The method thereby meets the need, left unaddressed in the prior art, for a way to calculate the confidence that a text to be analyzed is a title entity.

In addition, in this embodiment the confidence calculation method assists dictionary-based recognition, which can improve the results of named entity recognition.

Optionally, on the basis of any of the above embodiments, with reference to fig. 2, when the statistical dimension is a boolean value of the text to be analyzed, calculating a statistical dimension value corresponding to each statistical dimension of the text to be analyzed includes:

S21, searching with the text to be analyzed as the search word;

The search may be performed on a web page, in a mobile phone application (APP), or in other software.

S22, determining an entity name Boolean value of the text to be analyzed according to whether preset words exist in the search results;

Specifically, searching for the text to be analyzed on the search page yields the corresponding search results, of which there may be several. A preset number of top-ranked results is then selected; for example, if this preset number (the second data value) is 5, the first five search results are selected.

The entity name Boolean value of the text to be analyzed is calculated according to whether a preset word appears in each selected search result. The preset words may be at least one preset video text in a preset video text set.

Specifically, the preset video text set covers complete films, TV series, variety shows, and other videos that have been officially released, but does not include short videos uploaded by users. The entity name Boolean value is denoted Album(e).

The entity name Boolean value takes one of two values, 1 or 0.

Specifically, the entity name Boolean value is 1 when at least one preset video text in the preset video text set appears in the search results, and 0 when no preset video text in the set appears in the search results.
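The Album(e) rule can be sketched as a membership test over the top-ranked results. In the sketch below the search results are plain strings, matching is by substring, and the parameter `top_n` plays the role of the preset number of results; all of these, along with the sample data, are illustrative assumptions rather than details given in the source.

```python
from typing import Iterable, Sequence


def album_flag(search_results: Sequence[str],
               official_titles: Iterable[str],
               top_n: int = 5) -> int:
    """Album(e): 1 if any official video title appears in the top-N results, else 0."""
    titles = list(official_titles)
    top_results = search_results[:top_n]
    hit = any(title in result for result in top_results for title in titles)
    return 1 if hit else 0


# Invented data: a preset video text set and two ranked result lists.
official = ["Rouge [official TV series]"]
results_with_album = ["Rouge [official TV series] episode 1", "rouge makeup tutorial"]
results_without_album = ["rouge makeup tutorial", "how to choose rouge"]
print(album_flag(results_with_album, official), album_flag(results_without_album, official))
```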

S23, obtaining a first word segmentation Boolean value of the text to be analyzed according to the result of whether the text to be analyzed can be subjected to word segmentation;

The first word segmentation Boolean value is denoted Single(e). Word segmentation software may be used to segment the text to be analyzed and obtain a word segmentation result.

Specifically, step S23 includes the following steps:

1) judging, according to the word segmentation result of the text to be analyzed, whether the text to be analyzed is a text that cannot be segmented further;

2) if the text to be analyzed is judged to be a text that cannot be segmented further, setting the first word segmentation Boolean value to a first preset value;

Specifically, the first preset value may be 1 or True. If the text to be analyzed is "dawn", then because "dawn" cannot be segmented further and its segmentation result is "dawn" itself, "dawn" is considered a text that cannot be segmented further, and the first word segmentation Boolean value is set to 1 or True.

3) if the text to be analyzed is judged to be a text that can be segmented further, setting the first word segmentation Boolean value to a second preset value;

Specifically, the second preset value may be 0 or False. If the text to be analyzed is "hacker empire", then because "hacker empire" can be segmented further, it is not a text that cannot be segmented further, and the first word segmentation Boolean value is set to 0 or False.

S24, obtaining a second word segmentation Boolean value of the text to be analyzed according to whether each item in the word segmentation result of the text to be analyzed is a single character;

The second word segmentation Boolean value is denoted AllChar(e).

Specifically, step S24 may include:

1) judging, according to the word segmentation result of the text to be analyzed, whether each word in the word segmentation result is a single character;

Here, a single character is one standalone character.

2) if every word in the word segmentation result is judged to be a single character, setting the second word segmentation Boolean value to a third preset value;

Specifically, the third preset value may be 1 or True. If the text to be analyzed is "flower-thousand-bone", then because its segmentation result is flower/thousand/bone, every word in the segmentation result is a single character, and the second word segmentation Boolean value is set to 1 or True.

3) if at least one word in the word segmentation result is judged not to be a single character, setting the second word segmentation Boolean value to a fourth preset value.

Specifically, the fourth preset value may be 0 or False. If the text to be analyzed is "hacker empire", then because its segmentation result is hacker/empire, not every word in the segmentation result is a single character, and the second word segmentation Boolean value is set to 0 or False.

It should be noted that the statistical dimension value corresponding to the boolean value of the text to be analyzed includes the entity name boolean value, the first segmentation boolean value, and the second segmentation boolean value.

And after the Boolean value of the entity name, the Boolean value of the first participle and the Boolean value of the second participle are obtained through calculation, the statistical dimension value corresponding to the Boolean value of the text to be analyzed can be determined.

In this embodiment, an entity name boolean value, a first participle boolean value, and a second participle boolean value are obtained by analyzing a search result and a participle result of a text to be analyzed, and a method for calculating the entity name boolean value, the first participle boolean value, and the second participle boolean value is provided.
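The two segmentation flags of steps S23 and S24 can be sketched as simple checks on a segmenter's output. The sketch takes the token list as input rather than calling any particular segmentation tool. The Chinese strings below are assumed back-translations of the examples "flower-thousand-bone", "hacker empire", and "dawn", and the token lists are what a segmenter might plausibly return; neither is taken from the source.

```python
from typing import Sequence


def single_flag(tokens: Sequence[str]) -> int:
    """Single(e): 1 if the text came back as one unsplit token, else 0."""
    return 1 if len(tokens) == 1 else 0


def all_char_flag(tokens: Sequence[str]) -> int:
    """AllChar(e): 1 if every token is exactly one character, else 0."""
    return 1 if tokens and all(len(token) == 1 for token in tokens) else 0


# Assumed segmentation results, keyed by the text to be analyzed.
examples = {
    "花千骨": ["花", "千", "骨"],    # splits into single characters
    "黑客帝国": ["黑客", "帝国"],    # splits into two multi-character words
    "黎明": ["黎明"],               # cannot be segmented further
}
for text, tokens in examples.items():
    print(text, single_flag(tokens), all_char_flag(tokens))
```

Keeping the segmenter outside these functions means the flags can be computed with whatever segmentation software is available.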

Optionally, on the basis of any of the above embodiments, with reference to fig. 3, the process of obtaining the confidence coefficient calculation formula includes:

S31, obtaining a plurality of texts to be trained;

Each text to be trained carries a confidence that it is a drama name entity.

Specifically, the statistical dimension values of the text to be analyzed along the several statistical dimensions are extracted in the steps above, and these values form a feature set:

F(e) = [Freq1(e), Freq2(e), H(e), Album(e), Len(e), Single(e), AllChar(e)].

Calculating the confidence value can therefore be cast as a regression problem in machine learning, to which methods such as linear regression, logistic regression, and neural networks apply. The output of logistic regression is a probability value (between 0 and 1), which can be taken directly as the confidence (the probability that the text to be analyzed e is a drama name entity). The experimental system of the present invention uses logistic regression to calculate the confidence.
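To make the feature set concrete, the sketch below packs the seven statistical dimension values into one record in the order given by F(e). The `Features` type, the `build_features` helper, and the sample values are hypothetical; Len(e) is taken here as the number of characters in the text, which is an assumption about how length is measured.

```python
from typing import NamedTuple


class Features(NamedTuple):
    """Feature set F(e), in the order Freq1, Freq2, H, Album, Len, Single, AllChar."""
    freq1: int
    freq2: int
    entropy: float
    album: int
    length: int
    single: int
    all_char: int


def build_features(text: str, freq1: int, freq2: int, entropy: float,
                   album: int, single: int, all_char: int) -> Features:
    """Assemble F(e) from precomputed dimension values; Len(e) is derived from the text."""
    return Features(freq1, freq2, entropy, album, len(text), single, all_char)


# Invented dimension values for one text to be analyzed.
f = build_features("rouge", freq1=120, freq2=3, entropy=1.8, album=1, single=1, all_char=0)
print(list(f))
```

In practice the raw counts would probably be rescaled (for example by a logarithm) before training, since they differ in magnitude from the Boolean dimensions; the source does not say whether this is done.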

Logistic regression is a supervised machine learning method and requires labeled data. A batch of texts to be trained (denoted T) is labeled manually; each text to be trained is a pair <e, c>, where e is a drama name entity and c is the confidence (a real value between 0 and 1) of e as a drama name entity. During labeling, the ambiguity of each text to be trained is examined and a real value between 0 and 1 is assigned manually from experience, for example: 1.0 for a completely unambiguous text (e.g., "mei gong he act"); 0.9 for a slightly ambiguous text (e.g., "huaqiang gu" and "langa bang", which may refer to a game name in some cases); 0.5 for a more ambiguous text (e.g., "crouchi long"); 0.3 for a still more ambiguous text (e.g., "prefecture book"); and 0.1 for a highly ambiguous text (e.g., "mom"). The labeled confidence of each text to be trained is not perfectly accurate; only a relatively accurate value can be judged and assigned from experience. Once the number of texts to be trained is large enough, the trained model acquires statistical significance (that is, tends toward accuracy), which is the underlying idea of statistical machine learning.

S32, training an initial confidence coefficient calculation formula according to the plurality of texts to be trained to obtain the confidence coefficient calculation formula;

Optionally, on the basis of this embodiment, step S32 may specifically include:

1) determining a weight value of each statistical dimension in an initial confidence coefficient calculation formula according to the texts to be trained and the initial confidence coefficient calculation formula;

2) and generating the confidence coefficient calculation formula according to the weight value of each statistical dimension in the determined initial confidence coefficient calculation formula.

Specifically, based on the logistic regression algorithm, the initial confidence calculation formula may be defined as:

$$C(e) = g\left(\theta^{T} F(e)\right), \qquad g(z) = \frac{1}{1 + \exp(-z)}$$

where C(e) denotes the confidence of the drama name entity word e, F(e) denotes the feature set of the drama name entity word, g is the sigmoid function, and θ is the parameter to be learned by the model, which is in effect the weight of each statistical dimension in the feature set F. θ may be assigned randomly at first, and its value is obtained by training on the labeled texts to be trained T. That is, the weight of each statistical dimension can be calculated from the texts to be trained.

After the model parameter θ is obtained through training, the confidence calculation formula is obtained, and the model can then be used to predict (calculate) the confidence. Specifically, once the statistical dimension value corresponding to each statistical dimension is obtained, the values can be substituted into the confidence calculation formula to calculate the confidence that the text to be analyzed is a title entity, and whether the text to be analyzed is a title entity can then be judged from that confidence.
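The training and prediction steps can be sketched in plain Python as logistic regression fitted by batch gradient descent on the cross-entropy loss, which accepts fractional labels such as 0.9 or 0.3. Several choices here are assumptions not stated in the source: the bias term `theta[0]`, zero initialization instead of random initialization (chosen so the run is repeatable), the learning rate and epoch count, and the two-feature toy data (a scaled click entropy and the Album flag).

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function g(z)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def confidence(theta: Sequence[float], features: Sequence[float]) -> float:
    """C(e) = g(theta . F(e)); theta[0] is an assumed bias term."""
    z = theta[0] + sum(w * x for w, x in zip(theta[1:], features))
    return sigmoid(z)


def train(samples: Sequence[Tuple[Sequence[float], float]],
          lr: float = 0.1, epochs: int = 2000) -> List[float]:
    """Fit theta by batch gradient descent on cross-entropy with soft labels c in [0, 1]."""
    n_features = len(samples[0][0])
    theta = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for x, c in samples:
            err = confidence(theta, x) - c  # gradient of the loss with respect to z
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        theta = [t - lr * g / len(samples) for t, g in zip(theta, grad)]
    return theta


# Invented labeled set T: ([scaled click entropy, Album flag], labeled confidence).
T = [([0.1, 1.0], 1.0), ([0.2, 1.0], 0.9), ([0.6, 1.0], 0.5),
     ([0.8, 0.0], 0.3), ([0.9, 0.0], 0.1)]
theta = train(T)
low_entropy = confidence(theta, [0.1, 1.0])
high_entropy = confidence(theta, [0.9, 0.0])
print(low_entropy, high_entropy)
```

With the full feature set, each sample would carry all seven dimensions of F(e), and the resulting confidence could be compared against a preset threshold to decide whether the text is treated as a title entity.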

In the embodiment, a method for generating a confidence coefficient calculation formula is provided, so that the confidence coefficient of the text to be analyzed can be calculated according to the generated confidence coefficient calculation formula.

Optionally, on the basis of the above embodiments of the confidence calculation method, another embodiment of the present invention provides a confidence calculation apparatus. With reference to FIG. 4, the confidence calculation apparatus may include:

a first obtaining module 101, configured to obtain a text to be analyzed;

a dimension determination module 102 for determining at least one statistical dimension;

the first calculating module 103 is configured to calculate a statistical dimension value corresponding to each statistical dimension of the text to be analyzed;

and the second calculating module 104 is configured to calculate a confidence degree that the text to be analyzed serves as the title entity according to the statistical dimension value corresponding to each statistical dimension and the confidence degree calculating formula.

Further, the statistical dimensions include:

It should be noted that, for the working process of each module in this embodiment, please refer to the corresponding description in the above embodiments, which is not described herein again.

Optionally, on the basis of an embodiment of any one of the above confidence degree calculation apparatuses, when the statistical dimension is a boolean value of the text to be analyzed, the first calculation module includes:

It should be noted that, for the working process of each sub-module in this embodiment, please refer to the corresponding description in the above embodiment, which is not described herein again.

Optionally, on the basis of an embodiment of any one of the above confidence calculation apparatuses, the apparatus further includes:

Further, the training module includes:

It should be noted that, for the working processes of each module and sub-module in this embodiment, please refer to the corresponding description in the above embodiments, which is not described herein again.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for calculating a confidence level, comprising:

acquiring a text to be analyzed;

determining at least one statistical dimension;

calculating a statistical dimension value corresponding to each statistical dimension of the text to be analyzed, wherein the calculation comprises the following steps: when the statistical dimension is the Boolean value of the text to be analyzed, determining a statistical dimension value corresponding to the Boolean value of the text to be analyzed according to whether a preset word result exists in a search result of the text to be analyzed as a search word, whether the text to be analyzed can be segmented, and whether each segmentation result in the segmentation result of the text to be analyzed is a word result;

2. The computing method of claim 1, wherein the statistical dimensions comprise:

3. The calculation method according to claim 2, wherein determining the statistical dimension value corresponding to the boolean value of the text to be analyzed according to whether a result of a preset word exists in the search results of the text to be analyzed as the search word, whether the text to be analyzed can be segmented, and whether each segmentation result in the segmentation results of the text to be analyzed is a word result, comprises:

searching by taking the text to be analyzed as a search word;

4. The calculation method according to claim 1, wherein the process of obtaining the confidence calculation formula comprises:

5. The calculation method according to claim 4, wherein training an initial confidence calculation formula according to the plurality of texts to be trained to obtain the confidence calculation formula comprises:

6. An apparatus for calculating confidence, comprising:

the first acquisition module is used for acquiring a text to be analyzed;

the first calculating module is configured to calculate a statistical dimension value corresponding to each statistical dimension of the text to be analyzed, where the calculating module includes: when the statistical dimension is the Boolean value of the text to be analyzed, determining a statistical dimension value corresponding to the Boolean value of the text to be analyzed according to whether a preset word result exists in a search result of the text to be analyzed as a search word, whether the text to be analyzed can be segmented, and whether each segmentation result in the segmentation result of the text to be analyzed is a word result;

7. The computing device of claim 6, wherein the statistical dimensions comprise:

8. The computing device of claim 7, wherein the first computing module comprises:

9. The computing device of claim 6, further comprising:

10. The computing device of claim 9, wherein the training module comprises: