Automatic question-answering method and device
Technical Field
The present invention relates to the technical field of web search, and more particularly to an automatic question-answering method and device.
Background
Question-and-answer (Q&A) communities have gradually developed within web search. A Q&A community is an internet product in which users ask and answer questions, and in which users and data are organized according to these question-answer relationships so that users can search them. Answers contributed by users alone cannot fully satisfy the demand created by users' questions, so most Q&A communities now also provide an automatic question-answering function, in which a background server automatically supplies an answer to a user's question.
At present, automatic question answering is mainly implemented in two ways:
1) Within a specific knowledge field, the user's question is automatically analyzed according to a preset analysis method, and an answer is extracted from existing answers.
2) An answer is matched in a large predefined knowledge base.
The first approach, which analyzes the question and extracts the answer within a specific knowledge field, is confined to that knowledge field and is therefore limited.
The second approach, which matches an answer in a large predefined knowledge base, can only answer as well as the amount of data stored in advance in the knowledge base allows, and it cannot automatically answer questions that fall outside the scope of the knowledge base.
In short, in the prior art, automatic question answering must rely on a specific knowledge field or a knowledge base; questions beyond that knowledge field or knowledge base cannot be answered automatically.
Summary of the Invention
In view of this, the present invention provides an automatic question-answering method and device that can realize automatic question answering from the existing user Q&A data of a Q&A community. To achieve the above object, the technical solution of the present invention is implemented as follows:
An automatic question-answering method, comprising:
obtaining relevant existing user Q&A data according to a question string input by a user terminal;
counting the word frequency of the centre words in the summary part of the existing user Q&A data;
calculating the word weight of each centre word according to the word frequency of that centre word and the inverse document frequency of that centre word counted in advance, and determining the centre word with the largest word weight as the answer word;
determining, according to the answer word, the automatic question-answering answer corresponding to the question string.
Preferably, obtaining relevant existing user Q&A data according to the question string input by the user terminal comprises:
inputting the question string as a retrieval string into the search engine of the Q&A community, and obtaining query results corresponding to the retrieval string, each query result including a title part and a summary part with distinctive marks.
Preferably, counting the word frequency of the centre words in the summary part of the existing user Q&A data comprises:
counting the centre-word word frequency of the summary part of each query result one by one, until all query results have been counted;
wherein, for each query result, the summary part is split into sentences at full stops, the word frequency of each centre word is counted for each sentence, and the word frequencies of the centre words in all sentences are accumulated to obtain the word frequency of all centre words in the summary.
Preferably, accumulating the word frequencies of the centre words in all sentences to obtain the word frequency of all centre words in the summary comprises:
if a sentence contains a word with a distinctive mark, accumulating the word frequency of each centre word in that sentence at 3 times the standard weight; if the sentence immediately before or after that sentence contains a word with a distinctive mark, accumulating the word frequency of each centre word in that sentence at 2 times the standard weight; otherwise, accumulating the word frequency of each centre word in that sentence at the standard weight, thereby obtaining the weighted word frequency of all centre words in the sentence.
Preferably, counting the centre-word word frequency of the summary part of each query result one by one, until all query results have been counted, comprises:
comparing the similarity between the title part of each query result and the question string; if the similarity between the title of the current query result and the question string is greater than a preset threshold, performing the step of counting the centre-word word frequency; otherwise, skipping that step for the current query result.
Preferably, calculating the word weight of each centre word comprises:
word weight of a centre word = word frequency of the centre word × inverse document frequency of the centre word.
Preferably, determining the automatic question-answering answer corresponding to the question string according to the answer word comprises:
finding, among the summaries of the query results, the first s summaries in which the answer word appears, s being an integer greater than or equal to 1;
splitting each of the s summaries into sentences at full stops, and finding among these sentences the sentence in which the answer word appears and which contains the largest number of centre words of the user's question string, as the automatic question-answering answer corresponding to the question string.
An automatic question-answering device, comprising:
a Q&A data acquisition module, configured to obtain relevant existing user Q&A data according to a question string input by a user terminal;
a word frequency statistics module, configured to count the word frequency of the centre words in the summary part of the existing user Q&A data;
an answer word determining module, configured to calculate the word weight of each centre word according to the word frequency of that centre word and the inverse document frequency of that centre word counted in advance, and to determine the centre word with the largest word weight as the answer word;
an automatic question-answering answer determining module, configured to determine, according to the answer word, the automatic question-answering answer corresponding to the question string.
Preferably, the Q&A data acquisition module comprises:
a retrieval unit, configured to input the question string as a retrieval string into the search engine of the Q&A community;
an acquiring unit, configured to obtain query results corresponding to the retrieval string, each query result including a title part and a summary part with distinctive marks.
Preferably, the word frequency statistics module comprises:
a cutting unit, configured to split, for each query result, the summary part into sentences at full stops;
a statistics unit, configured to count, for each sentence produced by the cutting unit, the word frequency of each centre word in it;
an accumulating unit, configured to accumulate the word frequencies of the centre words in all sentences counted by the statistics unit, to obtain the word frequency of all centre words in the summary;
a control unit, configured to control the cutting unit, the statistics unit and the accumulating unit to count the centre-word word frequency of the summary part of each query result one by one, until all query results have been counted.
Preferably, the accumulating unit comprises:
a mark judging subunit, configured to judge the distinctive marks in the sentences produced by the cutting unit;
a weight accumulating subunit, configured to accumulate word frequencies according to the judgment of the mark judging subunit: if a sentence contains a word with a distinctive mark, the word frequency of each centre word in that sentence is accumulated at 3 times the standard weight; if the sentence immediately before or after that sentence contains a word with a distinctive mark, the word frequency of each centre word in that sentence is accumulated at 2 times the standard weight; otherwise, the word frequency of each centre word in that sentence is accumulated at the standard weight, thereby obtaining the weighted word frequency of all centre words in the sentence.
Preferably, the word frequency statistics module further comprises:
a similarity comparing unit, configured to compare the similarity between the title part of each query result and the question string;
the control unit is further configured to, if the similarity between the title of the current query result and the question string is greater than a preset threshold, control the cutting unit, the statistics unit and the accumulating unit to perform the step of counting the centre-word word frequency, and otherwise to skip that step for the current query result.
Preferably, the answer word determining module comprises:
a word weight calculating unit, configured to calculate the word weight of each centre word according to the formula: word weight of a centre word = word frequency of the centre word × inverse document frequency of the centre word;
an answer word determining unit, configured to determine the centre word with the largest word weight as the answer word.
Preferably, the automatic question-answering answer determining module comprises:
a summary acquiring unit, configured to find, among the summaries of the query results, the first s summaries in which the answer word appears, s being an integer greater than or equal to 1;
a summary cutting unit, configured to split each of the s summaries into sentences at full stops;
an answer determining unit, configured to find, among the sentences produced by the summary cutting unit, the sentence in which the answer word appears and which contains the largest number of centre words of the user's question string, as the automatic question-answering answer corresponding to the question string.
As can be seen from the above technical solution, the automatic question-answering method and device of the present invention make full use of the existing user Q&A data of a Q&A community. There is no need to build a Q&A knowledge base or to restrict the knowledge field of the user's question; instead, parameters such as word frequency, inverse document frequency and text similarity are used to find, in the existing Q&A data, the answer most relevant to the question posed by the user, realizing fully automatic answering. In addition, the present invention can also be used to perform semantic expansion on general questions or text strings, which can in turn be used for classification, search and the like.
Brief Description of the Drawings
Fig. 1 is a flow chart of the automatic question-answering method of an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the automatic question-answering device of an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of the Q&A data acquisition module of an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the word frequency statistics module of an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of the accumulating unit of an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the answer word determining module of an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of the automatic question-answering answer determining module of an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described below with reference to the accompanying drawings and embodiments.
The present invention mainly uses the existing Q&A data in a Q&A community. A search engine is used to obtain Q&A data retrieval results related to the question string posed by the user; candidate words are selected from these retrieval results according to parameters such as word frequency, inverse document frequency and the similarity between text segments; the weights of these candidate words are calculated and sorted; the candidate word with the largest weight is taken as the answer word; and the sentence containing the answer word is taken as the automatic question-answering answer to the question string posed by the user.
The detailed process is shown in Fig. 1 and includes the following steps:
Step 101: relevant existing user Q&A data are obtained according to the question string input by the user terminal.
The question string posed by the user terminal (denoted q) is obtained and input, as a retrieval string, into the search engine of the Q&A community, and n query results are obtained. Each result includes a title (denoted t_i, i = 1 to n) and a summary with distinctive marks. A distinctive mark in a summary marks a word in the summary that is identical to a word in the question string input by the user terminal; the Q&A community search engine adds these marks when returning search results, in order to prompt the user. The marks are usually rendered in a red font, so a summary with distinctive marks is also called a red-marked summary; the red-marked words in a summary are in fact the words in the summary that also appear in the retrieval string. Of course, depending on the search engine, the obtained query results may use other distinctive marks; all that is required here is a summary with distinctive marks, and the specific form of the mark is arbitrary.
These query results are the existing user Q&A data in the Q&A community that are related to the question string input by the user, where the title is a question related to the question string q input by the user and the summary is the corresponding answer.
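As an illustration of the data obtained in step 101, the following minimal sketch shows one possible in-memory representation of a query result. The class name QueryResult and the use of <em>...</em> tags as the distinctive mark are assumptions made for this example only; the text above leaves the form of the mark open.

```python
import re
from dataclasses import dataclass
from typing import Set

# Assumption: the search engine wraps each matched word in <em>...</em>.
# The source only requires that some distinctive mark is present.
MARK = re.compile(r"<em>(.*?)</em>")


@dataclass
class QueryResult:
    """One query result: title t_i (a related question) and summary a_i (its answer)."""
    title: str
    summary: str  # raw summary text, still containing the distinctive marks

    def marked_words(self) -> Set[str]:
        """Words that the search engine marked as matching the retrieval string."""
        return set(MARK.findall(self.summary))

    def plain_summary(self) -> str:
        """Summary text with the marks removed but the marked words kept."""
        return MARK.sub(r"\1", self.summary)
```

Keeping the raw marked summary alongside helper accessors lets later steps decide, sentence by sentence, whether a marked word is present.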
Step 102: the word frequency of the centre words in the summary part of the existing user Q&A data is counted.
After the query results, i.e. the existing user Q&A data, have been obtained, the n query results are analyzed one by one, and the word frequency of the centre words in the summary part of these existing user Q&A data is calculated, as follows:
Start from the first query result, i.e. i = 1.
First, the similarity between the title part of the query result and the question string q is compared, so that query results with low similarity can be excluded and the number of query results to be analyzed is reduced. If the similarity between the question string q and the title t_i is greater than a preset threshold, the search result is sufficiently related to the question string q and is analyzed; otherwise, processing of this result ends and the next query result is analyzed.
If the similarity between the question string q and the title t_i is greater than the preset threshold, the specific processing is as follows:
The summary part a_i of the query result is split at the full stop "." into m sentences (denoted a_{i,j}, j = 1 to m). For each sentence a_{i,j}, j = 1 to m, the word frequency tf, i.e. the number of occurrences, of each centre word in it is counted (centre words are the words that remain after stop words, high-frequency words and symbols, such as "I", have been excluded). The tf values of the centre words in all m sentences are accumulated to obtain the tf of all centre words in a_i.
Because centre words in a summary that lie near distinctive marks are more strongly related to the question string q, weighted calculation may also be used when counting tf, in order to reflect the differing degrees of relatedness between centre words and the question string q and to obtain a more accurate and reasonable tf. For example, if the sentence a_{i,j} contains a word with a distinctive mark, the tf of each centre word in a_{i,j} is accumulated at 3 times the standard weight; if the sentence immediately before or after a_{i,j} (a_{i,j-1} or a_{i,j+1}) contains a word with a distinctive mark, the tf of each centre word in a_{i,j} is accumulated at 2 times the standard weight; otherwise, the tf of each centre word in a_{i,j} is accumulated at the standard weight, thereby obtaining the weighted word frequency of each centre word in a_{i,j}.
When the word frequency counting is complete, or when the similarity between the question string q and the title t_i is less than or equal to the preset threshold, the analysis of this query result ends and the next query result is processed, i.e. i = i + 1; the above process is repeated until all n query results have been processed. The similarity between the question string q and the title t_i may be calculated using any existing algorithm for the similarity between two texts, for example term proximity scoring.
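The title filter in this step can be sketched as below. Since any text-similarity algorithm may be used, this example substitutes a simple Jaccard overlap of word sets for term proximity scoring; the function names, the whitespace tokenization and the threshold value 0.3 are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the word sets of two texts (a stand-in similarity measure)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def filter_results(q: str,
                   results: List[Tuple[str, str]],
                   threshold: float = 0.3) -> List[Tuple[str, str]]:
    """Keep only the (title, summary) pairs whose title is similar enough to q.

    Results at or below the threshold are skipped, as described for step 102.
    """
    return [(title, summary) for title, summary in results
            if jaccard_similarity(q, title) > threshold]
```

Whitespace tokenization would not suit unsegmented Chinese text; a word segmenter would be needed before computing the overlap in that case.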
Step 103: the word weight W of every centre word is calculated from the tf of each centre word obtained by the above process and the inverse document frequency (idf) of that centre word counted in advance, where W = tf * idf; the word weights W of the centre words are sorted in descending order, and the centre word with the largest word weight W is determined as the answer word.
Here, the inverse document frequency is the reciprocal of the document frequency, and the document frequency is the number of documents in which a given word appears. It can be counted from text collected in advance from the internet; the collection scope is arbitrary, and the text may be collected from specific websites or communities, or directly from the Q&A community in which the automatic question answering is provided.
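Step 103 can be sketched as follows, taking idf as the plain reciprocal of the document frequency as defined above (many systems use a logarithmic idf instead, which is not what this text describes). The function names and the default_idf fallback for words absent from the pre-collected corpus are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple


def inverse_document_frequency(corpus: Iterable[List[str]]) -> Dict[str, float]:
    """idf(w) = 1 / df(w), where df(w) is the number of documents containing w."""
    df: Counter = Counter()
    for doc in corpus:
        df.update(set(doc))  # count each word at most once per document
    return {word: 1.0 / count for word, count in df.items()}


def pick_answer_word(tf: Dict[str, float],
                     idf: Dict[str, float],
                     default_idf: float = 1.0
                     ) -> Tuple[Optional[str], List[Tuple[str, float]]]:
    """Compute W = tf * idf per centre word and return the top word plus the ranking."""
    weights = {word: freq * idf.get(word, default_idf) for word, freq in tf.items()}
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None, []
    return ranked[0][0], ranked
```

With this definition, a word that occurs in many documents receives a small idf, so a frequent but uninformative word is less likely to be chosen as the answer word even when its tf is high.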
Step 104: the automatic question-answering answer corresponding to the question string input by the user terminal is determined according to the determined answer word.
The specific steps are: among the summaries of the n query results, find the first s summaries in which the answer word appears (s is any integer greater than or equal to 1, for example 2); split each of these s summaries into sentences at the full stop "."; then find, among these sentences, the sentence in which the answer word appears and which contains the largest number of centre words of the user question string q, and take it as the automatic question-answering answer. Of course, it is also possible to directly take a sentence containing the answer word, or a summary containing the answer word, as the automatic question-answering answer.
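The sentence selection in step 104 can be sketched as below. The function name select_answer, the substring-based matching and the acceptance of both "." and "。" as the full stop are assumptions for illustration; summaries are assumed to be plain text with marks already removed.

```python
import re
from typing import List, Optional, Set


def select_answer(summaries: List[str],
                  answer_word: str,
                  question_words: Set[str],
                  s: int = 2) -> Optional[str]:
    """Pick the answer sentence from the first s summaries containing the answer word.

    Among the sentences that contain the answer word, the one containing the
    most centre words of the question string is returned (first one wins ties).
    """
    hits = [summary for summary in summaries if answer_word in summary][:s]
    best: Optional[str] = None
    best_count = -1
    for summary in hits:
        for sentence in re.split(r"[。.]", summary):
            sentence = sentence.strip()
            if not sentence or answer_word not in sentence:
                continue
            count = sum(1 for word in question_words if word in sentence)
            if count > best_count:
                best, best_count = sentence, count
    return best
```

Substring matching is a simplification; for languages with word boundaries, matching on segmented words would avoid accidental hits inside longer words.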
In addition, the present invention also provides an automatic question-answering device. As shown in Fig. 2, the device includes:
a Q&A data acquisition module 201, configured to obtain relevant existing user Q&A data according to the question string input by the user;
a word frequency statistics module 202, configured to count the word frequency of the centre words in the summary part of the existing user Q&A data;
an answer word determining module 203, configured to calculate the word weight of each centre word according to the word frequency of that centre word and the inverse document frequency of that centre word counted in advance, and to determine the centre word with the largest word weight as the answer word;
an automatic question-answering answer determining module 204, configured to determine the automatic question-answering answer according to the answer word.
The specific structure of the Q&A data acquisition module 201 is shown in Fig. 3 and includes:
a retrieval unit 301, configured to input the question string as a retrieval string into the search engine of the Q&A community;
an acquiring unit 302, configured to obtain query results corresponding to the retrieval string, each query result including a title part and a summary part with distinctive marks.
The word frequency statistics module 202 is shown in Fig. 4 and includes:
a cutting unit 401, configured to split, for each query result, the summary part into sentences at full stops;
a statistics unit 402, configured to count, for each sentence produced by the cutting unit 401, the word frequency of each centre word in it;
an accumulating unit 403, configured to accumulate the word frequencies of the centre words in all sentences counted by the statistics unit 402, to obtain the word frequency of all centre words in the summary;
a control unit 404, configured to control the cutting unit 401, the statistics unit 402 and the accumulating unit 403 to count the centre-word word frequency of the summary part of each query result one by one, until all query results have been counted.
The accumulating unit 403 is shown in Fig. 5 and includes:
a mark judging subunit 501, configured to judge the distinctive marks in the sentences produced by the cutting unit 401;
a weight accumulating subunit 502, configured to accumulate word frequencies according to the judgment of the mark judging subunit 501: if a sentence contains a word with a distinctive mark, the word frequency of each centre word in that sentence is accumulated at 3 times the standard weight; if the sentence immediately before or after that sentence contains a word with a distinctive mark, the word frequency of each centre word in that sentence is accumulated at 2 times the standard weight; otherwise, the word frequency of each centre word in that sentence is accumulated at the standard weight, thereby obtaining the weighted word frequency of all centre words in the sentence.
As shown in Fig. 4, in another embodiment the word frequency statistics module 202 may further include:
a similarity comparing unit 405, configured to compare the similarity between the title part of each query result and the question string;
the control unit 404 is further configured to, if the similarity between the title of the current query result and the question string is greater than a preset threshold, control the cutting unit, the statistics unit and the accumulating unit to perform the step of counting the centre-word word frequency, and otherwise to skip that step for the current query result.
The answer word determining module 203 is shown in Fig. 6 and includes:
a word weight calculating unit 601, configured to calculate the word weight of each centre word according to the formula: word weight of a centre word = word frequency of the centre word × inverse document frequency of the centre word;
an answer word determining unit 602, configured to determine the centre word with the largest word weight as the answer word.
The automatic question-answering answer determining module 204 is shown in Fig. 7 and includes:
a summary acquiring unit 701, configured to find, among the summaries of the query results, the first s summaries in which the answer word appears, s being an integer greater than or equal to 1;
a summary cutting unit 702, configured to split each of the s summaries into sentences at full stops;
an answer determining unit 703, configured to find, among the sentences produced by the summary cutting unit 702, the sentence in which the answer word appears and which contains the largest number of centre words of the user question string, as the automatic question-answering answer corresponding to the question string.
As can be seen from the above embodiments, the automatic question-answering method and device of the present invention make full use of the existing user Q&A data of a Q&A community. There is no need to build a Q&A knowledge base or to restrict the knowledge field of the user's question; instead, parameters such as word frequency, inverse document frequency and text similarity are used to find, in the existing Q&A data, the answer most relevant to the question posed by the user, realizing fully automatic answering. In addition, the present invention can also be used to perform semantic expansion on general questions or text strings, which can in turn be used for classification, search and the like.
The foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent substitution, improvement and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.