An automatic question-answering method and device
Technical field
The present invention relates to the field of web search technology, and in particular to an automatic question-answering method and device.
Background technology
In current web search, Ask-Answer Communities are gradually growing. An Ask-Answer Community is an internet product in which users participate by asking and answering questions, in which users and data are organized according to these question-answer relations, and which users can search. However, answers provided by users alone cannot fully satisfy users' demand for asking questions, so most Ask-Answer Communities also provide an automatic question-answering function, in which a background server automatically provides an answer to the user's problem.
At present, automatic question answering is mainly implemented in two ways:
1) Within a specific knowledge field, the user's problem is analysed automatically according to a preset analysis method, and the answer is extracted from existing answers.
2) The answer is matched in a large predefined knowledge base.
The first method analyses the problem and extracts the answer within a specific knowledge field; because it is restricted to that field, it has certain limitations.
The second method matches the answer in a large predefined knowledge base; its problem-solving ability depends on the amount of data in the pre-stored knowledge base, and automatic question answering cannot be realized for problems beyond the scope of the knowledge base.
In short, in the prior art, automatic question answering must rely on a specific knowledge field or a knowledge base; for any problem that goes beyond that knowledge field or knowledge base, automatic question answering cannot be realized.
Summary of the invention
In view of this, the present invention provides an automatic question-answering method and device that can realize automatic question answering based on the existing user question-and-answer data of an Ask-Answer Community. To achieve the above purpose, the technical solution of the present invention is realized as follows:
An automatic question-answering method, comprising:
obtaining relevant existing user question-and-answer data according to a problem string input at a user terminal;
counting the word frequency of each centre word in the summary part of said existing user question-and-answer data;
calculating the word weight of each centre word according to the word frequency of that centre word and its inverse document frequency counted in advance, and determining the centre word with the largest word weight as the answer word;
determining, according to said answer word, the automatic question-answering answer corresponding to said problem string.
Preferably, said obtaining relevant existing user question-and-answer data according to the problem string input at the user terminal comprises:
inputting said problem string, as a retrieval string, into the search engine of an Ask-Answer Community, and obtaining the query results corresponding to said retrieval string, each query result comprising a title part and a summary part with distinctive marks.
Preferably, said counting the word frequency of each centre word in the summary part of said existing user question-and-answer data comprises:
counting the centre word frequencies of the summary part of each query result one by one, until all query results have been counted;
wherein, for each query result, its summary part is cut into sentences at each full stop, the word frequency of each centre word is counted for each sentence, and the word frequencies of the centre words in all sentences are accumulated to obtain the word frequency of every centre word in the summary.
Preferably, said accumulating the word frequencies of the centre words in all sentences to obtain the word frequency of every centre word in the summary comprises:
if a sentence contains a word with a distinctive mark, accumulating the word frequency of each centre word in this sentence at 3 times the standard weight; if an adjacent sentence before or after this sentence contains a word with a distinctive mark, accumulating the word frequency of each centre word in this sentence at 2 times the standard weight; otherwise, accumulating the word frequency of each centre word in this sentence at the standard weight, thereby obtaining the weighted word frequency of every centre word in this sentence.
Preferably, said counting the centre word frequencies of the summary part of each query result one by one, until all query results have been counted, comprises:
comparing the similarity between the title part of each query result and said problem string; if the similarity between the title of the current query result and said problem string is greater than a preset threshold, performing said step of counting centre word frequencies; otherwise, skipping the step of counting centre word frequencies for the current query result.
Preferably, said calculating the word weight of each centre word comprises:
word weight of a centre word = word frequency of this centre word * inverse document frequency of this centre word.
Preferably, said determining, according to the answer word, the automatic question-answering answer corresponding to said problem string comprises:
finding, among the summaries of said query results, the s summaries in which the answer word occurs most often, s being an integer greater than or equal to 1;
cutting each of said s summaries into sentences at each full stop; among these sentences, finding the sentence in which the answer word and the largest number of centre words of the user's problem string occur, and taking it as the automatic question-answering answer corresponding to said problem string.
An automatic question-answering device, comprising:
a question-and-answer data acquisition module, configured to obtain relevant existing user question-and-answer data according to a problem string input at a user terminal;
a word frequency statistics module, configured to count the word frequency of each centre word in the summary part of said existing user question-and-answer data;
an answer word determination module, configured to calculate the word weight of each centre word according to the word frequency of that centre word and its inverse document frequency counted in advance, and to determine the centre word with the largest word weight as the answer word;
an automatic question-answering answer determination module, configured to determine, according to said answer word, the automatic question-answering answer corresponding to said problem string.
Preferably, said question-and-answer data acquisition module comprises:
a retrieval unit, configured to input said problem string, as a retrieval string, into the search engine of an Ask-Answer Community;
an acquiring unit, configured to obtain the query results corresponding to said retrieval string, each query result comprising a title part and a summary part with distinctive marks.
Preferably, said word frequency statistics module comprises:
a cutting unit, configured to cut the summary part of each query result into sentences at each full stop;
a statistic unit, configured to count, for each sentence produced by said cutting unit, the word frequency of each centre word in it;
a cumulative unit, configured to accumulate the word frequencies of the centre words in all sentences counted by said statistic unit, to obtain the word frequency of every centre word in the summary;
a control module, configured to control said cutting unit, statistic unit and cumulative unit so that the centre word frequencies of the summary part of each query result are counted one by one, until all query results have been counted.
Preferably, said cumulative unit comprises:
a mark judgment subunit, configured to judge whether the sentences produced by said cutting unit contain distinctive marks;
a weight accumulation subunit, configured to accumulate word frequencies according to the judgment of said mark judgment subunit: if a sentence contains a word with a distinctive mark, the word frequency of each centre word in this sentence is accumulated at 3 times the standard weight; if an adjacent sentence before or after this sentence contains a word with a distinctive mark, the word frequency of each centre word in this sentence is accumulated at 2 times the standard weight; otherwise, the word frequency of each centre word in this sentence is accumulated at the standard weight, thereby obtaining the weighted word frequency of every centre word in this sentence.
Preferably, said word frequency statistics module further comprises:
a similarity comparing unit, configured to compare the similarity between the title part of each query result and said problem string;
said control module is further configured to: if the similarity between the title of the current query result and said problem string is greater than a preset threshold, control said cutting unit, statistic unit and cumulative unit to perform said step of counting centre word frequencies; otherwise, skip the step of counting centre word frequencies for the current query result.
Preferably, said answer word determination module comprises:
a word weight calculation unit, configured to calculate the word weight of each centre word according to the formula: word weight of a centre word = word frequency of this centre word * inverse document frequency of this centre word;
an answer word determining unit, configured to determine the centre word with the largest word weight as the answer word.
Preferably, said automatic question-answering answer determination module comprises:
a summary acquiring unit, configured to find, among the summaries of said query results, the s summaries in which the answer word occurs most often, s being an integer greater than or equal to 1;
a summary cutting unit, configured to cut each of said s summaries into sentences at each full stop;
an answer determining unit, configured to find, among the sentences produced by said summary cutting unit, the sentence in which the answer word and the largest number of centre words of the user's problem string occur, and to take it as the automatic question-answering answer corresponding to said problem string.
As can be seen from the above technical solution, the automatic question-answering method and device of the present invention make full use of the existing user question-and-answer data of an Ask-Answer Community. They need neither a dedicated question-and-answer knowledge base nor any restriction on the knowledge field of the user's problem; instead, they find the answer most relevant to the user's problem in the existing question-and-answer data according to parameters such as word frequency, inverse document frequency and text similarity, and thereby answer automatically. In addition, the present invention can also be used to perform semantic expansion of general problems or text strings, which can be applied to classification, search and the like.
Description of drawings
Fig. 1 is a flowchart of the automatic question-answering method according to an embodiment of the invention;
Fig. 2 is a structural schematic diagram of the automatic question-answering device according to an embodiment of the invention;
Fig. 3 is a structural schematic diagram of the question-and-answer data acquisition module according to an embodiment of the invention;
Fig. 4 is a structural schematic diagram of the word frequency statistics module according to an embodiment of the invention;
Fig. 5 is a structural schematic diagram of the cumulative unit according to an embodiment of the invention;
Fig. 6 is a structural schematic diagram of the answer word determination module according to an embodiment of the invention;
Fig. 7 is a structural schematic diagram of the automatic question-answering answer determination module according to an embodiment of the invention.
Embodiment
To make the purpose, technical solution and advantages of the present invention clearer, the present invention is described in more detail below with reference to the accompanying drawings and embodiments.
The present invention uses the existing question-and-answer data of an Ask-Answer Community. A search engine retrieves the question-and-answer data related to the problem string submitted by the user; candidate words are selected from these retrieval results according to parameters such as word frequency, inverse document frequency and the similarity between text segments; the weights of the candidate words are calculated and sorted; the candidate word with the largest weight is taken as the answer word; and the sentence containing this answer word is taken as the automatic question-answering answer to the user's problem string.
The specific flow, as shown in Fig. 1, comprises the following steps:
Step 101: obtain relevant existing user question-and-answer data according to the problem string input at the user terminal;
Obtain the problem string (denoted q) submitted at the user terminal, and input q as the retrieval string into the search engine of the Ask-Answer Community to obtain n query results. Each result comprises a title (denoted t_i, i = 1, ..., n) and a summary with distinctive marks. A distinctive mark in a summary marks a word that is identical to a word in the problem string input at the user terminal; the Ask-Answer Community search engine adds these marks when it returns the search results, as a prompt to the user. The marks are generally rendered in a red font, so a summary with distinctive marks is also called a red-marked summary; the red-marked words in a summary are exactly the words in the summary that are identical to words in the retrieval string. Of course, depending on the search engine, the query results obtained may use other distinctive marks; all that matters here is that a summary with distinctive marks is obtained, and the specific form of the mark is arbitrary.
These query results are the existing user question-and-answer data in the Ask-Answer Community that are relevant to the problem string input by the user: the title is a problem related to the problem string q, and the summary is the corresponding answer.
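As an illustration of the data obtained in step 101, the following minimal sketch represents one query result and separates the marked words from the summary text. It assumes, purely for illustration, that the search engine delivers distinctive marks as `<em>...</em>` tags; the source leaves the form of the mark open, and the names `QueryResult` and `parse_result` are hypothetical.

```python
import re
from dataclasses import dataclass, field

# Assumption: distinctive marks arrive as <em>...</em> tags around marked words.
MARK = re.compile(r"<em>(.*?)</em>")


@dataclass
class QueryResult:
    title: str    # t_i: an existing problem related to the problem string q
    summary: str  # a_i: the answer text with the mark tags stripped
    marked: set = field(default_factory=set)  # words that carried a distinctive mark


def parse_result(title: str, raw_summary: str) -> QueryResult:
    """Collect the marked words, then strip the tags from the summary."""
    marked = {m.lower() for m in MARK.findall(raw_summary)}
    plain = MARK.sub(r"\1", raw_summary)
    return QueryResult(title=title, summary=plain, marked=marked)


r = parse_result(
    "What is the capital of France",
    "The <em>capital</em> of <em>France</em> is Paris. It lies on the Seine.",
)
print(r.summary)
print(sorted(r.marked))
```

Keeping the marked words in a separate set lets later steps test for marks per sentence without parsing the markup again.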
Step 102: count the word frequency of the centre words in the summary part of said existing user question-and-answer data;
After the query results, that is, the existing user question-and-answer data, have been obtained, the n query results are analysed one by one, and the word frequency of the centre words in the summary part of these data is calculated, as follows:
Start from the first query result, i.e. i = 1;
First, the similarity between the title part of the query result and the problem string q may be compared, and query results with low similarity excluded, so as to reduce the number of query results that need to be analysed. If the similarity between problem string q and title t_i is greater than a preset threshold, this search result is sufficiently relevant to problem string q and needs to be analysed; otherwise, processing of this result ends and the next query result is analysed;
If the similarity between problem string q and title t_i is greater than the preset threshold, the specific processing is as follows:
The summary part a_i of this query result is cut at each full stop "." into m sentences (denoted a_{i,j}, j = 1, ..., m). For each sentence a_{i,j} (j = 1, ..., m), count the word frequency tf, i.e. the number of occurrences, of each centre word in it (centre words are the words that remain after stop words, high-frequency words and symbols, such as "I", are excluded). The tf values of the centre words in all m sentences are then accumulated to obtain the tf of every centre word in a_i.
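The unweighted counting just described can be sketched as follows. This is an illustrative sketch only: it assumes whitespace-separated English text rather than text that needs word segmentation, and the stop-word list `STOP_WORDS` and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"i", "the", "a", "is", "of", "it", "on", "in", "and"}


def split_sentences(summary: str) -> list[str]:
    """Cut a summary a_i at each full stop into sentences a_{i,j}."""
    return [s.strip() for s in summary.split(".") if s.strip()]


def centre_words(sentence: str) -> list[str]:
    """Keep the words that remain after stop words and symbols are removed."""
    tokens = (t.strip(",;:!?\"'()").lower() for t in sentence.split())
    return [t for t in tokens if t and t not in STOP_WORDS]


def summary_tf(summary: str) -> Counter:
    """Accumulate the tf of every centre word over all sentences of a summary."""
    tf = Counter()
    for sentence in split_sentences(summary):
        tf.update(centre_words(sentence))
    return tf


tf = summary_tf("Paris is the capital of France. Paris lies on the Seine.")
print(tf.most_common())
```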
Because the centre words with distinctive marks in the summary are more strongly correlated with problem string q, weighted counting may also be used when counting tf, so as to reflect the differing degrees of correlation between centre words and q and to obtain a more accurate and reasonable tf. For example, if sentence a_{i,j} contains a word with a distinctive mark, the tf of each centre word in a_{i,j} is accumulated at 3 times the standard weight; if an adjacent sentence before or after a_{i,j} (a_{i,j-1} or a_{i,j+1}) contains a word with a distinctive mark, the tf of each centre word in a_{i,j} is accumulated at 2 times the standard weight; otherwise, the tf of each centre word in a_{i,j} is accumulated at the standard weight. This yields the weighted word frequency of each centre word in a_{i,j}.
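The 3x / 2x / 1x weighting rule can be sketched as below. The sketch assumes that each sentence has already been reduced to a list of its centre words and that the set of marked words is known; the function names are hypothetical and the standard weight is taken to be 1.

```python
from collections import Counter


def sentence_weights(sentences, marked):
    """Return one weight per sentence: 3 if the sentence itself contains a
    marked word, 2 if only an adjacent sentence does, otherwise 1."""
    has_mark = [any(w in marked for w in s) for s in sentences]
    weights = []
    for j, hit in enumerate(has_mark):
        before = j > 0 and has_mark[j - 1]
        after = j + 1 < len(has_mark) and has_mark[j + 1]
        if hit:
            weights.append(3)
        elif before or after:
            weights.append(2)
        else:
            weights.append(1)
    return weights


def weighted_tf(sentences, marked):
    """Accumulate each centre word's tf, scaled by the weight of its sentence."""
    tf = Counter()
    for words, weight in zip(sentences, sentence_weights(sentences, marked)):
        for w in words:
            tf[w] += weight
    return tf


sentences = [
    ["paris", "capital", "france"],  # contains marked words
    ["city", "seine"],               # adjacent to a marked sentence
    ["museums", "famous"],           # no marked neighbour
    ["tourists", "visit"],
]
marked = {"capital", "france"}
weights = sentence_weights(sentences, marked)
tf = weighted_tf(sentences, marked)
print(weights, dict(tf))
```

Computing the per-sentence mark flags once, before weighting, keeps the neighbour test independent of the order in which sentences are processed.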
When the word frequency statistics are finished, or when the similarity between problem string q and title t_i is less than or equal to the preset threshold, the analysis of this query result ends and the next query result is processed, i.e. let i = i + 1, and the above processing is repeated until all n query results have been handled. The similarity between problem string q and title t_i may be calculated with any existing algorithm for the similarity between two texts, for example term proximity scoring.
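The title filter in this loop can be sketched as follows. Since the text allows any text-similarity algorithm, the sketch substitutes a simple Jaccard overlap of word sets for term proximity scoring; that substitution, the threshold value 0.4 and the function names are assumptions made for illustration.

```python
def jaccard(a: str, b: str) -> float:
    """Stand-in similarity: shared words divided by all distinct words."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def select_results(q, results, threshold=0.4):
    """Keep only the (title, summary) pairs whose title similarity to q
    is strictly greater than the preset threshold."""
    return [r for r in results if jaccard(q, r[0]) > threshold]


q = "capital of france"
results = [
    ("what is the capital of france", "Paris is the capital."),
    ("best pizza in rome", "Try the margherita."),
]
kept = select_results(q, results)
print(kept)
```

Only the results that pass this filter go on to the word frequency counting; the others are skipped.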
Step 103: according to the tf of each centre word counted by the above process and the inverse document frequency (idf) of the centre word counted in advance, calculate the word weight W of every centre word, where W = tf * idf; sort the word weights W of the centre words in descending order, and determine the centre word with the largest W as the answer word.
Here, the inverse document frequency is the inverse of the document frequency, and the document frequency is the number of documents in which a given word occurs. It can be counted in advance from text collected on the internet; the scope of collection is arbitrary: text may be collected from specific websites or communities, or directly from the Ask-Answer Community in which the automatic question answering is provided.
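Step 103 can be sketched as below. The sketch follows the text literally and takes idf as 1/df; a logarithmic form such as log(N/df) is common in information retrieval but is not what is stated here. Treating a word absent from the pre-collected corpus as having df = 1 is an assumption of this sketch, and the function names are hypothetical.

```python
def document_frequency(corpus):
    """Count, for each word, the number of documents in which it occurs."""
    df = {}
    for doc in corpus:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    return df


def answer_word(tf, df):
    """Compute W = tf * idf with idf = 1 / df, sort in descending order,
    and return the top word together with the full ranking."""
    # Assumption: words missing from the corpus are treated as df = 1.
    weights = {w: f * (1.0 / df.get(w, 1)) for w, f in tf.items()}
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[0][0], ranked


corpus = [
    ["capital", "city"],
    ["capital", "france"],
    ["capital", "paris"],
    ["city", "river"],
]
df = document_frequency(corpus)
word, ranked = answer_word({"capital": 6, "paris": 5, "city": 2}, df)
print(word, ranked)
```

In this example "capital" has the highest raw tf, but it occurs in three of the four documents, so the rarer word "paris" ends up with the largest weight.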
Step 104: determine, according to the determined answer word, the automatic question-answering answer corresponding to the problem string input at the user terminal.
Specifically: among the summaries of the n query results, find the s summaries in which the answer word occurs most often (s is any integer greater than or equal to 1, for example 2); cut each of these s summaries at each full stop "." into sentences; then, among these sentences, find the sentence in which the answer word and the largest number of centre words of the user's problem string q occur, and take it as the automatic question-answering answer. Of course, it is also possible to determine directly a sentence containing the answer word, or a summary containing the answer word, as the automatic question-answering answer.
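The selection in step 104 can be sketched as follows. The sketch reads the rule as "the sentence must contain the answer word, and among such sentences the one sharing the most centre words with q wins"; that reading, the whitespace tokenization and the function names are assumptions of this sketch.

```python
def tokens(text):
    """Lower-cased words with surrounding punctuation stripped."""
    return [t.strip(".,;:!?").lower() for t in text.split()]


def pick_answer(summaries, answer, question_words, s=2):
    """Take the s summaries in which the answer word occurs most often, cut
    them at full stops, and return the sentence that contains the answer
    word and the most centre words of the problem string."""
    top = sorted(summaries, key=lambda a: tokens(a).count(answer), reverse=True)[:s]
    best, best_score = None, -1
    for summary in top:
        for sentence in (x.strip() for x in summary.split(".")):
            toks = tokens(sentence)
            if answer not in toks:
                continue
            score = len(set(toks) & question_words)
            if score > best_score:
                best, best_score = sentence, score
    return best  # None if no sentence contains the answer word


summaries = [
    "Paris is the capital of France. Paris is large.",
    "I love Paris.",
    "Rome is nice.",
]
best = pick_answer(summaries, "paris", {"capital", "france"}, s=2)
print(best)
```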
In addition, the present invention also provides an automatic question-answering device which, as shown in Fig. 2, comprises:
a question-and-answer data acquisition module 201, configured to obtain relevant existing user question-and-answer data according to the problem string input by the user;
a word frequency statistics module 202, configured to count the word frequency of each centre word in the summary part of said existing user question-and-answer data;
an answer word determination module 203, configured to calculate the word weight of each centre word according to the word frequency of that centre word and its inverse document frequency counted in advance, and to determine the centre word with the largest word weight as the answer word;
an automatic question-answering answer determination module 204, configured to determine the automatic question-answering answer according to said answer word.
The specific structure of said question-and-answer data acquisition module 201, as shown in Fig. 3, comprises:
a retrieval unit 301, configured to input said problem string, as a retrieval string, into the search engine of the Ask-Answer Community;
an acquiring unit 302, configured to obtain the query results corresponding to said retrieval string, each query result comprising a title part and a summary part with distinctive marks.
Said word frequency statistics module 202, as shown in Fig. 4, comprises:
a cutting unit 401, configured to cut the summary part of each query result into sentences at each full stop;
a statistic unit 402, configured to count, for each sentence produced by said cutting unit 401, the word frequency of each centre word in it;
a cumulative unit 403, configured to accumulate the word frequencies of the centre words in all sentences counted by said statistic unit 402, to obtain the word frequency of every centre word in the summary;
a control module 404, configured to control said cutting unit 401, statistic unit 402 and cumulative unit 403 so that the centre word frequencies of the summary part of each query result are counted one by one, until all query results have been counted.
Said cumulative unit 403, as shown in Fig. 5, comprises:
a mark judgment subunit 501, configured to judge whether the sentences produced by said cutting unit 401 contain distinctive marks;
a weight accumulation subunit 502, configured to accumulate word frequencies according to the judgment of said mark judgment subunit 501: if a sentence contains a word with a distinctive mark, the word frequency of each centre word in this sentence is accumulated at 3 times the standard weight; if an adjacent sentence before or after this sentence contains a word with a distinctive mark, the word frequency of each centre word in this sentence is accumulated at 2 times the standard weight; otherwise, the word frequency of each centre word in this sentence is accumulated at the standard weight, thereby obtaining the weighted word frequency of every centre word in this sentence.
As shown in Fig. 4, in another embodiment, said word frequency statistics module 202 may further comprise:
a similarity comparing unit 405, configured to compare the similarity between the title part of each query result and said problem string;
said control module 404 is further configured to: if the similarity between the title of the current query result and said problem string is greater than a preset threshold, control said cutting unit, statistic unit and cumulative unit to perform said step of counting centre word frequencies; otherwise, skip the step of counting centre word frequencies for the current query result.
Said answer word determination module 203, as shown in Fig. 6, comprises:
a word weight calculation unit 601, configured to calculate the word weight of each centre word according to the formula: word weight of a centre word = word frequency of this centre word * inverse document frequency of this centre word;
an answer word determining unit 602, configured to determine the centre word with the largest word weight as the answer word.
Said automatic question-answering answer determination module 204, as shown in Fig. 7, comprises:
a summary acquiring unit 701, configured to find, among the summaries of said query results, the s summaries in which the answer word occurs most often, s being an integer greater than or equal to 1;
a summary cutting unit 702, configured to cut each of said s summaries into sentences at each full stop;
an answer determining unit 703, configured to find, among the sentences produced by said summary cutting unit 702, the sentence in which the answer word and the largest number of centre words of the user's problem string occur, and to take it as the automatic question-answering answer corresponding to the problem string.
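How modules 201 to 204 cooperate can be sketched as a simple composition. This is only one possible arrangement: the class name `AutoAnswerDevice`, the callable signatures and the toy stand-ins passed in at the end are hypothetical and serve only to show the order in which the modules are invoked.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AutoAnswerDevice:
    acquire: Callable      # module 201: problem string -> query results
    count_tf: Callable     # module 202: (problem string, results) -> word frequencies
    pick_word: Callable    # module 203: word frequencies -> answer word
    pick_answer: Callable  # module 204: (problem string, results, answer word) -> answer

    def answer(self, question):
        """Run the four modules in the order of steps 101 to 104."""
        results = self.acquire(question)
        tf = self.count_tf(question, results)
        word = self.pick_word(tf)
        return self.pick_answer(question, results, word)


# Toy stand-ins for the four modules, used only to exercise the wiring.
device = AutoAnswerDevice(
    acquire=lambda q: [("capital of france", "Paris is the capital of France")],
    count_tf=lambda q, rs: {"paris": 3, "capital": 1},
    pick_word=lambda tf: max(tf, key=tf.get),
    pick_answer=lambda q, rs, w: next(a for _, a in rs if w in a.lower()),
)
result = device.answer("capital of france")
print(result)
```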
As can be seen from the above embodiments, the automatic question-answering method and device of the present invention make full use of the existing user question-and-answer data of an Ask-Answer Community. They need neither a dedicated question-and-answer knowledge base nor any restriction on the knowledge field of the user's problem; instead, they find the answer most relevant to the user's problem in the existing question-and-answer data according to parameters such as word frequency, inverse document frequency and text similarity, and thereby answer automatically. In addition, the present invention can also be used to perform semantic expansion of general problems or text strings, which can be applied to classification, search and the like.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.