CN109446527A

CN109446527A - Meaningless corpus analysis method and system

Info

Publication number: CN109446527A
Application number: CN201811260440.5A
Authority: CN
Inventors: 魏誉荧
Original assignee: Guangdong Genius Technology Co Ltd
Current assignee: Guangdong Genius Technology Co Ltd
Priority date: 2018-10-26
Filing date: 2018-10-26
Publication date: 2019-03-08
Anticipated expiration: 2038-10-26
Also published as: CN109446527B

Abstract

The invention provides a method and a system for analyzing meaningless linguistic data, wherein the method comprises the following steps: acquiring a meaningless corpus, and summarizing a corpus regular expression according to the meaningless corpus; obtaining a meaningless judgment condition for judging the corpus according to the corpus regular expression; acquiring a user statement; when the user statement meets the judgment condition, judging that the user statement is meaningless; and when the user sentence is meaningless, analyzing the keywords of the user sentence and/or extracting the effective backbone of the user sentence to carry out intention recommendation and/or voice guidance. The method can accurately and quickly identify the meaningless corpus input by the user, and then can extract keywords or effective stems to conjecture the user intention from the corpus so as to carry out recommendation.

Description

A method and system for analyzing meaningless corpus

Technical field

The present invention relates to the technical field of language recognition, and in particular to a method and system for analyzing meaningless corpus.

Background art

In existing voice interaction, while the microphone is collecting the user's speech, problems such as ambient noise around the user or several people talking at once often cause the microphone to pick up meaningless speech fragments. Speech recognition is then performed on these fragments, which yields meaningless corpus.

However, once an interactive system has obtained such meaningless corpus, it usually struggles to handle it. It cannot process the meaningless corpus effectively, extract the user's true intention, and take appropriate measures; instead, its replies become confused and irrelevant. When the user actually wants effective service, this annoys the user, because it is not the information the user wanted the interactive system to obtain.

In this situation, on the one hand, the interactive system has to analyze and recognize every collected utterance one by one, and a large amount of meaningless speech mixed in significantly affects its processing, for example by slowing it down. On the other hand, the system cannot correctly identify the user's intention and therefore cannot give correct feedback, which degrades the user experience. A method capable of analyzing meaningless corpus is therefore needed.

Summary of the invention

The object of the present invention is to provide a method and system for analyzing meaningless corpus that accurately and quickly identify meaningless corpus input by the user, and can then extract keywords or an effective trunk from that corpus to infer the user's intention and make recommendations.

Technical solution provided by the invention is as follows:

The present invention provides a method for analyzing meaningless corpus, characterized by comprising:

obtaining a meaningless corpus, and summarizing corpus regular expressions from the meaningless corpus;

obtaining, from the corpus regular expressions, a decision condition for determining that corpus is meaningless;

obtaining a user sentence;

when the user sentence meets the decision condition, determining that the user sentence is meaningless;

after determining that the user sentence is meaningless, analyzing the keywords of the user sentence and/or extracting the effective trunk of the user sentence to perform intent recommendation and/or voice guidance.

Further, obtaining the meaningless corpus and summarizing corpus regular expressions from the meaningless corpus specifically comprises:

obtaining the meaningless corpus, and segmenting the corpus samples in the meaningless corpus by a word segmentation technique to obtain the words contained in the corpus samples and their corresponding parts of speech;

summarizing corpus regular expressions from corpus features, the corpus features comprising the words and the parts of speech.

Further, obtaining, from the corpus regular expressions, the decision condition for determining that corpus is meaningless specifically comprises:

counting the types and quantities of the parts of speech contained in the corpus regular expressions;

analyzing the types and quantities of the parts of speech in all corpus regular expressions to obtain the decision condition for determining that corpus is meaningless, the decision condition being that the quantity of words of one or more parts of speech reaches a threshold;

converting the parts of speech contained in the decision condition and the corresponding words into a semantic slot.

Further, after obtaining the user sentence, and before determining that the user sentence is meaningless when the user sentence meets the decision condition, the method comprises:

segmenting the user sentence by the word segmentation technique and converting it into a corresponding regular expression;

matching the words and corresponding parts of speech in the regular expression against the semantic slot.

Further, after determining that the user sentence is meaningless, analyzing the keywords of the user sentence and/or extracting the effective trunk of the user sentence to perform intent recommendation and/or voice guidance specifically comprises:

after determining that the user sentence is meaningless, taking the words corresponding to one or more parts of speech as the keywords of the user sentence, and performing intent recommendation and/or voice guidance according to the keywords; and/or

excluding the words in the regular expression that match the semantic slot, extracting the remaining words in the regular expression as the effective trunk of the user sentence, and performing intent recommendation and/or voice guidance according to the effective trunk.

The present invention also provides a system for analyzing meaningless corpus, characterized by comprising:

a processing module, which obtains a meaningless corpus and summarizes corpus regular expressions from the meaningless corpus;

a control module, which obtains, from the corpus regular expressions summarized by the processing module, a decision condition for determining that corpus is meaningless;

an acquisition module, which obtains a user sentence;

a determination module, which determines that the user sentence is meaningless when the user sentence obtained by the acquisition module meets the decision condition;

an analysis module, which, after the determination module determines that the user sentence is meaningless, analyzes the keywords of the user sentence and/or extracts the effective trunk of the user sentence to perform intent recommendation and/or voice guidance.

Further, the processing module specifically includes:

a word segmentation unit, which obtains the meaningless corpus and segments the corpus samples in the meaningless corpus by a word segmentation technique to obtain the words contained in the corpus samples and their corresponding parts of speech;

a processing unit, which summarizes corpus regular expressions from corpus features, the corpus features comprising the words and the parts of speech obtained by the word segmentation unit.

Further, the control module specifically includes:

a statistic unit, which counts the types and quantities of the parts of speech contained in the corpus regular expressions;

a control unit, which analyzes the types and quantities of the parts of speech in all corpus regular expressions counted by the statistic unit to obtain the decision condition for determining that corpus is meaningless, the decision condition being that the quantity of words of one or more parts of speech reaches a threshold;

a conversion unit, which converts the parts of speech contained in the decision condition obtained by the control unit, and the corresponding words, into a semantic slot.

Further, the system also comprises:

a word segmentation module, which segments the user sentence by the word segmentation technique and converts it into a corresponding regular expression;

a matching module, which matches the words and corresponding parts of speech in the regular expression produced by the word segmentation module against the semantic slot.

Further, the analysis module specifically includes:

an analysis unit, which, after the user sentence is determined to be meaningless, takes the words corresponding to one or more parts of speech as the keywords of the user sentence;

an execution unit, which performs intent recommendation and/or voice guidance according to the keywords; and/or

the analysis unit excludes the words in the regular expression that match the semantic slot, and extracts the remaining words in the regular expression as the effective trunk of the user sentence;

the execution unit performs intent recommendation and/or voice guidance according to the effective trunk.

The method and system for analyzing meaningless corpus provided by the present invention can bring at least one of the following beneficial effects:

1. In the present invention, a meaningless corpus is formed by collecting a large number of meaningless corpus samples, and corpus regular expressions are then summarized from it to obtain the decision condition for determining that corpus is meaningless. A decision condition built on a large number of samples screens out meaningless user sentences more accurately and reduces the likelihood of omissions or errors.

2. In the present invention, after a user sentence is determined to be meaningless, the keywords or effective trunk in the user sentence are still analyzed to obtain the user's true intention, and intent recommendation or voice guidance is then performed, which avoids giving irrelevant feedback based on the raw user sentence.

Brief description of the drawings

The above characteristics, technical features, advantages, and implementations of the method and system for analyzing meaningless corpus are further described below in a clear and easily understood manner, with reference to the drawings and preferred embodiments.

Fig. 1 is a flow chart of a first embodiment of the method for analyzing meaningless corpus of the present invention;

Fig. 2 is a flow chart of a second embodiment of the method for analyzing meaningless corpus of the present invention;

Fig. 3 is a flow chart of a third embodiment of the method for analyzing meaningless corpus of the present invention;

Fig. 4 is a flow chart of a fourth embodiment of the method for analyzing meaningless corpus of the present invention;

Fig. 5 is a structural schematic diagram of a fifth embodiment of the system for analyzing meaningless corpus of the present invention;

Fig. 6 is a structural schematic diagram of a sixth embodiment of the system for analyzing meaningless corpus of the present invention;

Fig. 7 is a structural schematic diagram of a seventh embodiment of the system for analyzing meaningless corpus of the present invention;

Fig. 8 is a structural schematic diagram of an eighth embodiment of the system for analyzing meaningless corpus of the present invention.

Description of reference numerals:

100 system for analyzing meaningless corpus

110 processing module; 111 word segmentation unit; 112 processing unit

120 control module; 121 statistic unit; 122 control unit; 123 conversion unit

130 acquisition module

140 word segmentation module

150 matching module

160 determination module

170 analysis module; 171 analysis unit; 172 execution unit

Detailed description of the embodiments

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, specific embodiments of the present invention are described below with reference to the drawings. Obviously, the drawings in the following description show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings, and other embodiments, from these drawings without creative effort.

To keep the drawings simple, each figure only schematically shows the parts relevant to the present invention; the figures do not represent the actual structure of a product. In addition, to keep the drawings simple and easy to understand, where several components in a figure have the same structure or function, only one of them is drawn or labeled. Herein, "one" does not only mean "only one"; it can also mean "more than one".

The first embodiment of the present invention, as shown in Fig. 1, is a method for analyzing meaningless corpus, comprising:

S100: Obtain a meaningless corpus, and summarize corpus regular expressions from the meaningless corpus.

Specifically, a large number of meaningless corpus samples are collected. A corpus sample may be standard written language, or user speech, audio, and the like, because speech input and text input are both mainstream interaction modes in human-computer interaction.

In addition, since the entire analysis operates on written text, if what is collected is a speech file such as user speech or audio, the speech file must first be converted into recognized text, and that text is then processed accordingly.

The words in each corpus sample and their corresponding parts of speech are analyzed to obtain the corpus regular expression for each sample. Words whose part of speech can be identified are represented in the corpus regular expression by that part of speech, such as verb or adjective. Words that cannot be replaced by a part of speech, or words of particular parts of speech, are still represented by the original word, such as "how" or "how many".

S200: Obtain, from the corpus regular expressions, a decision condition for determining that corpus is meaningless.

Specifically, each corpus sample yields a corresponding corpus regular expression by the method above. The corpus regular expressions of all corpus samples are analyzed together to find the features common to the meaningless corpus, which gives the decision condition for determining that corpus is meaningless.

Because the meaningless corpus contains many corpus samples, a given feature may not appear in every corpus regular expression. Therefore, by system default or user setting, a feature shared by the corpus regular expressions of a certain number or proportion of the corpus samples is taken as the decision condition.

S300: Obtain a user sentence.

S600: When the user sentence meets the decision condition, determine that the user sentence is meaningless.

Specifically, a user sentence is obtained. If the user enters text through the interactive system, the input user sentence is matched directly against the decision condition. If they match, the input user sentence is determined to be meaningless. If they do not match, the input user sentence is determined to have practical meaning, so it is parsed to identify the user's true intention and give corresponding feedback.

If the user chooses voice interaction through the interactive system, the user sentence is first converted into recognized text, and the recognized text is then matched against the decision condition. If they match, the input user sentence is determined to be meaningless. If they do not match, the user sentence has practical meaning, and the user's intention is identified from the user sentence to give corresponding feedback.

S700: After determining that the user sentence is meaningless, analyze the keywords of the user sentence and/or extract the effective trunk of the user sentence to perform intent recommendation and/or voice guidance.

Specifically, when the result above shows that the obtained user sentence is meaningless, the user sentence, or the recognized text converted from it, is analyzed to obtain its keywords or screened to obtain its effective trunk. The user's intention is judged from the keywords or the effective trunk, and the corresponding intent recommendation or voice guidance is then performed.
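The flow of steps S300, S600 and S700 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the invention: the names `handle_sentence`, `too_many_particles` and `non_particles`, the returned labels, and the 0.4 cut-off are hypothetical, and each sentence is assumed to arrive as already recognized, already segmented (word, part of speech) pairs. The particles 的 and 地 stand in for the auxiliary words of the original Chinese text.

```python
from typing import Callable, List, Tuple

Token = Tuple[str, str]  # (word, part of speech); assumed to come from an upstream segmenter


def handle_sentence(
    tokens: List[Token],
    is_meaningless: Callable[[List[Token]], bool],
    extract_hint: Callable[[List[Token]], List[str]],
) -> Tuple[str, List[str]]:
    """Route one user sentence (S300) through the judgment (S600) and follow-up (S700)."""
    if is_meaningless(tokens):
        # S700: a meaningless sentence is still mined for keywords or an effective trunk.
        return ("recommend_or_guide", extract_hint(tokens))
    # Otherwise the sentence is parsed as an ordinary, meaningful request.
    return ("parse_normally", [word for word, _ in tokens])


# Toy stand-ins for the judgment and extraction steps (hypothetical rules).
def too_many_particles(tokens: List[Token]) -> bool:
    particles = sum(1 for _, pos in tokens if pos == "auxiliary word")
    return bool(tokens) and particles / len(tokens) > 0.4


def non_particles(tokens: List[Token]) -> List[str]:
    return [word for word, pos in tokens if pos != "auxiliary word"]


noisy = [("的", "auxiliary word"), ("地", "auxiliary word"), ("autumn", "time word")]
clean = [("describe", "verb"), ("autumn", "time word")]
result_noisy = handle_sentence(noisy, too_many_particles, non_particles)
result_clean = handle_sentence(clean, too_many_particles, non_particles)
```

Passing the judgment and extraction steps in as callables keeps the routing independent of how the decision condition is actually derived.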

In this embodiment, a meaningless corpus is formed by collecting a large number of meaningless corpus samples, and corpus regular expressions are then summarized from it to obtain the decision condition for determining that corpus is meaningless. A decision condition built on a large number of samples screens out meaningless user sentences more accurately and reduces the likelihood of omissions or errors.

In addition, after a user sentence is determined to be meaningless, the keywords or effective trunk in the user sentence are still analyzed to obtain the user's true intention, and intent recommendation or voice guidance is then performed, which avoids giving irrelevant feedback based on the raw user sentence.

The second embodiment of the present invention is a preferred embodiment of the first embodiment above and, as shown in Fig. 2, comprises:

S110: Obtain the meaningless corpus, and segment the corpus samples in the meaningless corpus by a word segmentation technique to obtain the words contained in the corpus samples and their corresponding parts of speech.

Specifically, the meaningless corpus is obtained and its corpus samples are segmented by the word segmentation technique. If a corpus sample is a speech file such as user speech or audio, the speech file is first converted into recognized text, and the recognized text is then segmented.

The word segmentation technique works as follows: first determine the structure of each sentence in the corpus sample, then divide each sentence into tokens such as characters, words, and phrases according to the parts of speech of the words and the relationships between them.

S120: Summarize corpus regular expressions from corpus features, the corpus features comprising the words and the parts of speech.

Specifically, segmenting a corpus sample by the technique above yields several corpus features, from which the corpus regular expression is summarized. The corpus features are the tokens (characters, words, and phrases) produced by segmentation, the part of speech of each token, and the relationships between the tokens within the sentence of the corpus sample.

Each token may appear in the corresponding corpus regular expression either as its part of speech or as the original character, word, or phrase; this can be set by system default or by the user.

For example, a corpus sample is: "what compositions describing autumn are there". The word segmentation technique identifies the part of speech of each word in the sample: describe (verb), autumn (time word), 的 (auxiliary word), composition (noun), has (verb), which (pronoun). The relationships between the words are: attributive-head relationship: composition (noun) and describe (verb); verb-object relationship: describe (verb) and autumn (time word), has (verb) and which (pronoun). Some of the words are replaced by their part of speech, and the others are kept as the original words, so the corpus regular expression for this sample is: describe #time word# 的 #noun# has which.
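The conversion in this example can be sketched as below. It is a hedged illustration only: the function `to_corpus_pattern` and the set `ABSTRACTED_POS` are hypothetical names, the tokens are hand-tagged English glosses rather than the output of a real segmenter, and the auxiliary word of the example is written as the Chinese particle 的, which is an assumption about the untranslated original.

```python
from typing import FrozenSet, List, Tuple

Token = Tuple[str, str]  # (word, part of speech), assumed to be supplied by a segmenter

# Hypothetical setting: parts of speech that are replaced by a "#tag#" placeholder.
# Every other token keeps its original word, as the text allows by default or user setting.
ABSTRACTED_POS: FrozenSet[str] = frozenset({"time word", "noun"})


def to_corpus_pattern(
    tokens: List[Token], abstracted: FrozenSet[str] = ABSTRACTED_POS
) -> List[str]:
    """Turn a tagged sentence into the list of items of its corpus regular expression."""
    return [f"#{pos}#" if pos in abstracted else word for word, pos in tokens]


sample = [
    ("describe", "verb"),
    ("autumn", "time word"),
    ("的", "auxiliary word"),
    ("composition", "noun"),
    ("has", "verb"),
    ("which", "pronoun"),
]
pattern = to_corpus_pattern(sample)
pattern_text = " ".join(pattern)
```

With these settings `pattern_text` reads `describe #time word# 的 #noun# has which`, matching the expression given in the example.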

S210: Count the types and quantities of the parts of speech contained in the corpus regular expression.

Specifically, based on the part of speech of each token in each corpus regular expression, count the types of parts of speech contained in that expression and the number of tokens of each part of speech, and from these compute the proportion of all tokens that each part of speech accounts for.

Tokens that are not represented by a part of speech in the corpus regular expression can be counted directly by their original character, word, or phrase; that is, the original character, word, or phrase is treated as a "part of speech" of its own.

For such "parts of speech", because people phrase things differently, exactly identical tokens are rarely encountered during counting. The semantics of the tokens must therefore be considered, and semantically identical tokens are grouped into the same class, for example "的", "地", and "得", or the several words that all mean "and".

For example, a corpus sample is: "what compositions describing autumn are there". The corresponding corpus regular expression is: describe #time word# 的 #noun# has which. Counting gives: "describe", one; "time word", one; "的", one; "noun", one; "has which", one. Here "describe", "的", and "has which" are treated as parts of speech at the same level as "time word" and "noun".
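The counting step S210 can be sketched with a plain frequency table. This is an assumption-laden illustration: `count_categories` and the `CANONICAL` table are hypothetical, the pattern is given as a ready-made list of items, and the particle 的 with its variants 地 and 得 stands in for the auxiliary words of the original Chinese text.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical table that folds semantically identical literals into one class.
CANONICAL: Dict[str, str] = {"地": "的", "得": "的"}


def count_categories(pattern: List[str]) -> Tuple[Counter, Dict[str, float]]:
    """Count each part of speech (or literal treated as one) and its share of all tokens."""
    normalized = [CANONICAL.get(item, item) for item in pattern]
    counts = Counter(normalized)
    total = len(normalized)
    # An empty pattern has no meaningful ratios, so return an empty mapping for it.
    ratios = {cat: n / total for cat, n in counts.items()} if total else {}
    return counts, ratios


pattern = ["describe", "#time word#", "的", "#noun#", "has which"]
counts, ratios = count_categories(pattern)
```

Each of the five categories is counted once here, so each accounts for one fifth of the tokens.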

S220: Analyze the types and quantities of the parts of speech in all corpus regular expressions to obtain the decision condition for determining that corpus is meaningless, the decision condition being that the quantity of words of one or more parts of speech reaches a threshold.

Specifically, using the types of parts of speech counted in each individual corpus regular expression and the quantity or proportion of each, analyze the types and quantities of the parts of speech contained in all corpus regular expressions to obtain the decision condition for determining that corpus is meaningless.

Collect every type of part of speech that appears in any corpus regular expression, and count, one by one, the proportion of tokens of each part of speech in each corpus regular expression. For some parts of speech the proportion may be 0 in one or more corpus regular expressions, especially when the "part of speech" is an original character, word, or phrase.

Then compare the proportions of each part of speech across the corpus regular expressions. If, in a certain proportion of the corpus samples in the meaningless corpus, the proportion of tokens of one or more parts of speech exceeds a certain threshold, then "the proportion of tokens of that part of speech (or those parts of speech) in the corpus exceeds the threshold" is taken as the decision condition.

For example, if in 70% of the corpus samples in the meaningless corpus the proportion of "的" exceeds 40%, then "the proportion of '的' in the corpus exceeds 40%" is taken as the decision condition. The two thresholds 70% and 40% are only examples; in practice the user can set them freely, and they may be equal or different.
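One possible reading of this two-threshold rule is sketched below. The function name `derive_conditions`, the parameter names, and the choice of "greater than" for the token ratio and "at least" for the share of samples are assumptions made for illustration; the text does not fix these details.

```python
from collections import Counter
from typing import List, Set


def derive_conditions(
    patterns: List[List[str]],
    ratio_threshold: float = 0.4,  # the 40% of the example
    sample_share: float = 0.7,     # the 70% of the example
) -> Set[str]:
    """Return the categories whose token ratio exceeds ratio_threshold
    in at least sample_share of the corpus regular expressions."""
    if not patterns:
        return set()
    hits: Counter = Counter()
    for pattern in patterns:
        if not pattern:
            continue  # an empty expression cannot exceed any ratio
        for category, n in Counter(pattern).items():
            if n / len(pattern) > ratio_threshold:
                hits[category] += 1
    return {cat for cat, h in hits.items() if h / len(patterns) >= sample_share}


# Toy corpus: 8 of 10 expressions are dominated by the particle 的.
particle_heavy = ["的", "的", "#noun#"]
ordinary = ["describe", "#noun#", "的"]
corpus_patterns = [particle_heavy] * 8 + [ordinary] * 2
conditions = derive_conditions(corpus_patterns)
```

In this toy corpus only 的 qualifies, so the resulting decision condition would be "the ratio of 的 exceeds 40%".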

S230: Convert the parts of speech contained in the decision condition and the corresponding words into a semantic slot.

Specifically, since subsequent user sentences are judged by comparison with the decision condition, the parts of speech and corresponding words in the decision condition are converted into a semantic slot. For tokens represented by a part of speech in the corpus regular expression, only the part of speech is converted into the semantic slot; for tokens represented by the original character, word, or phrase, both the part of speech and the corresponding words are converted.

Continuing the example above, with "the proportion of '的' in the corpus exceeds 40%" as the decision condition, the part of speech "的" and the semantically identical "地" and "得" are converted into a semantic slot. If the decision condition is that the proportion of adjectives exceeds 40%, the part of speech "adjective" is converted into a semantic slot.
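A semantic slot of this kind could be represented as a small record, as in the sketch below. The class `SemanticSlot`, the helper `slot_for`, the `SAME_MEANING` table and the `#tag#` naming of part-of-speech categories are all hypothetical; the text does not prescribe a data structure. The particles 的, 地 and 得 stand in for the auxiliary words of the original Chinese text.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class SemanticSlot:
    """What a decision condition matches: parts of speech and/or literal words."""
    pos_tags: FrozenSet[str] = frozenset()
    words: FrozenSet[str] = frozenset()
    threshold: float = 0.4

    def accepts(self, word: str, pos: str) -> bool:
        return pos in self.pos_tags or word in self.words


# Hypothetical table of literals that share the same meaning.
SAME_MEANING: Dict[str, FrozenSet[str]] = {"的": frozenset({"的", "地", "得"})}


def slot_for(category: str, threshold: float = 0.4) -> SemanticSlot:
    """Build a slot from one category of the decision condition."""
    if len(category) > 2 and category.startswith("#") and category.endswith("#"):
        # A category written as "#tag#" is a part of speech: store only the tag.
        return SemanticSlot(pos_tags=frozenset({category[1:-1]}), threshold=threshold)
    # A literal category stores the word together with its same-meaning variants.
    variants = SAME_MEANING.get(category, frozenset({category}))
    return SemanticSlot(words=variants, threshold=threshold)


particle_slot = slot_for("的")
adjective_slot = slot_for("#adjective#")
```

The two slots mirror the two cases in the example: a literal with its variants, and a bare part of speech.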

S300: Obtain a user sentence.

In this embodiment, each corpus sample in the meaningless corpus is parsed one by one into its corpus regular expression, and the corpus regular expressions of all corpus samples are analyzed statistically to obtain the decision condition, which ensures that meaningless corpus can be identified accurately.

The third embodiment of the present invention is a preferred embodiment of the first and second embodiments above and, as shown in Fig. 3, comprises:

S200: Obtain, from the corpus regular expressions, a decision condition for determining that corpus is meaningless.

S300: Obtain a user sentence.

S400: Segment the user sentence by the word segmentation technique and convert it into a corresponding regular expression.

Specifically, the obtained user sentence is segmented by the word segmentation technique: first determine the structure of each sentence in the user sentence, then divide each sentence into tokens such as characters, words, and phrases according to the parts of speech of the words and the relationships between them, which gives the corresponding regular expression.

S500: Match the words and corresponding parts of speech in the regular expression against the semantic slot.

Specifically, the words and corresponding parts of speech in the regular expression are matched against the semantic slot. Each token in the regular expression may appear either as its part of speech or as the original character, word, or phrase. For matching speed, the tokens represented by a part of speech are matched against the semantic slot first, and the tokens represented by the original character, word, or phrase are matched afterwards.

In practice, however, the order in which the two kinds of tokens are matched against the semantic slot does not affect the matching result, so it can be chosen freely.
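The matching of S500 and the resulting judgment can be sketched as follows. The function names, the representation of the semantic slot as two sets, the 0.4 threshold and the treatment of an empty sentence as not meaningless are all assumptions for illustration. The particles 的, 地 and 得 stand in for the auxiliary words of the original Chinese text.

```python
from typing import FrozenSet, List, Tuple

Token = Tuple[str, str]  # (word, part of speech)


def matched_tokens(
    tokens: List[Token], slot_pos: FrozenSet[str], slot_words: FrozenSet[str]
) -> List[Token]:
    """Collect the tokens that match the semantic slot, part-of-speech matches first."""
    by_pos = [t for t in tokens if t[1] in slot_pos]
    by_word = [t for t in tokens if t[1] not in slot_pos and t[0] in slot_words]
    # The two passes cover disjoint tokens, so their order does not change the result.
    return by_pos + by_word


def is_meaningless(
    tokens: List[Token],
    slot_pos: FrozenSet[str],
    slot_words: FrozenSet[str],
    threshold: float = 0.4,
) -> bool:
    """Judge a sentence meaningless when the share of slot matches exceeds the threshold."""
    if not tokens:
        return False  # assumption: an empty sentence is left to the caller
    return len(matched_tokens(tokens, slot_pos, slot_words)) / len(tokens) > threshold


SLOT_WORDS = frozenset({"的", "地", "得"})
NO_POS: FrozenSet[str] = frozenset()

noisy = [("的", "auxiliary word"), ("地", "auxiliary word"),
         ("autumn", "time word"), ("得", "auxiliary word")]
normal = [("describe", "verb"), ("autumn", "time word"), ("的", "auxiliary word"),
          ("composition", "noun"), ("has", "verb"), ("which", "pronoun")]
noisy_flag = is_meaningless(noisy, NO_POS, SLOT_WORDS)
normal_flag = is_meaningless(normal, NO_POS, SLOT_WORDS)
```

Three of the four tokens in `noisy` match the slot, while only one of six does in `normal`, so only the first sentence exceeds the 40% threshold.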

In this embodiment, the obtained user sentence is segmented by the same word segmentation technique to obtain the corresponding regular expression, and the parts of speech and corresponding words in the regular expression are then matched against the semantic slot to obtain a matching result, so that whether the user sentence is meaningless is identified quickly and accurately.

The fourth embodiment of the present invention is a preferred embodiment of the first embodiment above and, as shown in Fig. 4, comprises:

S300: Obtain a user sentence.

S710: After determining that the user sentence is meaningless, take the words corresponding to one or more parts of speech as the keywords of the user sentence, and perform intent recommendation and/or voice guidance according to the keywords; and/or

Specifically, after the user sentence is judged meaningless as above, the words corresponding to one or more parts of speech are selected as keywords in the priority order set by the user, and intent recommendation or voice guidance is then performed according to the keywords.

For example, the user configures a single part of speech to be chosen for keywords, with adjectives preferred, then verbs, and finally time words. If the user sentence contains no adjectives and only verbs and time words, the words corresponding to the verbs are selected as keywords.
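The priority rule in this example can be sketched as a short fallback loop. The function name `pick_keywords` and the default priority tuple are hypothetical, and the tokens are hand-tagged for illustration.

```python
from typing import List, Sequence, Tuple

Token = Tuple[str, str]  # (word, part of speech)


def pick_keywords(
    tokens: List[Token],
    priority: Sequence[str] = ("adjective", "verb", "time word"),
) -> List[str]:
    """Return the words of the first part of speech in `priority` that occurs in the sentence."""
    for pos in priority:
        words = [word for word, token_pos in tokens if token_pos == pos]
        if words:
            return words
    return []  # none of the preferred parts of speech is present


# No adjective is present, so the verbs are chosen ahead of the time word.
sentence = [("describe", "verb"), ("autumn", "time word"), ("has", "verb")]
keywords = pick_keywords(sentence)
```

Changing the `priority` tuple models a different user setting without touching the selection logic.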

S720: Exclude the words in the regular expression that match the semantic slot, extract the remaining words in the regular expression as the effective trunk of the user sentence, and perform intent recommendation and/or voice guidance according to the effective trunk.

Specifically, after the user sentence is judged meaningless as above, the words in the regular expression that match the semantic slot can instead be excluded, the remaining words extracted as the effective trunk, and intent recommendation or voice guidance performed according to the effective trunk.

For example, if the decision condition is that the proportion of "的" in the corpus exceeds 40%, and the semantic slot is the part of speech "的" together with the semantically identical "地" and "得", then "的", "地", and "得" are all excluded from the regular expression corresponding to the user sentence, and the remainder is used as the effective trunk for identifying the user's intention.
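Extracting the effective trunk amounts to filtering out the slot matches, as in this sketch. The function name `effective_trunk` and the two-set representation of the semantic slot are assumptions, and the particles 的, 地 and 得 stand in for the auxiliary words of the original Chinese text.

```python
from typing import FrozenSet, List, Tuple

Token = Tuple[str, str]  # (word, part of speech)


def effective_trunk(
    tokens: List[Token],
    slot_words: FrozenSet[str],
    slot_pos: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Drop every token that matches the semantic slot and keep the rest in order."""
    return [
        word
        for word, pos in tokens
        if word not in slot_words and pos not in slot_pos
    ]


slot_words = frozenset({"的", "地", "得"})
sentence = [
    ("的", "auxiliary word"),
    ("autumn", "time word"),
    ("地", "auxiliary word"),
    ("composition", "noun"),
    ("得", "auxiliary word"),
]
trunk = effective_trunk(sentence, slot_words)
```

Word order is preserved, so the trunk can be passed on to intent recognition as an ordinary shortened sentence.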

In this embodiment, after the obtained user sentence is determined to be meaningless, the user's true intention is still identified as far as possible by selecting keywords or extracting the effective trunk. Identifying the true intention from keywords or the effective trunk removes the interference of some words and, to some extent, reduces the chance of misreading the user's intention.

The fifth embodiment of the present invention, as shown in Fig. 5, is a system 100 for analyzing meaningless corpus, comprising:

A processing module 110, which obtains a meaningless corpus and summarizes corpus regular expressions from the meaningless corpus.

Specifically, the processing module 110 collects a large number of meaningless corpus samples. A corpus sample may be standard written language, or user speech, audio, and the like, because speech input and text input are both mainstream interaction modes in human-computer interaction.

The processing module 110 analyzes the words in each corpus sample and their corresponding parts of speech to obtain the corpus regular expression for each sample. Words whose part of speech can be identified are represented in the corpus regular expression by that part of speech, such as verb or adjective. Words that cannot be replaced by a part of speech, or words of particular parts of speech, are still represented by the original word, such as "how" or "how many".

A control module 120, which obtains, from the corpus regular expressions summarized by the processing module 110, a decision condition for determining that corpus is meaningless.

Specifically, each corpus sample yields a corresponding corpus regular expression by the method above. The control module 120 analyzes the corpus regular expressions of all corpus samples together to find the features common to the meaningless corpus, which gives the decision condition for determining that corpus is meaningless.

Module 130 is obtained, user's sentence is obtained.

Determination module 160 is sentenced when user's sentence that the acquisition module 130 obtains meets the decision condition Fixed user's sentence is meaningless.

Specifically, obtaining module 130 obtains user's sentence, if what user inputted by interactive system is text, sentence Cover half block 160 directly matches the user's sentence and decision condition of input, if matched the result is that be consistent, determines defeated The user's sentence entered is meaningless.If matched the result is that be not consistent, it is certain to determine that user's sentence of input has Practical significance, therefore user's sentence of input is parsed, to identify that the true intention of user is fed back accordingly.

If the user chooses voice interaction through the interactive system, the determination module 160 first converts the input user sentence into recognized text and then matches the recognized text against the decision condition above. If the match succeeds, the input user sentence is determined to be meaningless. If the match fails, the user sentence has practical meaning, and the user's intention is identified from the user sentence to give corresponding feedback.
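
The two input paths can be sketched as follows. This is an illustration only: the decision condition is assumed to take the form of a mapping from part of speech to a minimum ratio, the tagger and speech recognizer are passed in as hypothetical callables (`tag`, `recognize`), and `toy_tag` is a stand-in for a real segmenter.

```python
def is_meaningless(tagged, condition):
    """Return True when any part of speech in `condition` reaches its ratio.

    tagged:    list of (word, part_of_speech) pairs
    condition: dict mapping part_of_speech -> minimum ratio (assumed form)
    """
    if not tagged:
        return False  # assumption: empty input is left for the caller to handle
    counts = {}
    for _, pos in tagged:
        counts[pos] = counts.get(pos, 0) + 1
    total = len(tagged)
    return any(counts.get(pos, 0) / total >= limit
               for pos, limit in condition.items())


def route(user_input, condition, tag, recognize=None):
    """Text is matched directly; voice is first converted with `recognize`."""
    text = recognize(user_input) if recognize is not None else user_input
    return "meaningless" if is_meaningless(tag(text), condition) else "parse_intent"


def toy_tag(text):
    """Stand-in tagger: whitespace split, "的" tagged as a particle."""
    return [(w, "particle" if w == "的" else "other") for w in text.split()]


condition = {"particle": 0.4}
text_result = route("的 的 天气", condition, toy_tag)
voice_result = route(b"raw-audio", condition, toy_tag,
                     recognize=lambda audio: "今天 天气 好")
```

Passing the recognizer as an optional argument keeps the matching logic identical for both input modes, which mirrors the description: only the conversion to recognized text differs.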

An analysis module 170, which, after the determination module 160 determines that the user sentence is meaningless, analyzes the keywords of the user sentence and/or extracts the effective trunk of the user sentence to perform intention recommendation and/or voice guidance.

Specifically, after the above result shows that the obtained user sentence is meaningless, the analysis module 170 analyzes the user sentence, or the recognized text converted from it, to obtain its keywords or to screen out its effective trunk. It then judges the user's intention from the keywords or the effective trunk and performs the corresponding intention recommendation or voice guidance.

The sixth embodiment of the present invention is a preferred embodiment of the fifth embodiment above and, as shown in Figure 6, comprises:

The processing module 110 specifically includes:

A word segmentation unit 111, which obtains the meaningless corpus and segments the corpus samples in it according to a word segmentation technique, obtaining the words contained in each corpus sample and their corresponding parts of speech.

Specifically, the word segmentation unit 111 obtains the meaningless corpus and segments the corpus samples in it according to the word segmentation technique. If a corpus sample is a voice file such as user speech or audio, the voice file must first be converted into recognized text, and the recognized text is then segmented.

A processing unit 112, which summarizes the corpus regular expressions according to corpus features, the corpus features including the words and the parts of speech obtained by the word segmentation unit 111.

Specifically, segmenting a corpus sample with the word segmentation technique above yields several corpus features, and the processing unit 112 summarizes the corpus regular expression from these features. The corpus features are the segments obtained after segmentation (characters, words, and phrases), the part of speech of each segment, and the relationship of each segment to the sentence of the corpus sample.

The control module 120 specifically includes:

A statistics unit 121, which counts the types and quantities of the parts of speech contained in the corpus regular expressions.

Specifically, based on the part of speech of each segment in each corpus regular expression, the statistics unit 121 counts the types of parts of speech contained in the corpus regular expression and the number of segments (characters, words, and phrases) corresponding to each part of speech. It then calculates the proportion that the segments of each part of speech make up of all segments.
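
The counting step lends itself to a short sketch. The input is assumed to be one tagged sample given as (segment, part of speech) pairs, and the function name `pos_statistics` is hypothetical.

```python
from collections import Counter


def pos_statistics(tagged):
    """Count segments per part of speech and compute each share of the total.

    tagged: list of (segment, part_of_speech) pairs for one corpus sample.
    Returns (counts, ratios); ratios is empty for an empty sample.
    """
    counts = Counter(pos for _, pos in tagged)
    total = sum(counts.values())
    ratios = {pos: n / total for pos, n in counts.items()} if total else {}
    return counts, ratios


counts, ratios = pos_statistics([
    ("的", "particle"), ("的", "particle"), ("天气", "noun"), ("好", "adjective"),
])
```

The number of keys in `counts` gives the number of part-of-speech types, and `ratios` gives the proportion described above.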

A control unit 122, which analyzes the types and quantities of the parts of speech in all of the corpus regular expressions and obtains the decision condition for judging a corpus to be meaningless, the decision condition being that the number of words of one or more parts of speech reaches a threshold.

Specifically, using the types of parts of speech contained in each individual corpus regular expression and the number or proportion corresponding to each part of speech, as counted above, the control unit 122 analyzes the types and quantities of the parts of speech contained in all of the corpus regular expressions and obtains the decision condition for judging a corpus to be meaningless.
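
The text does not say how the threshold is chosen. The sketch below shows one possible heuristic, offered purely as an assumption: keep the parts of speech that occur in every meaningless sample, use the smallest ratio observed for each as its threshold, and discard thresholds below a floor. The function name `derive_condition` and the `floor` parameter are hypothetical.

```python
def derive_condition(sample_ratios, floor=0.4):
    """Derive {part_of_speech: threshold} from per-sample ratio tables.

    sample_ratios: list of {part_of_speech: ratio} dicts, one per sample.
    Heuristic (an assumption, not the patent's stated method): a part of
    speech qualifies if it appears in every sample and its smallest
    observed ratio is at least `floor`.
    """
    if not sample_ratios:
        return {}
    shared = set(sample_ratios[0])
    for ratios in sample_ratios[1:]:
        shared &= set(ratios)
    thresholds = {pos: min(r[pos] for r in sample_ratios) for pos in shared}
    return {pos: t for pos, t in thresholds.items() if t >= floor}


sample_ratios = [
    {"particle": 0.5, "noun": 0.25, "verb": 0.25},
    {"particle": 0.6, "noun": 0.4},
]
condition = derive_condition(sample_ratios)
```

In this toy data only the particle is both shared and frequent enough, so it alone ends up in the condition.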

A conversion unit 123, which converts the parts of speech contained in the decision condition obtained by the statistics unit 121, and the corresponding words, into semantic slots.

Specifically, because subsequent user sentences are judged by comparing them with the decision condition above, the conversion unit 123 converts the parts of speech in the decision condition and the corresponding words into semantic slots. For a segment expressed by its part of speech in the corpus regular expression, only the part of speech is converted into a semantic slot. For a segment expressed by the original character, word, or phrase, both the part of speech and the corresponding word are converted into a semantic slot.
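
The two kinds of slot can be modeled with a small data structure. This is a sketch under assumptions: the `SemanticSlot` class, its field names, and the `to_slots` helper are hypothetical, and the condition entries are assumed to arrive as (part of speech, word or `None`) pairs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SemanticSlot:
    pos: str                    # part of speech the slot constrains
    word: Optional[str] = None  # None means the slot matches on POS only


def to_slots(condition_items):
    """Build slots from (part_of_speech, word_or_None) pairs.

    A pair with word=None comes from a segment expressed by its part of
    speech; a pair with a word comes from a segment kept as the original word.
    """
    return [SemanticSlot(pos=pos, word=word) for pos, word in condition_items]


slots = to_slots([("verb", None), ("particle", "的")])
```

Keeping `word` optional lets one type represent both cases described above, so later matching code can branch on whether `word` is set.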

An acquisition module 130, which obtains a user sentence.

The seventh embodiment of the present invention is a preferred embodiment of the fifth and sixth embodiments above and, as shown in Figure 7, comprises:

An acquisition module 130, which obtains a user sentence.

A word segmentation module 140, which segments the user sentence according to the word segmentation technique and converts it into a corresponding regular expression.

Specifically, the word segmentation module 140 segments the obtained user sentence according to the word segmentation technique. It first determines the structure of each sentence in the user sentence. Then, for every sentence, it divides the whole sentence into segments such as characters, words, and phrases according to the parts of speech of the words and the relationships between them, thereby obtaining the corresponding regular expression.

A matching module 150, which matches the words and corresponding parts of speech in the regular expression converted by the word segmentation module 140 against the semantic slots.

Specifically, the matching module 150 matches the words and corresponding parts of speech in the regular expression against the semantic slots. Each segment (character, word, or phrase) in the regular expression may be expressed either by its part of speech or by the original character, word, or phrase. For the sake of matching speed, the segments expressed by part of speech are matched against the semantic slots first, and the segments expressed by the original characters, words, and phrases are matched against the semantic slots afterwards.
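
The two-pass order can be sketched as below. The representation is an assumption made for illustration: each pattern entry is a (word, part of speech, expressed_as_pos) triple, each slot is a (part of speech, word or `None`) pair, and `match_slots` is a hypothetical name.

```python
def match_slots(pattern, slots):
    """Return indices of matched pattern entries, POS-expressed entries first.

    pattern: list of (word, part_of_speech, expressed_as_pos) triples
    slots:   list of (part_of_speech, word_or_None) pairs
    """
    pos_only = {pos for pos, word in slots if word is None}
    literal = {(pos, word) for pos, word in slots if word is not None}

    matched = []
    # Pass 1: entries expressed by part of speech need only a set lookup on POS.
    for i, (_, pos, as_pos) in enumerate(pattern):
        if as_pos and pos in pos_only:
            matched.append(i)
    # Pass 2: entries kept as original words must match both POS and word.
    for i, (word, pos, as_pos) in enumerate(pattern):
        if not as_pos and (pos, word) in literal:
            matched.append(i)
    return matched


pattern = [("的", "particle", False), ("跑", "verb", True), ("天气", "noun", False)]
slots = [("verb", None), ("particle", "的")]
matched = match_slots(pattern, slots)
```

The returned list records the order in which matches were found, so the part-of-speech match at index 1 appears before the literal match at index 0.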

The eighth embodiment of the present invention is a preferred embodiment of the fifth embodiment above and, as shown in Figure 8, comprises:

An acquisition module 130, which obtains a user sentence.

The analysis module 170 specifically includes:

An analysis unit 171, which, after the user sentence is determined to be meaningless, takes the words corresponding to one or more parts of speech as the keywords of the user sentence.

An execution unit 172, which performs intention recommendation and/or voice guidance according to the keywords; and/or

Specifically, after the obtained user sentence is determined to be meaningless as above, the analysis unit 171 selects, in the priority order set by the user, the words corresponding to one or more parts of speech as keywords. The execution unit 172 then performs intention recommendation or voice guidance according to the keywords.
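
Keyword selection by priority order might look like the sketch below. The function name `pick_keywords`, the `limit` parameter, and the sample priority list are hypothetical, and the sentence is assumed to be already tagged.

```python
def pick_keywords(tagged, pos_priority, limit=1):
    """Collect the words of the first `limit` parts of speech that occur.

    tagged:       list of (word, part_of_speech) pairs
    pos_priority: parts of speech in the order preferred by the user
    """
    keywords = []
    used = 0
    for pos in pos_priority:
        words = [w for w, p in tagged if p == pos]
        if words:
            keywords.extend(words)
            used += 1
            if used == limit:
                break
    return keywords


tagged = [("的", "particle"), ("天气", "noun"), ("查", "verb"), ("的", "particle")]
one_pos = pick_keywords(tagged, ["noun", "verb"])
two_pos = pick_keywords(tagged, ["noun", "verb"], limit=2)
```

Parts of speech that do not occur in the sentence are skipped, so the selection falls through to the next entry in the priority list.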

The analysis unit 171 excludes the words in the regular expression that match the semantic slots and extracts the remaining words in the regular expression as the effective trunk of the user sentence.

The execution unit 172 performs intention recommendation and/or voice guidance according to the effective trunk.

Specifically, after the obtained user sentence is determined to be meaningless as above, the analysis unit 171 may alternatively exclude the words in the regular expression that match the semantic slots and extract the remaining words as the effective trunk. The execution unit 172 then performs intention recommendation or voice guidance according to the effective trunk.

It should be noted that the above embodiments can be freely combined as needed. The above are only preferred embodiments of the present invention. Those of ordinary skill in the art may make several improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the scope of protection of the present invention.

Claims

1. An analysis method for meaningless corpus, characterized by comprising:

Obtaining a user sentence;

After the user sentence is determined to be meaningless, analyzing the keywords of the user sentence and/or extracting the effective trunk of the user sentence to perform intention recommendation and/or voice guidance.

2. The analysis method for meaningless corpus according to claim 1, characterized in that obtaining the meaningless corpus and summarizing the corpus regular expressions according to the meaningless corpus specifically comprises:

Obtaining the meaningless corpus, and segmenting the corpus samples in the meaningless corpus according to a word segmentation technique to obtain the words contained in the corpus samples and their corresponding parts of speech;

3. The analysis method for meaningless corpus according to claim 2, characterized in that obtaining, according to the corpus regular expressions, the decision condition for judging a corpus to be meaningless specifically comprises:

Analyzing the types and quantities of the parts of speech in all of the corpus regular expressions to obtain the decision condition for judging a corpus to be meaningless, the decision condition being that the number of words of one or more parts of speech reaches a threshold;

4. The analysis method for meaningless corpus according to claim 3, characterized in that, after obtaining the user sentence and before determining that the user sentence is meaningless when the user sentence meets the decision condition, the method comprises:

5. The analysis method for meaningless corpus according to claim 4, characterized in that, after the user sentence is determined to be meaningless, analyzing the keywords of the user sentence and/or extracting the effective trunk of the user sentence to perform intention recommendation and/or voice guidance specifically comprises:

After the user sentence is determined to be meaningless, taking the words corresponding to one or more parts of speech as the keywords of the user sentence, and performing intention recommendation and/or voice guidance according to the keywords; and/or

Excluding the words in the regular expression that match the semantic slots, extracting the remaining words in the regular expression as the effective trunk of the user sentence, and performing intention recommendation and/or voice guidance according to the effective trunk.

6. An analysis system for meaningless corpus, characterized by comprising:

A control module, which obtains, according to the corpus regular expressions summarized by the processing module, the decision condition for judging a corpus to be meaningless;

An acquisition module, which obtains a user sentence;

A determination module, which determines that the user sentence is meaningless when the user sentence obtained by the acquisition module meets the decision condition;

An analysis module, which, after the determination module determines that the user sentence is meaningless, analyzes the keywords of the user sentence and/or extracts the effective trunk of the user sentence to perform intention recommendation and/or voice guidance.

7. The analysis system for meaningless corpus according to claim 6, characterized in that the processing module specifically comprises:

A word segmentation unit, which obtains the meaningless corpus and segments the corpus samples in the meaningless corpus according to a word segmentation technique, obtaining the words contained in the corpus samples and their corresponding parts of speech;

A processing unit, which summarizes the corpus regular expressions according to corpus features, the corpus features including the words and the parts of speech obtained by the word segmentation unit.

8. The analysis system for meaningless corpus according to claim 7, characterized in that the control module specifically comprises:

A control unit, which analyzes the types and quantities of the parts of speech in all of the corpus regular expressions and obtains the decision condition for judging a corpus to be meaningless, the decision condition being that the number of words of one or more parts of speech reaches a threshold;

A conversion unit, which converts the parts of speech contained in the decision condition obtained by the statistics unit, and the corresponding words, into semantic slots.

9. The analysis system for meaningless corpus according to claim 8, characterized by further comprising:

A word segmentation module, which segments the user sentence according to the word segmentation technique and converts it into a corresponding regular expression;

A matching module, which matches the words and corresponding parts of speech in the regular expression converted by the word segmentation module against the semantic slots.

10. The analysis system for meaningless corpus according to claim 9, characterized in that the analysis module specifically comprises:

An analysis unit, which, after the user sentence is determined to be meaningless, takes the words corresponding to one or more parts of speech as the keywords of the user sentence;

The analysis unit excludes the words in the regular expression that match the semantic slots and extracts the remaining words in the regular expression as the effective trunk of the user sentence;