CN103870489B - Self-expanding recognition method for Chinese personal names based on search logs - Google Patents
- Publication number
- CN103870489B (application number CN201210539985.6A)
- Authority
- CN
- China
- Prior art keywords
- name
- template
- candidate
- query string
- search log
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention belongs to the natural language processing field of computational linguistics and discloses a self-expanding method for recognizing Chinese personal names in search logs. Following the "surname-driven" approach to name recognition, the method uses the fact that many query strings in a search log begin with a surname to mine seed names. The seed names are used to generate a set of candidate name templates from the search log. The name templates are screened according to how the frequency of each candidate template changes between the query strings that contain seed names and the whole query log. Candidate names are generated from the name templates, then delimited and screened by forward-backward keyword matching to obtain a name set. Following the self-expansion iteration idea, the current name set serves as the seed name set for the next iteration, and the n templates with the highest discrimination in the name template set serve as the seed templates for the next iteration, so that names are mined from the search log. The method builds seed names and generates name templates from features of the search log itself, and filters name templates according to the trend of name contexts between the corresponding query strings and all query strings of the search log, which reduces noise during name recognition and improves the name recognition rate in search logs.
Description
Technical field
The invention belongs to the natural language processing field of computational linguistics and specifically relates to a self-expanding method for recognizing Chinese personal names based on search logs.
Background
With the rapid growth of information on the Internet, search engines have become increasingly important. Chinese search engines now have a huge user base, process hundreds of millions of requests every day, and have accumulated large-scale query logs. Named entities account for a significant proportion of search logs. According to statistics reported by researchers, 2% to 4% of the web search queries updated each day consist of a single personal name, and about 30% of queries contain a personal name. Researchers who annotated 76,717 query strings found 961 distinct names occurring 6,245 times, which is 8.14% of all queries. Faced with rapidly growing data, industry and academia are both actively looking for effective ways to improve search quality. Effectively recognizing the names in query logs can, on the one hand, locate the user's search need accurately and improve search quality, and, on the other hand, yield newer and more complete name information to expand related resource libraries.
Traditional name recognition is mostly performed on plain text, and the recognition algorithms fall roughly into rule-based and statistics-based methods. Plain text has rich contextual information, whereas the query strings in a search log carry little information, do not follow strict syntactic rules, and are highly arbitrary, so name recognition in query logs cannot directly reuse the methods of the plain-text field. Name recognition based on search logs currently takes two main forms: (1) supervised recognition based on large manually annotated corpora; (2) weakly supervised recognition based on the template iteration paradigm. The first approach costs considerable labor and resources, and its recognition performance depends on the annotated corpus, which is fairly subjective. The second approach produces new candidate names by iterating over templates, but the chosen templates restrict the categories of potential names, so the recognition results are not accurate enough. The invention therefore provides an improved self-expanding name recognition method that expands name context information according to the trend of each template across query strings and extracts new candidate names. The method makes the self-expanded query string contexts contain, as far as possible, only the target entity category. It addresses the low efficiency of name recognition in current search logs and provides a technical reference and practical basis for entity recognition in short texts.
Summary of the invention
The technical problem to be solved by the invention is to provide a self-expanding name recognition method for search logs.
To address problems such as the low efficiency of name recognition in current search logs, the invention provides a self-expanding method for recognizing Chinese personal names based on search logs. The method comprises the following steps:
S11: determine the target corpus, i.e., the set Q of all query strings in the query log that will be used to mine name entities;
S12: build the seed name set C;
S13: generate the candidate name template set Mc from the seed name set C and the set Qc of query strings in which the seed names occur;
S14: screen the name templates according to the trend in the number of query strings each template matches in Qc and in Q, selecting context information suitable for name recognition, to obtain the name template set Mc';
S15: match the templates of the name template set Mc' against the query string set Q to generate the candidate name set CN;
S16: screen the candidate names, removing interfering noise, to obtain a name set N of higher credibility;
S17: update the seed name set C with the name set N and iterate the five steps S12 to S16 until the credibility of the obtained names reaches a certain threshold.
In S12, seed names are built by borrowing the "surname-driven" name recognition idea from the plain-text field. Because a considerable share of the names in query logs appear at the beginning of the query string, candidate seed names are discovered automatically, driven by the surname in the first character. The confidence of each candidate seed name is then evaluated by counting the other candidate seed names matched by the templates in which it occurs, and the seed name set C is selected accordingly.
In S13, each element of the candidate name template set Mc takes the form <candidate template M, number of query strings matched by candidate template M in the current query string set Qc>.
During one iteration of name recognition, the templates in the candidate template set Mc are sorted in descending order of the number of query strings they match in Qc to produce the ordered set last, and in descending order of the number of query strings they match in Q to produce the ordered set now. For any template M, a five-tuple IM is defined, IM = (M, Rank_now, Rank_now_sum, Rank_last, Rank_last_sum), where Rank_now is the rank of template M in the ordered set now, Rank_now_sum is the largest rank number in the ordered set now, Rank_last is the rank of template M in the ordered set last, and Rank_last_sum is the largest rank number in the ordered set last.
Discrimination is defined to describe how well a template distinguishes names. For a template M, its credibility is judged from the trend of its relative rank between the ordered sets last and now. The discrimination is therefore calculated as follows:
[formula not reproduced in this text]
The templates are sorted in ascending order of discrimination value. In the self-expanding recognition phase, each iteration selects the top n templates as the seed templates of the next iteration, giving the name template set Mc'.
Discrimination selects the expanded name templates, but the content matched by a template may still include non-name material. The candidate names obtained by template matching therefore need further boundary delimitation and screening to ensure the accuracy of the recognized names. Based on the characteristics of candidate names in query strings, a "forward-backward keyword matching method" is designed to delimit the boundaries of candidate names and thereby delimit, filter, and screen them.
The forward-backward keyword matching method first requires a keyword library. Any candidate name can be written as W = {W1 … Wi … Wn}, where Wi is a Chinese character. The method works as follows: starting from the first character, perform forward maximum matching, and if there is a match, delete the matched content from W and update W; starting from the last character, perform backward maximum matching, and if there is a match, delete the matched content from W and update W. The forward-backward matching is iterated until W no longer changes. The maximum match length max and the minimum match length min can be set according to the characteristics of the candidate names.
The self-expanding method for recognizing Chinese personal names based on search logs provided by the technical solution takes into account that query strings in search logs carry little information and have irregular content. It mines seed names from the search log with the "surname-driven" name recognition idea from the plain-text field, designs name templates from the seed names, and screens the name templates by the trend of their ranking between the query strings containing seed names and the query strings of the whole target corpus. Using the resulting name templates and the designed forward-backward keyword matching, it delimits and filters candidate names and, through self-expansion, finally recognizes Chinese personal names in search logs, reducing noise during name recognition and improving the recognition rate.
Brief description of the drawings
Fig. 1 is a flow chart of the core technique of the self-expanding method for recognizing Chinese personal names based on search logs provided by an embodiment of the invention.
Fig. 2 is a flow chart of seed name discovery in the self-expanding method for recognizing Chinese personal names based on search logs provided by an embodiment of the invention.
Detailed description
To meet the current demand for precise search and to solve name recognition in retrieval queries, the embodiment of the invention provides a method for recognizing Chinese personal names in search logs. Following the self-expanding recognition idea, it builds name templates from seed names, screens name contexts according to the trend of each template's ranking between the query strings containing seed names and the query strings of the whole target corpus, and applies pattern matching to delimit candidate names, reducing noise during name recognition and improving the recognition rate.
To make the purpose, technical method, and advantages of the embodiment clearer, the technical solution provided by the embodiment is described below with reference to the drawings.
Fig. 1 shows the core flow of the self-expanding method for recognizing Chinese personal names based on search logs in the embodiment. The target corpus Q (the query strings of the search log) is selected (S11). Using the technical solution given by the seed name discovery flow chart in Fig. 2, the seed name set C is mined from the target corpus Q with the "surname-driven" idea (S12). The query strings Qc containing seed names are obtained, and the candidate name template set Mc is generated from them (S13). The discrimination of each template is calculated from the different trends of its frequency in Q and in Qc, and the candidate name template set is screened (S14). The resulting name templates generate the candidate name set CN in Q (S15). Because the candidate names contain some non-name information, forward-backward keyword matching is designed to delimit and screen them (S16), giving the name set N. The name set N is used to update the seed name set C, and the five steps S12 to S16 are iterated until the names in Q reach a set threshold.
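The iteration in Fig. 1 can be sketched as a small driver loop. This is only an illustrative skeleton under stated assumptions: the five callables are hypothetical stand-ins for steps S13 to S16 and the threshold test, and stopping when no new name is found or after `max_iters` rounds is an assumption, since the text only says the loop runs until a set threshold is reached.

```python
def self_expand(queries, seed_names, make_templates, select_templates,
                match_candidates, screen_candidates, done, max_iters=10):
    """Illustrative driver for the S12-S17 loop; all callables are hypothetical."""
    names = set(seed_names)                       # S12: current seed name set C
    for _ in range(max_iters):
        candidates_mc = make_templates(queries, names)          # S13: build Mc
        templates = select_templates(candidates_mc, queries)    # S14: screen to Mc'
        candidates = match_candidates(queries, templates)       # S15: build CN
        new_names = names | set(screen_candidates(candidates))  # S16: build N
        # Assumed stop rule: nothing new was found, or the caller's test passes.
        if new_names == names or done(new_names):
            return new_names
        names = new_names                         # S17: N becomes the new C
    return names
```

The sketch regenerates templates from the enlarged name set on every round; carrying the top n templates forward as seed templates, as the text describes, would be a straightforward extension.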
Fig. 2 shows the seed name discovery flow of the self-expanding method for recognizing Chinese personal names based on search logs in the embodiment. The "surname-driven" idea uses a list of Chinese surnames to count how often each surname appears as the first character of a query string. The 4 surnames that occur most often as first characters are selected as seed surnames, and every query string whose first character is a seed surname and whose length is 2 or 3 is taken as a seed name.
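The seed discovery step can be sketched as follows. This is a minimal illustration, assuming single-character surnames and a caller-supplied surname list; the function name and parameters are hypothetical, and the confidence evaluation of candidate seed names mentioned in the summary is not included.

```python
from collections import Counter

def mine_seed_names(queries, surnames, top_k=4, lengths=(2, 3)):
    """Return (seed_surnames, seed_names) using the surname-driven idea."""
    # Count how often each known surname is the first character of a query.
    first_char_counts = Counter(
        q[0] for q in queries if q and q[0] in surnames
    )
    # Keep the top_k most frequent first-character surnames as seed surnames.
    seed_surnames = [s for s, _ in first_char_counts.most_common(top_k)]
    seed_set = set(seed_surnames)
    # A seed name is a whole query of length 2 or 3 that starts with a seed surname.
    seed_names = {
        q for q in queries if len(q) in lengths and q[0] in seed_set
    }
    return seed_surnames, seed_names
```

With `top_k=4` and `lengths=(2, 3)` the defaults mirror the values given in the embodiment; the example queries used to exercise it would be illustrative, not data from the patent.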
While the candidate name templates are generated in S13, the number of query strings that generate each template is recorded. This gives the candidate name template set Mc, whose elements have the form <candidate template M, number of query strings matched by candidate template M in the current query string set Qc>.
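One way to picture template generation is to replace the seed name in each query string with a placeholder and count the query strings that yield each template. The sketch below assumes the ".+" placeholder used in the ".+ résumé" example later in the description; the function name and the choice to skip queries that consist of the name alone are assumptions for illustration.

```python
from collections import Counter

PLACEHOLDER = ".+"  # assumed placeholder, following the ".+ résumé" example

def generate_candidate_templates(queries, seed_names):
    """Return (Qc, Mc): seed-bearing queries and template -> query count."""
    # Qc: the query strings that contain at least one seed name.
    qc = [q for q in queries if any(name in q for name in seed_names)]
    counts = Counter()
    for q in qc:
        templates = set()
        for name in seed_names:
            # A query that is exactly the name has no context to keep.
            if name in q and q != name:
                templates.add(q.replace(name, PLACEHOLDER, 1))
        # Each query string is counted once per template it generates.
        counts.update(templates)
    return qc, counts
```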
The templates in the candidate template set Mc are sorted in descending order of the number of query strings they match in Qc to produce the ordered set last, and at the same time in descending order of the number of query strings they match in Q to produce the ordered set now. For any template M, a five-tuple IM is defined, IM = (M, Rank_now, Rank_now_sum, Rank_last, Rank_last_sum), where Rank_now is the rank of template M in the ordered set now, Rank_now_sum is the largest rank number in the ordered set now, Rank_last is the rank of template M in the ordered set last, and Rank_last_sum is the largest rank number in the ordered set last.
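The five-tuple can be built from the two count tables as sketched below. The sketch assumes 1-based dense ranking, in which templates with equal counts share a rank, so that the largest rank number can differ between the two orderings; the text does not state how ties are handled. It stops at the tuple because the discrimination formula is not reproduced in this text.

```python
from typing import NamedTuple

class TemplateInfo(NamedTuple):
    """The five-tuple IM = (M, Rank_now, Rank_now_sum, Rank_last, Rank_last_sum)."""
    template: str
    rank_now: int
    rank_now_sum: int
    rank_last: int
    rank_last_sum: int

def dense_ranks(counts):
    """Rank templates by descending count; ties share a rank (an assumption)."""
    distinct = sorted(set(counts.values()), reverse=True)
    rank_of = {c: i for i, c in enumerate(distinct, start=1)}
    return {t: rank_of[c] for t, c in counts.items()}, len(distinct)

def build_template_info(counts_qc, counts_q):
    """counts_qc: matches in Qc (ordering 'last'); counts_q: matches in Q ('now')."""
    last, last_sum = dense_ranks(counts_qc)
    # Rank the same templates by their counts in the whole corpus Q.
    now, now_sum = dense_ranks({t: counts_q.get(t, 0) for t in counts_qc})
    return {
        t: TemplateInfo(t, now[t], now_sum, last[t], last_sum)
        for t in counts_qc
    }
```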
The ability of each template to distinguish names is calculated and used to screen the templates. The discrimination is calculated as:
[formula not reproduced in this text]
The templates are sorted in ascending order of discrimination value, and the top 50 templates are selected as seed templates for the next iteration, giving the name template set Mc'.
The templates of the name template set Mc' are matched one by one against the query string set Q to generate the candidate name set CN.
Although discrimination selects the expanded templates, the content matched by a template may still include non-name material. Take the template ".+ résumé" as an example: the query string "how to write a résumé" matches this template, but "how to write" is obviously not a name. For the query string "CCTV Wang Xiaoya résumé", the candidate name obtained by template matching is "CCTV Wang Xiaoya", which includes the modifier "CCTV" in addition to the name "Wang Xiaoya". The candidate names obtained by template matching therefore need further boundary delimitation and screening to ensure the accuracy of the recognized names.
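The matching step and the noise it produces can be reproduced with ordinary regular expressions. The sketch below is an assumption about representation: it treats ".+" in a template as a capture group and requires the whole query string to match. The Chinese strings are illustrative reconstructions of the translated examples above, not text taken from the patent.

```python
import re

def template_to_regex(template, placeholder=".+"):
    """Turn a template such as '.+简历' into a regex with one group per slot."""
    parts = template.split(placeholder)
    # Escape the literal context and capture whatever fills each placeholder.
    return re.compile("(.+)".join(re.escape(p) for p in parts))

def generate_candidates(queries, templates):
    """Return the set CN of strings that fill a template slot in some query."""
    candidates = set()
    for template in templates:
        rx = template_to_regex(template)
        for q in queries:
            m = rx.fullmatch(q)
            if m:
                candidates.update(m.groups())
    return candidates
```

Both a non-name ("怎么写", "how to write") and an over-long span ("央视王小丫", "CCTV Wang Xiaoya") come out of the same template, which is why the boundary delimitation step is needed.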
Query strings in a query log are short, do not follow general syntactic rules, and mostly consist of several words joined together. The context of a name in a query string mostly involves attribute information related to the name, such as titles, places, and occupations. The "forward-backward keyword matching method" is therefore designed, and the keyword library is built from the People's Daily word-segmented corpus plus place names of China's provinces, cities, districts, and counties.
A candidate name is written as W = {W1 … Wi … Wn}, where Wi is a Chinese character. Starting from the first character, forward maximum matching is performed against the dictionary, and if there is a match, the matched content is deleted from W and W is updated. Starting from the last character, backward maximum matching is performed, and if there is a match, the matched content is deleted from W and W is updated. The forward-backward matching is iterated until W no longer changes. Given the usual length of Chinese keywords, the longest match length max is set to 5 Chinese characters.
Because matching is done directly against the dictionary, a name may itself be a dictionary entry; for example, a name rendered in this translation as "farsighted" also appears in the dictionary. To reduce such false matches as far as possible, no match is made if the candidate name remaining after deleting the matched content would be shorter than 2 characters.
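The forward-backward keyword matching described above can be sketched as follows. The longest match length of 5 and the rule that at least 2 characters must remain come from the description; the minimum match length of 2 is an assumption (the claims only say single characters are removed from the keyword library), and the function name and the tiny keyword set are hypothetical.

```python
def strip_keywords(candidate, keywords, max_len=5, min_len=2, min_remaining=2):
    """Delete dictionary keywords from both ends of a candidate name."""
    w = candidate
    changed = True
    while changed:  # iterate until W no longer changes
        changed = False
        # Forward maximum matching: try the longest prefix first.
        for n in range(min(max_len, len(w)), min_len - 1, -1):
            if w[:n] in keywords and len(w) - n >= min_remaining:
                w = w[n:]
                changed = True
                break
        # Backward maximum matching: try the longest suffix first.
        for n in range(min(max_len, len(w)), min_len - 1, -1):
            if w[-n:] in keywords and len(w) - n >= min_remaining:
                w = w[:-n]
                changed = True
                break
    return w
```

Refusing a deletion that would leave fewer than 2 characters keeps a short name from being consumed when it happens to be a dictionary word.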
Claims (6)
1. A self-expanding method for recognizing Chinese personal names based on search logs, characterized by comprising:
S11: determining the target corpus, i.e., the set Q of all query strings in the query log that will be used to mine name entities;
S12: building the seed name set C;
S13: generating the candidate name template set Mc from the seed name set C and the set Qc of query strings in which the seed names occur;
S14: screening the name templates according to the trend in the number of query strings each template matches in Qc and in Q, selecting context information suitable for name recognition, to obtain the name template set Mc'; the candidate name template set Mc records the number of query strings each candidate name template matches in the query string set Qc containing the seed names, and its elements have the form <candidate template M, number of query strings matched by candidate template M in the current query string set Qc>;
the templates in the candidate template set Mc are sorted in descending order of the number of query strings they match in Qc to produce the ordered set last, and at the same time in descending order of the number of query strings they match in Q to produce the ordered set now; for any template M, a five-tuple IM is defined, IM = (M, Rank_now, Rank_now_sum, Rank_last, Rank_last_sum), where Rank_now is the rank of template M in the ordered set now, Rank_now_sum is the largest rank number in the ordered set now, Rank_last is the rank of template M in the ordered set last, and Rank_last_sum is the largest rank number in the ordered set last; a concept, discrimination, is defined to describe how well a template distinguishes names and is used to screen the candidate name templates; the discrimination is calculated as:
[formula not reproduced in this text]
S15: matching the templates of the name template set Mc' against the query string set Q to generate the candidate name set CN;
S16: screening the candidate names, removing interfering noise, to obtain a name set N of higher credibility;
S17: updating the seed name set C with the name set N and iterating the five steps S12 to S16 until the credibility of the obtained names reaches a certain threshold.
2. The self-expanding method for recognizing Chinese personal names based on search logs according to claim 1, characterized in that the self-expanding recognition process is an iterative process, and one iteration is one execution of steps S12 to S16.
3. The self-expanding method for recognizing Chinese personal names based on search logs according to claim 1, characterized in that the candidate names are screened with the "forward-backward keyword matching method", which deletes from each candidate name string the substrings that appear in the keyword library.
4. The self-expanding method for recognizing Chinese personal names based on search logs according to claim 1 or claim 3, characterized in that the keyword library is built from the People's Daily word-segmented corpus by removing personal names, foreign-language words, and single characters, and adding place names of China's provinces, cities, districts, and counties.
5. The self-expanding method for recognizing Chinese personal names based on search logs according to claim 4, characterized in that keyword matching is carried out from the forward and backward directions simultaneously, and the maximum match length and minimum match length are set according to the characteristics of the corpus.
6. The self-expanding method for recognizing Chinese personal names based on search logs according to claim 1 or claim 2, characterized in that, in the next iteration, the current name set N is used to update the seed name set C, and the top n templates are extracted from the name template set Mc' as the seed templates of the next iteration.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210539985.6A CN103870489B (en) | 2012-12-13 | 2012-12-13 | Self-expanding recognition method for Chinese personal names based on search logs |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210539985.6A CN103870489B (en) | 2012-12-13 | 2012-12-13 | Self-expanding recognition method for Chinese personal names based on search logs |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103870489A CN103870489A (en) | 2014-06-18 |
CN103870489B true CN103870489B (en) | 2016-12-21 |
Family
ID=50909032
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210539985.6A Expired - Fee Related CN103870489B (en) | Self-expanding recognition method for Chinese personal names based on search logs | 2012-12-13 | 2012-12-13 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103870489B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10839441B2 (en) * | 2014-06-09 | 2020-11-17 | Ebay Inc. | Systems and methods to seed a search |
CN106156056B (en) * | 2015-03-27 | 2020-03-06 | 联想(北京)有限公司 | Text mode learning method and electronic equipment |
CN105335351B (en) * | 2015-10-27 | 2018-08-28 | 北京信息科技大学 | A kind of synonym automatic mining method based on patent search daily record user behavior |
CN111859967B (en) * | 2020-06-12 | 2024-04-09 | 北京三快在线科技有限公司 | Entity identification method and device and electronic equipment |
CN113158671B (en) * | 2021-03-25 | 2023-08-11 | 胡明昊 | Open domain information extraction method combined with named entity identification |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6029123A (en) * | 1994-12-13 | 2000-02-22 | Canon Kabushiki Kaisha | Natural language processing system and method for expecting natural language information to be processed and for executing the processing based on the expected information |
CN102708100A (en) * | 2011-03-28 | 2012-10-03 | 北京百度网讯科技有限公司 | Method and device for digging relation keyword of relevant entity word and application thereof |
CN102722525A (en) * | 2012-05-15 | 2012-10-10 | 北京百度网讯科技有限公司 | Methods and systems for establishing language model of address book names and searching voice |
Also Published As
Publication number | Publication date |
---|---|
CN103870489A (en) | 2014-06-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111782965B (en) | Intention recommendation method, device, equipment and storage medium | |
CN110245981B (en) | Crowd type identification method based on mobile phone signaling data | |
CN107609121B (en) | News text classification method based on LDA and word2vec algorithm | |
JP5092165B2 (en) | Data construction method and system | |
CN105045875B (en) | Personalized search and device | |
CN105843850B (en) | Search optimization method and device | |
CN103870489B (en) | Self-expanding recognition method for Chinese personal names based on search logs | |
US11449564B2 (en) | System and method for searching based on text blocks and associated search operators | |
CN107423820B (en) | Knowledge graph representation learning method combined with entity hierarchy categories | |
CN101404033A (en) | Automatic generation method and system for noumenon hierarchical structure | |
CN104516903A (en) | Keyword extension method and system and classification corpus labeling method and system | |
CN107291895B (en) | Quick hierarchical document query method | |
CN112287118B (en) | Event mode frequent subgraph mining and prediction method | |
CN106227788A (en) | Database query method based on Lucene | |
CN103778206A (en) | Method for providing network service resources | |
CN110516704A (en) | A kind of MLKNN multi-tag classification method based on correlation rule | |
CN110232126A (en) | Hot spot method for digging and server and computer readable storage medium | |
CN110990676A (en) | Social media hotspot topic extraction method and system | |
CN103034726A (en) | Text filtering system and method | |
CN103761286B (en) | A kind of Service Source search method based on user interest | |
CN102521402B (en) | Text filtering system and method | |
CN103377224A (en) | Method and device for recognizing problem types and method and device for establishing recognition models | |
CN107679209A (en) | Expression formula generation method of classifying and device | |
CN112148735A (en) | Construction method for structured form data knowledge graph | |
CN118445406A (en) | Integration system based on massive polymorphic circuit heritage information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | | Granted publication date: 20161221 |