CN103870489B - Self-extending recognition method for Chinese personal names based on search logs - Google Patents

Info

Publication number
CN103870489B
Authority
CN
China
Prior art keywords
name
template
candidate
query string
log
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201210539985.6A
Other languages
Chinese (zh)
Other versions
CN103870489A (en
Inventor
吕学强
文彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Information Science and Technology University
Original Assignee
Beijing Information Science and Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Information Science and Technology University
Priority to CN201210539985.6A
Publication of CN103870489A
Application granted
Publication of CN103870489B
Expired - Fee Related
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the natural language processing field of computational linguistics and discloses a self-extending method for recognizing Chinese personal names in search logs. Following the "surname-driven" approach to name recognition, the method uses the fact that many query strings in a search log begin with a surname to mine seed names. The seed names are used to generate a set of candidate name templates from the search log. The name templates are screened according to how each candidate template's frequency changes between the query strings that contain seed names and the whole query log. Candidate names are generated from the name templates, then bounded and screened by forward-backward keyword matching to obtain a name set. Following the self-extending iteration idea, the name set obtained in the current iteration serves as the seed name set of the next iteration, and the n templates with the highest discrimination in the name template set serve as the seed templates of the next iteration, so that names are mined from the search log. The method builds seed names and generates name templates from features of the search log itself, and filters name templates according to how the name context varies between the corresponding query strings and the query strings of the whole search log, which reduces noise during name recognition and improves the name recognition rate in search logs.

Description

Self-extending recognition method for Chinese personal names based on search logs
Technical field
The invention belongs to the natural language processing field of computational linguistics and, in particular, relates to a self-extending method for recognizing Chinese personal names based on search logs.
Background
With the rapid growth of information on the Internet, search engines have become increasingly important. Chinese search engines have built up a huge user base, process hundreds of millions of requests every day, and have accumulated large-scale query logs. Named entities account for a significant proportion of search logs. According to statistics reported by researchers, 2 to 4% of the web search queries issued each day consist of a single personal name, and about 30% of queries contain a personal name. In one study, researchers annotated 76,717 query strings and found 961 distinct names occurring 6,245 times, accounting for 8.14% of all queries. Faced with rapidly growing data and information, industry and academia are actively seeking effective ways to improve search quality. Effectively recognizing the names in query logs helps, on the one hand, to pinpoint users' search needs and improve search quality and, on the other hand, to obtain newer and more complete name information and expand related resource bases.
Traditional name recognition is mostly performed on plain text, and the recognition algorithms fall broadly into rule-based and statistical methods. Plain text carries rich contextual information, whereas query strings in search logs carry little information, do not follow strict syntactic or grammatical rules, and are highly arbitrary, so name recognition in query logs cannot directly reuse methods from the plain-text domain. Name recognition based on search logs currently follows two main directions: (1) supervised recognition based on large manually annotated corpora; (2) weakly supervised recognition based on a template iteration paradigm. The first approach consumes considerable manpower and resources, and its recognition performance depends on the annotated corpus, which is rather subjective. The second approach produces new candidate names by template iteration, but the chosen templates restrict the categories of potential names, so the recognition results are not accurate enough. The invention therefore provides an improved self-extending name recognition method that extends the name context according to how templates vary across query strings and extracts new candidate names. The method makes the query string contexts obtained by self-extension contain only the target entity category as far as possible, addresses the low efficiency of name recognition in current search logs, and provides a technical reference and practical basis for entity recognition in short texts.
Summary of the invention
The technical problem to be solved by the invention is to provide a self-extending name recognition method for search logs.
To address problems such as the low efficiency of name recognition in current search logs, the invention provides a self-extending recognition method for Chinese personal names based on search logs. The method comprises the following steps:
S11: determine the target corpus, that is, the set Q of all query strings in the query log that will be mined for name entities;
S12: build the seed name set C;
S13: generate the candidate name template set Mc from the seed name set C and the set Qc of query strings that contain the seed names;
S14: screen the name templates according to how the number of query strings each template matches changes between Qc and Q, selecting context suitable for name recognition, to obtain the name template set Mc';
S15: match the templates of the name template set Mc' against the query string set Q to generate the candidate name set CN;
S16: screen the candidate names and remove interfering noise to obtain a name set N of higher credibility;
S17: update the seed name set C with the name set N and iterate the five steps S12 to S16 until the credibility of the obtained names reaches a given threshold.
In S12, the seed names are built by borrowing the "surname-driven" name recognition idea from the plain-text domain. Because a considerable share of the names in a query log appear at the beginning of a query string, candidate seed names are discovered automatically, driven by the surname in the first character. Each candidate seed name is then given a confidence score by counting how many other candidate seed names are matched by the templates in which it occurs, and the seed name set C is selected on that basis.
In S13, each element of the candidate name template set Mc takes the form <candidate template M, number of query strings that M matches in the current query string set Qc>.
In one iteration of name recognition, the templates in the candidate template set Mc are sorted in descending order of the number of query strings they match in Qc to produce the ordered set last, and in descending order of the number of query strings they match in Q to produce the ordered set now. For any template M, a five-tuple IM is defined, IM = (M, Rank_now, Rank_now_sum, Rank_last, Rank_last_sum), where Rank_now is the rank of M in the ordered set now, Rank_now_sum is the largest rank number in now, Rank_last is the rank of M in the ordered set last, and Rank_last_sum is the largest rank number in last.
Discrimination is defined to describe how well a template distinguishes names. For a template M, its credibility is judged from the change in its relative rank between the ordered sets last and now, so the discrimination is calculated as follows:
$$r_{div} = \frac{\mathrm{Rank\_now} / \mathrm{Rank\_now\_sum}}{\mathrm{Rank\_last} / \mathrm{Rank\_last\_sum}}$$
The templates are sorted in ascending order of discrimination value. In the self-extending recognition stage, each iteration selects the first n templates as the seed templates of the next iteration, yielding the name template set Mc'.
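As a purely illustrative calculation (the rank values are invented for this example and do not come from the patent), suppose a template M is ranked 2nd out of 40 in the ordered set now and 10th out of 20 in the ordered set last. Its discrimination would then be:

```latex
r_{div} = \frac{2 / 40}{10 / 20} = \frac{0.05}{0.5} = 0.1
```

A second hypothetical template ranked 30th out of 40 in now and 3rd out of 20 in last would score (30/40)/(3/20) = 0.75/0.15 = 5, so under the ascending sort it would be placed after M.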
Discrimination picks out the extended name templates, but the content matched by a template may still include non-names. The candidate names obtained by template matching therefore need further boundary definition and screening to ensure the accuracy of the recognized names. Based on the characteristics of candidate names in query strings, a "forward-backward keyword matching method" is designed to define the boundaries of candidate names and thereby bound, filter, and screen them.
The forward-backward keyword matching method first requires a keyword dictionary. Any candidate name can be written as W = {W1 … Wi … Wn}, where Wi is a Chinese character. The method is as follows: perform forward maximum matching starting from the first character and, if there is a match, delete the matched content from W and update W; then perform backward maximum matching starting from the last character and, if there is a match, delete the matched content from W and update W. Repeat this forward-backward matching until W no longer changes. The maximum match length max and the minimum match length min can be set according to the characteristics of candidate names.
The self-extending recognition method for Chinese personal names based on search logs provided by the technical solution takes into account that query strings in search logs carry little information and have irregular content. It mines seed names from the search log using the "surname-driven" name recognition idea from the plain-text domain, builds name templates from the seed names, and screens the templates according to how their rank changes between the query strings containing seed names and the query strings of the whole target corpus. Using the resulting name templates and the forward-backward keyword matching designed here, candidate names are bounded and filtered. Applied iteratively under the self-extending idea, the method recognizes Chinese personal names in search logs, reduces noise during name recognition, and improves the recognition rate.
Brief description of the drawings
Fig. 1 is a flow chart of the core technique of the self-extending recognition method for Chinese personal names based on search logs provided by an embodiment of the invention.
Fig. 2 is a flow chart of seed name discovery in the self-extending recognition method for Chinese personal names based on search logs provided by an embodiment of the invention.
Detailed description of the invention
To meet the current demand for precise search and to solve the name recognition problem in retrieval queries, an embodiment of the invention provides a method for recognizing Chinese personal names in search logs. Following the self-extending recognition idea, the method builds name templates from seed names, screens name contexts according to how each template's rank changes between the query strings containing seed names and the query strings of the whole target corpus, and applies pattern matching to bound candidate names, which reduces noise during name recognition and improves the recognition rate.
To make the purpose, technical approach, and advantages of the embodiments of the invention clearer, the technical solution provided by the embodiments is described in detail below with reference to the drawings.
Fig. 1 shows the core flow of the self-extending recognition method for Chinese personal names based on search logs in an embodiment of the invention. The target corpus Q (the search log query strings) is selected (S11). Using the seed name discovery flow shown in Fig. 2 and the "surname-driven" idea, the seed name set C is mined from the target corpus Q (S12). The query strings Qc that contain seed names are obtained, and the candidate name template set Mc is generated from them (S13). The discrimination of each template is computed from the different trends of its frequency in Q and Qc, and the candidate name template set is screened (S14). The resulting name templates are matched in Q to generate the candidate name set CN (S15). Because candidate names contain some non-name information, forward-backward keyword matching is designed to bound and screen the candidate names (S16), yielding the name set N. The seed name set C is updated with N, and the five steps S12 to S16 are iterated until the names found in Q reach a set threshold.
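The outer iteration can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `self_extending_loop`, the callable `run_iteration` (standing in for one pass of template generation, screening, and candidate filtering), the `target_size` reading of the "set threshold", and the `max_iterations` safety cap are all assumptions introduced here.

```python
from typing import Callable, Iterable, List, Set


def self_extending_loop(
    queries: Iterable[str],
    seed_names: Set[str],
    run_iteration: Callable[[List[str], Set[str]], Set[str]],
    target_size: int,
    max_iterations: int = 10,
) -> Set[str]:
    """Feed the names found in each pass back in as seeds for the next pass."""
    query_list = list(queries)
    names = set(seed_names)
    for _ in range(max_iterations):
        # One pass: templates from current seeds, screening, candidate filtering.
        found = run_iteration(query_list, names)
        if found <= names:
            # Nothing new was learned, so further passes would not change N.
            break
        names |= found
        if len(names) >= target_size:
            # Assumed interpretation of "reach a set threshold".
            break
    return names
```

Passing the per-iteration work in as a callable keeps the loop independent of how templates and candidates are actually computed.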
Fig. 2 shows the seed name discovery flow of the self-extending recognition method for Chinese personal names based on search logs in an embodiment of the invention. The "surname-driven" idea relies on a list of Chinese surnames: the frequency with which each surname appears as the first character of a query string is counted, the 4 surnames with the highest first-character frequency are selected as seed surnames, and query strings of length 2 or 3 whose first character is a seed surname are taken as seed names.
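The seed discovery rule just described can be sketched in a few lines. The function name `find_seed_names`, the `top_k` parameter, and the restriction to single-character surnames are assumptions made for this illustration; the confidence evaluation of seed names mentioned in the summary is not included.

```python
from collections import Counter
from typing import Iterable, Set


def find_seed_names(queries: Iterable[str], surnames: Set[str], top_k: int = 4) -> Set[str]:
    """Pick short queries that start with one of the most frequent leading surnames."""
    query_list = list(queries)
    # Count how often each known surname is the first character of a query.
    first_char_counts = Counter(q[0] for q in query_list if q and q[0] in surnames)
    # Keep the top_k surnames by first-character frequency as seed surnames.
    seed_surnames = {surname for surname, _ in first_char_counts.most_common(top_k)}
    # A seed name is a whole query of length 2 or 3 starting with a seed surname.
    return {q for q in query_list if len(q) in (2, 3) and q[0] in seed_surnames}
```

For instance, with a hypothetical log in which "王" and "李" are the two most frequent leading surnames and `top_k=2`, a three-character query such as "王小丫" becomes a seed while the longer query "王小丫简历" does not.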
When candidate name templates are generated in S13, the number of query strings that generate each template is recorded, giving the candidate name template set Mc, whose elements are: <candidate template M, number of query strings that M matches in the current query string set Qc>.
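One plausible way to build these <template, count> pairs is sketched below. It assumes that a template is formed by replacing the seed name inside a query string with the placeholder ".+" (the notation used in the ".+ resume" example); the function name `generate_candidate_templates` and the sample data in the usage note are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set

NAME_SLOT = ".+"  # placeholder marking where the name sat in the query


def generate_candidate_templates(queries: Iterable[str], seed_names: Set[str]) -> Dict[str, int]:
    """Map each candidate template to the number of seed-bearing queries that produced it."""
    template_counts: Counter = Counter()
    for query in queries:
        for name in seed_names:
            # Queries that are only the bare name carry no context, so skip them.
            if name in query and query != name:
                prefix, _, suffix = query.partition(name)
                template_counts[prefix + NAME_SLOT + suffix] += 1
    return dict(template_counts)
```

With seeds {"王小丫", "李娜"} and the queries "王小丫简历", "李娜简历", and "央视王小丫", this would yield the template ".+简历" with count 2 and "央视.+" with count 1.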
The templates in the candidate template set Mc are sorted in descending order of the number of query strings they match in Qc to produce the ordered set last, and in descending order of the number of query strings they match in Q to produce the ordered set now. For any template M, a five-tuple IM is defined, IM = (M, Rank_now, Rank_now_sum, Rank_last, Rank_last_sum), where Rank_now is the rank of M in the ordered set now, Rank_now_sum is the largest rank number in now, Rank_last is the rank of M in the ordered set last, and Rank_last_sum is the largest rank number in last.
The ability of each template to distinguish names is calculated and used to screen the templates. The discrimination is calculated as:
$$r_{div} = \frac{\mathrm{Rank\_now} / \mathrm{Rank\_now\_sum}}{\mathrm{Rank\_last} / \mathrm{Rank\_last\_sum}}$$
The templates are sorted in ascending order of discrimination value, and the first 50 templates are selected as seed templates for the next iteration, yielding the name template set Mc'.
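The ranking and selection step can be sketched as follows. The text does not say how ties are ranked; this sketch assumes that templates with equal counts share a rank (so the largest rank number can differ between the two ordered sets). The function names and the toy counts in the usage note are hypothetical.

```python
from typing import Dict, List, Tuple


def dense_ranks(counts: Dict[str, int]) -> Tuple[Dict[str, int], int]:
    """Rank by descending count; equal counts share a rank (an assumption)."""
    distinct = sorted(set(counts.values()), reverse=True)
    rank_of = {count: i + 1 for i, count in enumerate(distinct)}
    return {t: rank_of[c] for t, c in counts.items()}, len(distinct)


def discrimination(counts_qc: Dict[str, int], counts_q: Dict[str, int]) -> Dict[str, float]:
    """Compute r_div = (Rank_now / Rank_now_sum) / (Rank_last / Rank_last_sum)."""
    rank_last, last_sum = dense_ranks(counts_qc)
    rank_now, now_sum = dense_ranks({t: counts_q[t] for t in counts_qc})
    return {
        t: (rank_now[t] / now_sum) / (rank_last[t] / last_sum)
        for t in counts_qc
    }


def select_seed_templates(counts_qc: Dict[str, int], counts_q: Dict[str, int], n: int = 50) -> List[str]:
    """Sort templates by ascending r_div and keep the first n."""
    r_div = discrimination(counts_qc, counts_q)
    return sorted(r_div, key=lambda t: (r_div[t], t))[:n]
```

With toy counts `{"A": 5, "B": 3, "C": 1}` in Qc and `{"A": 100, "B": 10, "C": 50}` in Q, the values are 1.0 for A, 1.5 for B, and about 0.67 for C, so `select_seed_templates(..., n=2)` returns C and then A.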
The templates in the name template set Mc' are matched one by one against the query string set Q to generate the candidate name set CN.
Although discrimination selects the extended templates, the content matched by a template may still include non-names. Take the template ".+ resume" as an example: the query string "how to write a resume" matches this template, but "how to write" is obviously not a name. For the query string "CCTV Wang Xiaoya resume", the candidate name obtained by template matching is "CCTV Wang Xiaoya", which includes "CCTV", a modifier of "Wang Xiaoya". The candidate names obtained by template matching therefore need further boundary definition and screening to ensure the accuracy of the recognized names.
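The matching step, and the noise it lets through, can be illustrated with a small sketch. It assumes each template contains exactly one ".+" placeholder and is matched against the whole query string; the function name `extract_candidates` is hypothetical, and the Chinese strings in the usage note are an assumed back-translation of the examples above.

```python
import re
from typing import Iterable, Set


def extract_candidates(queries: Iterable[str], templates: Iterable[str]) -> Set[str]:
    """Return the text captured by the '.+' slot of every template that matches a query."""
    query_list = list(queries)
    candidates: Set[str] = set()
    for template in templates:
        # Split around the placeholder and escape the literal context on both sides.
        prefix, suffix = template.split(".+", 1)
        pattern = re.compile(re.escape(prefix) + "(.+)" + re.escape(suffix))
        for query in query_list:
            match = pattern.fullmatch(query)
            if match:
                candidates.add(match.group(1))
    return candidates
```

Matching the template ".+简历" against "怎么写简历" and "央视王小丫简历" yields the candidates "怎么写" and "央视王小丫", which is exactly the kind of noise the subsequent boundary definition step is meant to handle.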
Because query strings in a query log are short, do not follow general syntactic rules, and mostly consist of several words joined together, the context of a name in a query string mostly involves attribute information related to the name, such as forms of address, places, and occupations. A "forward-backward keyword matching method" is therefore designed, with a keyword dictionary built from the People's Daily word-segmentation corpus plus place names of China's provinces, cities, districts, and counties.
For a candidate name W = {W1 … Wi … Wn}, Wi denotes one Chinese character. Forward maximum matching against the dictionary starts from the first character; if there is a match, the matched content is deleted from W and W is updated. Backward maximum matching then starts from the last character; if there is a match, the matched content is deleted from W and W is updated. The forward-backward matching process is repeated until W no longer changes. Considering the typical length of Chinese keywords, the maximum match length max is set to 5 Chinese characters.
Because matching is done directly against the dictionary, a name may itself be a dictionary entry; for example, a name rendered here as "farsighted" also appears in the dictionary as a word. To reduce such false matches as far as possible, a match is not applied if the candidate name remaining after deleting the matched content would be shorter than 2 characters.
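The forward-backward keyword matching described above can be sketched as follows. The function name `strip_keywords`, the default `min_len=2`, and the tiny keyword set in the usage note are assumptions for illustration; the maximum length of 5 and the rule that at least 2 characters must remain come from the text.

```python
from typing import Set


def strip_keywords(
    candidate: str,
    keywords: Set[str],
    max_len: int = 5,
    min_len: int = 2,
    min_name_len: int = 2,
) -> str:
    """Repeatedly trim dictionary keywords from both ends of a candidate name."""
    w = candidate
    changed = True
    while changed:
        changed = False
        # Forward maximum matching: try the longest prefix first.
        for length in range(min(max_len, len(w)), min_len - 1, -1):
            if w[:length] in keywords and len(w) - length >= min_name_len:
                w = w[length:]
                changed = True
                break
        # Backward maximum matching: try the longest suffix first.
        for length in range(min(max_len, len(w)), min_len - 1, -1):
            if w[-length:] in keywords and len(w) - length >= min_name_len:
                w = w[:-length]
                changed = True
                break
    return w
```

With the hypothetical keyword set {"央视", "主持人"}, the candidate "央视王小丫" is trimmed to "王小丫", while a two-character candidate that is itself a dictionary entry is left untouched because nothing of sufficient length would remain.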

Claims (6)

1. A self-extending recognition method for Chinese personal names based on search logs, characterized by comprising:
S11: determining the target corpus, that is, the set Q of all query strings in the query log that will be mined for name entities;
S12: building the seed name set C;
S13: generating the candidate name template set Mc from the seed name set C and the set Qc of query strings that contain the seed names;
S14: screening the name templates according to how the number of query strings each template matches changes between Qc and Q, selecting context suitable for name recognition, to obtain the name template set Mc'; the candidate name template set Mc records, for each candidate name template, the number of query strings it matches in the set Qc of query strings containing seed names, and its elements take the form <candidate template M, number of query strings that M matches in the current query string set Qc>;
the templates in the candidate template set Mc are sorted in descending order of the number of query strings they match in Qc to produce the ordered set last, and in descending order of the number of query strings they match in Q to produce the ordered set now; for any template M, a five-tuple IM is defined, IM = (M, Rank_now, Rank_now_sum, Rank_last, Rank_last_sum), where Rank_now is the rank of M in the ordered set now, Rank_now_sum is the largest rank number in now, Rank_last is the rank of M in the ordered set last, and Rank_last_sum is the largest rank number in last; the concept of discrimination is defined to describe how well a template distinguishes names and is used to screen the candidate name templates; the discrimination is calculated as:
$$r_{div} = \frac{\mathrm{Rank\_now} / \mathrm{Rank\_now\_sum}}{\mathrm{Rank\_last} / \mathrm{Rank\_last\_sum}};$$
S15: matching the templates of the name template set Mc' against the query string set Q to generate the candidate name set CN;
S16: screening the candidate names and removing interfering noise to obtain a name set N of higher credibility;
S17: updating the seed name set C with the name set N and iterating the five steps S12 to S16 until the credibility of the obtained names reaches a given threshold.
2. The self-extending recognition method for Chinese personal names based on search logs according to claim 1, characterized in that the self-extending recognition process is iterative, one iteration being the execution of step S12 through step S16.
3. The self-extending recognition method for Chinese personal names based on search logs according to claim 1, characterized in that the candidate names are screened by the "forward-backward keyword matching method", which deletes from the candidate name string any substrings found in the keyword dictionary.
4. The self-extending recognition method for Chinese personal names based on search logs according to claim 1 or claim 3, characterized in that the keyword dictionary is built from the People's Daily word-segmentation corpus by removing personal names, foreign words, and single characters and adding place names of China's provinces, cities, districts, and counties.
5. The self-extending recognition method for Chinese personal names based on search logs according to claim 4, characterized in that keyword matching proceeds in both the forward and backward directions, and the maximum match length and minimum match length are set according to the characteristics of the corpus.
6. The self-extending recognition method for Chinese personal names based on search logs according to claim 1 or claim 2, characterized in that, for the next iteration, the seed name set C is updated with the current name set N, and the first n templates of the name template set Mc' are extracted as the seed templates of the next iteration.
CN201210539985.6A 2012-12-13 2012-12-13 Self-extending recognition method for Chinese personal names based on search logs Expired - Fee Related CN103870489B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210539985.6A CN103870489B (en) 2012-12-13 2012-12-13 Self-extending recognition method for Chinese personal names based on search logs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210539985.6A CN103870489B (en) 2012-12-13 2012-12-13 Self-extending recognition method for Chinese personal names based on search logs

Publications (2)

Publication Number Publication Date
CN103870489A CN103870489A (en) 2014-06-18
CN103870489B CN103870489B (en) 2016-12-21

Family

ID=50909032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210539985.6A Expired - Fee Related CN103870489B (en) 2012-12-13 2012-12-13 Self-extending recognition method for Chinese personal names based on search logs

Country Status (1)

Country Link
CN (1) CN103870489B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10839441B2 (en) * 2014-06-09 2020-11-17 Ebay Inc. Systems and methods to seed a search
CN106156056B (en) * 2015-03-27 2020-03-06 联想(北京)有限公司 Text mode learning method and electronic equipment
CN105335351B (en) * 2015-10-27 2018-08-28 北京信息科技大学 A kind of synonym automatic mining method based on patent search daily record user behavior
CN111859967B (en) * 2020-06-12 2024-04-09 北京三快在线科技有限公司 Entity identification method and device and electronic equipment
CN113158671B (en) * 2021-03-25 2023-08-11 胡明昊 Open domain information extraction method combined with named entity identification

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6029123A (en) * 1994-12-13 2000-02-22 Canon Kabushiki Kaisha Natural language processing system and method for expecting natural language information to be processed and for executing the processing based on the expected information
CN102708100A (en) * 2011-03-28 2012-10-03 北京百度网讯科技有限公司 Method and device for digging relation keyword of relevant entity word and application thereof
CN102722525A (en) * 2012-05-15 2012-10-10 北京百度网讯科技有限公司 Methods and systems for establishing language model of address book names and searching voice

Also Published As

Publication number Publication date
CN103870489A (en) 2014-06-18

Similar Documents

Publication Publication Date Title
CN111782965B (en) Intention recommendation method, device, equipment and storage medium
CN110245981B (en) Crowd type identification method based on mobile phone signaling data
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
JP5092165B2 (en) Data construction method and system
CN105045875B (en) Personalized search and device
CN105843850B (en) Search optimization method and device
CN103870489B (en) Chinese personal name based on search daily record is from extending recognition methods
US11449564B2 (en) System and method for searching based on text blocks and associated search operators
CN107423820B (en) Knowledge graph representation learning method combined with entity hierarchy categories
CN101404033A (en) Automatic generation method and system for noumenon hierarchical structure
CN104516903A (en) Keyword extension method and system and classification corpus labeling method and system
CN107291895B (en) Quick hierarchical document query method
CN112287118B (en) Event mode frequent subgraph mining and prediction method
CN106227788A (en) Database query method based on Lucene
CN103778206A (en) Method for providing network service resources
CN110516704A (en) A kind of MLKNN multi-tag classification method based on correlation rule
CN110232126A (en) Hot spot method for digging and server and computer readable storage medium
CN110990676A (en) Social media hotspot topic extraction method and system
CN103034726A (en) Text filtering system and method
CN103761286B (en) A kind of Service Source search method based on user interest
CN102521402B (en) Text filtering system and method
CN103377224A (en) Method and device for recognizing problem types and method and device for establishing recognition models
CN107679209A (en) Expression formula generation method of classifying and device
CN112148735A (en) Construction method for structured form data knowledge graph
CN118445406A (en) Integration system based on massive polymorphic circuit heritage information

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20161221