CN104750819B

CN104750819B - Biomedical literature retrieval method and system based on a word ranking algorithm

Info

Publication number: CN104750819B
Application number: CN201510147696.5A
Authority: CN
Inventors: Xu Bo (徐博); Lin Hongfei (林鸿飞)
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2015-03-31
Filing date: 2015-03-31
Publication date: 2018-01-23
Anticipated expiration: 2035-03-31
Also published as: CN104750819A

Abstract

A biomedical literature retrieval method and system based on a word ranking algorithm. The retrieval method comprises: S1, a search engine query extraction step; S2, a candidate expansion term extraction step; S3, a feature extraction and labeling step for the candidate expansion terms; S4, a training step for the candidate expansion term ranking model; S5, an online search engine query and extraction step; S6, an online candidate expansion term retrieval, feature extraction and scoring step; and S7, a query result return step. The retrieval system comprises a search engine query extraction module, a candidate expansion term extraction module, a feature extraction and labeling module for the candidate expansion terms, a candidate expansion term ranking model training module, a query reconstruction module and a query result return module. Approaching retrieval from the angle of query expansion, the invention uses the word ranking algorithm together with dictionary resources native to the biomedical domain to select, during query expansion, the technical terms that best express the user's information need, thereby completing the retrieval task and improving retrieval performance.

Description

A biomedical literature retrieval method and system based on a word ranking algorithm

Technical field

The present invention relates to the technical field of data mining and search engines, and in particular to a biomedical literature retrieval method and system based on a word ranking algorithm.

Background art

In recent years, with the rapid development of the field of biomedicine, biomedical research has produced many valuable results. These results not only help to treat diseases that once seemed incurable but, in the longer view, also advance and deepen humanity's understanding of itself.

However, as the volume of biomedical literature grows rapidly, the amount of related information is growing exponentially. This mass of documents and information makes it difficult for biomedical researchers and practitioners to obtain the information they need, and the traditional manual way of acquiring information is becoming impractical. Information retrieval techniques and methods are therefore needed to help these users obtain the required information.

Traditional information retrieval techniques rank documents or web pages by relevance to the query submitted by the user and return the ranked results to the user. When traditional retrieval methods are applied directly to biomedical literature retrieval, however, it is difficult to achieve good retrieval performance. The reason is that they do not adequately take into account the inherent characteristics of the biomedical domain: for example, the domain has a large technical vocabulary, and these technical terms often have many synonyms and abbreviations. If traditional retrieval methods took these characteristics fully into account, the performance of biomedical information retrieval could be further improved.

Query expansion is one of the key techniques in traditional information retrieval. Starting from the original query submitted by the user, it supplements and refines the query according to the user's retrieval intent, producing a query that better matches that intent and improving retrieval performance. Existing query expansion methods fall into two broad classes. The first class is based on the document collection: taking all or part of the document collection as the object of study, it extracts content related to the query and uses it to improve the original query. The second class is based on external expansion resources, which mainly include dictionary resources, retrieval system query logs, anchor text and Wikipedia. Many studies show that improving the original query with external expansion resources accomplishes the query expansion task better and in turn improves retrieval performance.

Because the biomedical domain has many domain resources such as dictionaries, retrieval performance is very likely to improve if these resources are fully used during retrieval to supplement and refine the query submitted by the user.

To build a literature retrieval system for the biomedical domain, one must first understand the characteristics and resources of the domain. Biomedical documents contain a large number of technical terms, and these terms involve complications such as many synonyms and abbreviations, which poses a major challenge for building a retrieval system. For example, the same analgesic drug is known in English as paracetamol, is listed in the international standard drug classification as acetaminophen, and is referred to in medicinal chemistry as C8H9NO2 or N02BE01. Given this variety of names, a search that uses only one of them is unlikely to retrieve all of the relevant documents. Fortunately, the biomedical domain also has many native knowledge bases and resources, such as Medical Subject Headings (MeSH) and the Gene Ontology (GO). If these resources are fully used during retrieval, the performance of biomedical literature retrieval can be greatly improved.

Learning to rank is the general name for a family of supervised learning algorithms used to rank documents in information retrieval. Its main characteristic is that it uses machine learning techniques to solve the ranking problem in information retrieval, and it achieves good ranking performance. Because a ranking problem can also be viewed as the problem of selecting the best item, learning-to-rank algorithms have in recent years been applied to a number of other tasks, for example recommending items to users in a recommender system according to the history of users and items.

Summary of the invention

The object of the invention is to provide a biomedical literature retrieval method and system based on a word ranking algorithm that can provide users with more accurate biomedical literature, satisfy their information needs more effectively, and effectively supplement and refine their queries.

The technical solution adopted by the invention to solve the problems of the prior art is a biomedical literature retrieval method based on a word ranking algorithm, comprising an offline training stage and an online query stage, wherein the offline training stage comprises the following steps:

S1, search engine query extraction step: from the historical query records of a search engine, extract multiple queries and the top N query result documents obtained for each query, and collect the queries and their result documents into a query pool, where N is a natural number;

S2, candidate expansion term extraction step: using a biomedical resource, extract the technical terms in the top N query result documents of each query in the query pool, and compute for each technical term its number of occurrences, or the weighted sum of its occurrence counts, in those result documents; sort the technical terms in descending order of occurrence count or of weighted sum, and select the M technical terms with the highest occurrence count or the highest weighted sum as candidate expansion terms, where M is a natural number;

S3, feature extraction and labeling step for the candidate expansion terms:

Feature extraction and labeling of the candidate expansion terms are carried out together. The relevance label of a candidate expansion term is assigned by comparing the retrieval performance of the original query with the retrieval performance obtained when the candidate expansion term is added to the original query. Evaluation metrics for retrieval performance include precision, average precision, NDCG and MRR. The relevance label is assigned as follows:

$$label(term) = \begin{cases} 1, & eval(query + term) > eval(query) \\ 0, & \text{otherwise} \end{cases}$$

where eval() is the evaluation metric function used to measure retrieval performance, eval(query + term) is the score of eval() when the candidate expansion term term is added to the query query, and eval(query) is the score of eval() for the query query alone; a label of 1 indicates that the candidate expansion term is relevant to the query, and a label of 0 indicates that it is not;

Feature extraction for the candidate expansion terms draws on the biomedical resource and on the top N query result documents returned for each query in the query pool. It extracts the distribution of each candidate expansion term in those result documents, its distribution in the biomedical resource, and information on its relevance to the original query, in preparation for training the ranking model. After the features of a candidate expansion term have been extracted, all feature values are normalized so that they fall in the interval [0, 1]. The normalization is as follows:

$$value' = \frac{value - minValue}{maxValue - minValue}$$

where value is the raw value of a feature, and minValue and maxValue are respectively the minimum and maximum values of that feature;

S4, candidate expansion term ranking model training step: using the relevance labels and the features of the candidate expansion terms, train with the word ranking algorithm to obtain the weight of each feature. Specifically: select one candidate expansion term labeled relevant in step S3 and several candidate expansion terms labeled irrelevant to form a word group, and select a number of such word groups as training samples; randomly assign initial weights to the features of the candidate expansion terms, and rank the candidate expansion terms within each word group by their feature-weighted scores; from the ranking result of each word group, compute the global ranking loss, and adjust the weight of each feature dimension dynamically according to the gradient of the loss function. The ranking loss aggregates the per-group loss values, where NumSample is the number of word groups and loss_i is the loss value of the i-th word group; loss_i is computed from the ranked position of the relevant expansion term, and the higher that position, the smaller the loss. Repeat the above process iteratively until the overall loss falls below a threshold or the specified number of iterations is reached, at which point training is complete; the finally selected feature weights constitute the trained ranking model;

The online query stage comprises the following steps:

S5, online search engine query and extraction step: for a new query submitted online by the user, retrieve the top N1 query results, and use the biomedical resource to extract the technical terms in the top N1 retrieval results together with their features, where N1 is a natural number;

S6, online candidate expansion term retrieval, feature extraction and scoring step: for the new query, apply the candidate expansion term extraction method and the feature extraction method of offline steps S2 and S3, using the biomedical resource, to the technical terms in the top N1 retrieval results, obtaining the online-stage candidate expansion terms and their features; the extracted features measure the importance of a candidate expansion term in the expanded query. Score the online-stage candidate expansion terms with the feature weights trained in step S4, and add the K1 highest-scoring candidate expansion terms to the new query submitted online to form the expanded query, where K1 is a natural number;

For an online-stage candidate expansion term term extracted using the biomedical resource, its final score is

$$FinalScore(term) = \sum_{i=1}^{FeatureNum} a_i \cdot feature_i(term)$$

where FeatureNum is the total number of features, a_i is the weight of the i-th feature in the ranking model, and feature_i(term) is the value of the i-th feature of the online-stage candidate expansion term term;

The online-stage candidate expansion terms are ranked by score, and the K1 highest-ranked terms are selected as expansion terms and added to the new query submitted online. The weight of each added term in the expanded query is expressed using a sign function sign and the value weight_original, where sign = 1 when the candidate expansion term already appears in the new query submitted online and sign = 0 otherwise, and weight_original is the weight of the new query submitted online within the expanded query;

S7, query result return step: perform retrieval with the expanded query and return the retrieval results to the user.

In step S2, the weighted sum of the occurrence counts of a technical term in the query result documents is

$$score(term) = \sum_{i=1}^{N} count_i \cdot d_i$$

where count_i is the number of times the term occurs in the i-th document and d_i is the decay factor of the i-th document.
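As an illustration only, the weighted sum can be sketched in Python under the assumption that it is the sum of count_i multiplied by d_i over the top N documents. The function name weighted_count and the example decay values are hypothetical; the text does not specify how the decay factors are chosen.

```python
from typing import Sequence


def weighted_count(counts: Sequence[int], decays: Sequence[float]) -> float:
    """Sum of count_i * d_i over the top-N result documents.

    counts[i] is the number of occurrences of the term in document i,
    decays[i] is the decay factor of document i (assumed to be supplied
    by the caller; the source does not define how it is computed).
    """
    if len(counts) != len(decays):
        raise ValueError("counts and decays must have the same length")
    return sum(c * d for c, d in zip(counts, decays))


# Hypothetical decay factors that shrink with rank position.
example = weighted_count([2, 0, 4], [1.0, 0.5, 0.25])
```

With these illustrative values, occurrences in the first document count in full, while the four occurrences in the third document contribute only a quarter of their raw count.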

In step S3, the evaluation metric function eval() is the average precision function, that is:

$$eval(query) = \frac{1}{RelDoc_{query}} \sum_{i=1}^{RelDoc_{query}} \frac{i}{rank(i)}$$

where RelDoc_query is the number of documents relevant to the given query query, and rank(i) is the position of the i-th relevant document in the ranked result list.

In step S1, when no historical query records are available, records of queries and their results are built manually by constructing biomedical queries and running a retrieval method; the retrieval method uses a vector space model, the BM25 retrieval model, or a language model with one of several smoothing methods.

In step S4, the loss value loss_i of a word group is computed from rank_i, the position at which the relevant candidate expansion term is ranked in the list for that word group.

The biomedical resource is a dictionary or knowledge base containing biomedical technical terms.

The features of a candidate expansion term include: the frequency TF with which the term occurs in the result documents; the TF-IDF value of the term; the number of documents in which the term co-occurs with the original query; the number of times the term co-occurs with the original query within the same text window; the number of times the term occurs in the biomedical resource; the number of terminological concepts in the biomedical resource that contain the term; and the inclusion relations between biomedical terminological concepts.

A biomedical literature retrieval system based on a word ranking algorithm comprises an offline training part and an online retrieval part; the offline training part comprises the following:

Search engine query extraction module: extracts, from the historical query records of a search engine, multiple queries and the top N query result documents obtained for each query, and collects the queries and their result documents into a query pool, where N is a natural number;

Candidate expansion term extraction module: for a given user query, uses the resources native to the biomedical domain to extract technical terms from the top N query result documents obtained by the search engine query extraction module, and records the frequency of each technical term in the query result documents or the weighted sum of its occurrence counts; sorts the technical terms in descending order of occurrence count or of weighted sum, and selects the M technical terms with the highest occurrence count as candidate expansion terms, where M is a natural number;

Feature extraction and labeling module for the candidate expansion terms: extracts the associated features of the candidate expansion terms obtained by the candidate expansion term extraction module, and labels the relevance of each candidate expansion term according to its effect on retrieval performance;

Candidate expansion term ranking model training module: after the features of the candidate expansion terms have been extracted and their relevance labeled, uses the word ranking algorithm to train the term ranking model and obtain the weight of each feature of the candidate expansion terms. Specifically: select one candidate expansion term labeled relevant by the feature extraction and labeling module and several candidate expansion terms labeled irrelevant to form a word group, and select a number of such word groups as training samples; randomly assign initial weights to the features of the candidate expansion terms, and rank the candidate expansion terms within each word group by their feature-weighted scores; from the ranking result of each word group, compute the global ranking loss, and adjust the weight of each feature dimension dynamically according to the gradient of the loss function. The ranking loss aggregates the per-group loss values, where NumSample is the number of word groups and loss_i is the loss value of the i-th word group; loss_i is computed from the ranked position of the relevant expansion term, and the higher that position, the smaller the loss. The above process is repeated iteratively until the overall loss falls below a threshold or the specified number of iterations is reached, at which point training is complete; the finally selected feature weights constitute the trained ranking model;

The online retrieval part comprises:

Query reconstruction module: extracts the technical terms for a new query and scores the candidate expansion terms. It comprises an online search engine query extraction module and an online candidate expansion term retrieval, feature extraction and scoring module. The online search engine query extraction module retrieves the top N1 query results for a new query submitted online by the user and uses the biomedical resource to extract the technical terms in the top N1 retrieval results together with their features, where N1 is a natural number. The online candidate expansion term retrieval, feature extraction and scoring module uses the candidate expansion term scores output by the term ranking model to compute the corresponding weights and adds the terms to the original query to obtain the expanded query;

Query result return module: returns to the user the result documents retrieved with the expanded query.

The beneficial effects of the invention are as follows. Working mainly from the angle of query expansion, the invention uses the word ranking algorithm together with resources native to the biomedical domain, such as dictionaries, to select during query expansion the technical terms that best express the user's information need. It thereby completes the retrieval task more effectively and provides the user with retrieval results that better match that need; the original query is supplemented and refined with biomedical domain resources, which in turn improves retrieval performance. With the document collection of the TREC Genomics task as the data set and the traditional BM25 retrieval model as the baseline retrieval model, document retrieval achieves a retrieval precision of 25.62%; when the method and system of the invention are applied on top of this baseline, a retrieval precision of 26.30% is obtained, an absolute improvement of 0.68 percentage points. The retrieval system of the invention can thus effectively retrieve the biomedical literature most relevant to the user's query, improving user satisfaction.

Brief description of the drawings

Fig. 1 is a flow chart of the retrieval method of the invention;

Fig. 2 is a schematic diagram of the logical structure of the retrieval system of the invention.

Detailed description of the embodiments

The invention is described below with reference to the drawings and specific embodiments:

Fig. 1 is a flow chart of the biomedical literature retrieval method based on a word ranking algorithm according to the invention. The method comprises an offline training stage and an online query stage, wherein the offline training stage comprises the following steps:

S1, search engine query extraction step: from the historical query records of a search engine, extract multiple queries and the top N query result documents obtained for each query, and collect the queries and their result documents into a query pool, where N is a natural number. In this embodiment, N = 10;

Here, the historical query records of the search engine are mainly the query history, and the corresponding query results, recorded by a retrieval system for biomedical literature. These queries and their results are used to train the ranking model offline.

When no relevant historical query records are available, records of queries and their retrieval results can be built manually by constructing biomedical queries and running retrieval. The retrieval method may use any of the ranking models of traditional information retrieval, including but not limited to the vector space model, the BM25 retrieval model, and language models with different smoothing methods.

S2, candidate expansion term extraction step: using a biomedical resource, extract the technical terms in the top N query result documents of each query in the query pool, and compute for each technical term its number of occurrences, or the weighted sum of its occurrence counts, in those result documents; sort the technical terms in descending order of occurrence count or of weighted sum, and select the M technical terms with the highest occurrence count or the highest weighted sum as candidate expansion terms, where M is a natural number;

Here, a biomedical resource is a resource such as a dictionary or knowledge base containing biomedical technical terms, including but not limited to: Medical Subject Headings (MeSH), the Gene Ontology (GO), and the Metathesaurus, Semantic Network, and SPECIALIST Lexicon and Lexical Tools released by the Unified Medical Language System (UMLS).

Taking MeSH as the biomedical resource used by the invention as an example, the technical terms in the top N query result documents of a query are extracted, and each extracted technical term is associated with its number of occurrences in the documents or with the weighted sum of its occurrence counts. For example, the weighted sum of the occurrence counts of a technical term term in the top N documents is computed as

$$score(term) = \sum_{i=1}^{N} count_i \cdot d_i$$

where count_i is the number of times the term occurs in the i-th document and d_i is the decay factor of the i-th document. The weighted sum weights the term frequencies observed in different documents so that frequencies in higher-ranked documents carry more weight and technical terms contained in lower-ranked documents receive lower scores. The selected technical terms are sorted from high to low either by the unweighted count

$$count(term) = \sum_{i=1}^{N} count_i$$

or by score(term), and the M highest-ranked terms are selected as candidate expansion terms. In this embodiment, M = 150.
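The selection of the M highest-ranked terms can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name select_candidates and the alphabetical tie-break are assumptions, and the per-term values may be either raw counts or weighted sums.

```python
from typing import Dict, List


def select_candidates(term_scores: Dict[str, float], m: int) -> List[str]:
    """Return the m terms with the highest count or weighted-sum value.

    Terms are sorted in descending order of value; ties are broken
    alphabetically (an assumption made only to keep the output stable).
    """
    ranked = sorted(term_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:m]]


# Toy values; the embodiment in the text uses m = 150.
top_terms = select_candidates({"aspirin": 3.0, "fever": 5.0, "liver": 1.0}, 2)
```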

S3, feature extraction and relevance labeling step for the candidate expansion terms:

Feature extraction and labeling of the candidate expansion terms are carried out together. The relevance label of a candidate expansion term is assigned by comparing the retrieval performance of the original query with the retrieval performance obtained when the expansion term is added to the original query. The idea behind the labeling is as follows: a single candidate expansion term is added to the original query and retrieval is performed; if retrieval performance improves, the expansion term is labeled as relevant to the original query. Evaluation metrics for retrieval performance include but are not limited to precision, mean average precision (MAP), NDCG and MRR. The labeling is done as follows:

$$label(term) = \begin{cases} 1, & eval(query + term) > eval(query) \\ 0, & \text{otherwise} \end{cases}$$

where eval() is the evaluation metric function used to measure retrieval performance, eval(query + term) is the score of eval() when the candidate expansion term term is added to the given query query, and eval(query) is the score of eval() for the given query query alone. When the evaluation score of retrieval with the original query plus a candidate term is greater than the evaluation score of retrieval with the original query alone, the candidate expansion term is labeled 1, meaning that the term is relevant to the original query; when it is not greater, the candidate expansion term is labeled 0, meaning that the term is not relevant to the original query.
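The labeling rule can be sketched in Python as follows. The function label_term and the callable evaluate are hypothetical names; evaluate stands in for running retrieval and scoring the result with the chosen metric, and joining query and term with a space is an assumption about how the expanded query is formed.

```python
from typing import Callable


def label_term(query: str, term: str, evaluate: Callable[[str], float]) -> int:
    """Label a candidate expansion term as relevant (1) or not (0).

    The term is relevant only if adding it to the query strictly
    improves the evaluation score of the retrieval result.
    """
    baseline = evaluate(query)
    expanded = evaluate(f"{query} {term}")  # assumed form of query + term
    return 1 if expanded > baseline else 0


# Toy evaluation scores standing in for a real retrieval run.
toy_scores = {"fever": 0.25, "fever aspirin": 0.5, "fever table": 0.25}
label_relevant = label_term("fever", "aspirin", toy_scores.__getitem__)
label_irrelevant = label_term("fever", "table", toy_scores.__getitem__)
```

Note that an unchanged score yields label 0, matching the "not greater" case in the text.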

In this embodiment, the evaluation function eval() is average precision, that is:

$$eval(query) = \frac{1}{RelDoc_{query}} \sum_{i=1}^{RelDoc_{query}} \frac{i}{rank(i)}$$

where RelDoc_query is the number of documents relevant to the given query query, and rank(i) is the position of the i-th relevant document in the ranked result list; for example, rank(3) = 5 means that the third relevant document in the result list appears at position 5 of the ranked list.
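A minimal sketch of average precision, assuming the standard definition in which the i-th relevant document at position rank(i) contributes i / rank(i). The function name average_precision and its parameters are hypothetical; relevant documents that are never retrieved contribute zero.

```python
from typing import Sequence


def average_precision(relevant_ranks: Sequence[int], total_relevant: int) -> float:
    """Average precision for one query.

    relevant_ranks holds the 1-based positions of the retrieved relevant
    documents; total_relevant is the number of relevant documents for the
    query (RelDoc_query in the text).
    """
    if total_relevant <= 0:
        return 0.0
    ranks = sorted(relevant_ranks)
    # The i-th relevant document found at position r has precision i / r.
    return sum(i / r for i, r in enumerate(ranks, start=1)) / total_relevant


# Relevant documents at positions 1, 3 and 5, so rank(3) = 5 as in the text.
ap = average_precision([1, 3, 5], 3)
```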

Feature extraction for the candidate expansion terms draws on the biomedical resource and on the top N query result documents returned for each query in the query pool. It extracts the distribution of each candidate expansion term in those result documents, its distribution in the biomedical resource, and information on its relevance to the original query, in preparation for training the ranking model. After the features of a candidate expansion term have been extracted, all feature values are normalized so that they fall in the interval [0, 1]. The normalization is:

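$$value' = \frac{value - minValue}{maxValue - minValue}$$

where value is the raw value of a feature, and minValue and maxValue are respectively the minimum and maximum values of that feature.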
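The normalization can be sketched as follows, assuming the usual min-max form that maps the smallest value of a feature to 0 and the largest to 1. The function name min_max_normalize is hypothetical, and returning 0.0 for a constant feature is an assumption; the text does not say how that case is handled.

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Scale the values of one feature into the interval [0, 1]."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: avoid division by zero (assumed convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


normalized = min_max_normalize([2.0, 4.0, 6.0])
```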

The features of an expansion term specifically include:

1. The frequency TF with which the candidate expansion term occurs in the result documents. This feature is obtained from the number of occurrences of the technical term term in the result documents.

2. The TF-IDF value of the candidate expansion term. TF-IDF is one of the classical models of information retrieval and can be used to measure the relative importance of a term. It is computed as follows:

$$TFIDF(term) = count(term) \cdot \log \frac{TotalDoc}{df(term)}$$

where count(term) is the number of times the candidate expansion term occurs in a given result document, TotalDoc is the total number of documents in the training data, and df(term) is the number of documents in which the candidate expansion term occurs.

3. The number of documents in which the candidate expansion term co-occurs with the original query. This feature can be used to compute the degree of relevance between the original query and the candidate expansion term.

4. The number of times the candidate expansion term co-occurs with the original query within the same text window. This feature measures the relevance between the query words of the original query and the candidate expansion term within a limited range. The text window is the number of words separating the expansion term from the original query word within a document in which both occur.

5. The number of times the candidate expansion term occurs in a biomedical resource such as MeSH. This feature measures the distribution of the candidate expansion term in the biomedical resource.

6. The number of terminological concepts in a biomedical resource such as MeSH that contain the candidate expansion term. Biomedical terminological concepts often stand in inclusion relations to one another, so this feature likewise measures the importance of a candidate term in the biomedical resource.

Among the candidate expansion term features extracted above, features 1 and 2 measure the distribution of a candidate expansion term in the document collection; features 3 and 4 measure its relevance to the original query; and features 5 and 6 measure its distribution in the biomedical resource. The expansion term features of the invention include but are not limited to those above. Extracted together, these features serve as input to the word ranking algorithm and allow the importance of a candidate expansion term to be measured more accurately.
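Two of these features, the TF-IDF value (feature 2) and the window co-occurrence count (feature 4), can be sketched as follows. Both functions are illustrative assumptions: tf_idf uses the common form count multiplied by the natural logarithm of TotalDoc / df, and window_cooccurrence assumes whitespace tokens, single-token terms and a window measured in token positions.

```python
import math
from typing import Sequence


def tf_idf(count: int, total_docs: int, df: int) -> float:
    """TF-IDF in the assumed form count * ln(total_docs / df)."""
    if df <= 0 or total_docs <= 0:
        raise ValueError("total_docs and df must be positive")
    return count * math.log(total_docs / df)


def window_cooccurrence(tokens: Sequence[str], query_word: str,
                        term: str, window: int) -> int:
    """Count pairs (query_word, term) at most `window` tokens apart."""
    query_positions = [i for i, tok in enumerate(tokens) if tok == query_word]
    term_positions = [i for i, tok in enumerate(tokens) if tok == term]
    return sum(1 for q in query_positions for t in term_positions
               if abs(q - t) <= window)


doc = "aspirin reduces fever and aspirin thins blood".split()
pairs_in_window = window_cooccurrence(doc, "fever", "aspirin", 2)
```

In the toy document, "fever" sits two tokens away from each occurrence of "aspirin", so both pairs fall inside a window of 2 and neither falls inside a window of 1.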

S4, candidate expansion term ranking model training step: taking as input the relevance labels and the features of the candidate expansion terms obtained in step S3, train the ranking model of the word ranking algorithm to obtain the weight of each feature. Specifically: select one candidate expansion term labeled relevant in step S3 (that is, a term whose label is 1) and several candidate expansion terms labeled irrelevant (that is, terms whose label is 0) to form a word group, and select a number of such word groups as training samples; randomly assign initial weights to the features of the candidate expansion terms, and rank the expansion terms within each word group by their feature-weighted scores; from the ranking result of each word group, compute the global ranking loss, and adjust the weight of each feature dimension dynamically according to the gradient of the loss function. The ranking loss aggregates the per-group loss values, where NumSample is the number of word groups and loss_i is the loss value of the i-th word group; loss_i is computed from the ranked position of the relevant expansion term, and the higher that position, the smaller the loss. Repeat the above process iteratively until the overall loss falls below a threshold or the specified number of iterations is reached, at which point training is complete; the finally selected feature weights constitute the trained ranking model. In this embodiment, training terminates after 100 iterations.

In this embodiment the loss value is: [equation not reproduced] where rank_i is the position of the relevant candidate expansion term in the ranked list of its word group. The loss is 0 when the term ranks first and reaches its maximum when the term ranks last. The loss value may also be computed with formulas other than this one.
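The loss described above can be illustrated with a short Python sketch. The exact formulas are not reproduced in this text, so the sketch assumes one form that satisfies the stated properties (0 when the relevant term ranks first, maximal when it ranks last), namely (rank - 1) / (group size - 1), and assumes the global loss is the sum of the group losses. The function names `group_loss` and `total_loss` are hypothetical.

```python
from typing import List, Sequence, Tuple


def group_loss(relevant_score: float, irrelevant_scores: Sequence[float]) -> float:
    """Loss of one word group (one relevant term plus several irrelevant terms).

    ASSUMED form: (rank - 1) / (group_size - 1), which is 0 when the relevant
    term ranks first and 1 when it ranks last. The patent's own formula may differ.
    """
    # Rank of the relevant term: 1 + number of irrelevant terms scoring higher.
    rank = 1 + sum(1 for s in irrelevant_scores if s > relevant_score)
    group_size = 1 + len(irrelevant_scores)
    if group_size == 1:
        return 0.0  # a group with no irrelevant terms cannot be mis-ranked
    return (rank - 1) / (group_size - 1)


def total_loss(groups: List[Tuple[float, Sequence[float]]]) -> float:
    """ASSUMED global loss: the sum of the losses of all NumSample word groups."""
    return sum(group_loss(rel, irr) for rel, irr in groups)
```

A training loop would recompute the term scores with the current feature weights, evaluate `total_loss`, and stop once it falls below a threshold or the iteration limit (100 in this embodiment) is reached.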

In the ranking model, the final score of an expansion term is computed as follows:

$$score(term) = \sum_{i=1}^{FeatureNum} a_i \cdot feature_i(term)$$

where FeatureNum is the total number of features, a_i is the weight of the i-th feature, and feature_i(term) is the value of the i-th feature of the candidate term term. The ranking model trained in this way can then be used to select relevant expansion terms for test queries. All of the above steps are performed offline.
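As an illustration of the scoring step, the following Python sketch computes a feature-weighted score for each candidate expansion term and sorts the candidates by it. It assumes the score is the linear weighted sum of the normalized feature values; the names `term_score` and `rank_terms` and the sample numbers are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def term_score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Weighted sum of feature values: sum over i of a_i * feature_i(term)."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(a * f for a, f in zip(weights, features))


def rank_terms(
    weights: Sequence[float], candidates: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Return (term, score) pairs sorted from highest to lowest score."""
    scored = [(term, term_score(weights, feats)) for term, feats in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical trained weights for two features, and two candidate terms.
    learned_weights = [0.5, 0.25]
    candidate_features = {"cause": [0.2, 0.4], "prions": [1.0, 0.8]}
    for term, score in rank_terms(learned_weights, candidate_features):
        print(term, round(score, 3))
```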

The online query stage comprises the following steps：

It should be noted that step S5 is performed online: after the user submits a query to the biomedical literature search engine, the method automatically obtains the top N1 results of the initial retrieval and uses them for processing such as expansion of the user's query. This processing is transparent to the user.

S6, online candidate expansion term extraction, feature extraction, and scoring step: Using the biomedical resource, the candidate expansion term extraction method and the feature extraction method of offline steps S2-S3 are applied to the top N1 retrieval results of the new query, extracting the specialized terms and their features to obtain the online-stage candidate expansion terms; the extracted features measure the importance of each candidate expansion term in the expanded query. Using the feature weights trained in step S4, the online-stage candidate expansion terms are scored, and the K1 highest-scoring terms are added to the new query submitted online to form the online-stage expanded query, where K1 is a natural number;

The online-stage candidate expansion terms are ranked by score, and the K1 highest-ranked terms are added to the new query. The weight of an added term in the expanded query can be expressed as: [equation not reproduced] where sign is an indicator function, with sign=1 when the online-stage candidate expansion term already appears in the new query submitted online and sign=0 otherwise, and weight_original is the weight of the new query submitted online within the expanded query;

The final expanded query has the following form:

(weight₁ query_original weight₂(w₁ term₁ w₂ term₂…w_K term_K))

where weight₁ is the weight of the new query submitted online within the expanded query, weight₂ is the overall weight of all newly added expansion terms within the expanded query, w₁,w₂,…,w_K are the score weights of the expansion terms term₁,term₂,…,term_K, and K is the number of expansion terms finally selected. In this embodiment, weight₁ is 0.5, weight₂ is 0.5, and K is 50.
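The expanded query form above can be assembled mechanically once the candidate terms have scores. The Python sketch below is one way to do so; it assumes the per-term weights are the scores of the selected terms normalized to sum to 1, which is an illustrative choice rather than the patent's stated formula. The function name `build_expanded_query` and the sample scores are hypothetical.

```python
from typing import List, Tuple


def build_expanded_query(
    original_query: str,
    scored_terms: List[Tuple[str, float]],
    k: int = 50,
    weight1: float = 0.5,
    weight2: float = 0.5,
) -> str:
    """Build "(weight1 query weight2(w1 term1 ... wK termK))" from scored terms.

    ASSUMPTION: w_i is the term score divided by the sum of the top-k scores.
    """
    top = sorted(scored_terms, key=lambda pair: pair[1], reverse=True)[:k]
    total = sum(score for _, score in top)
    if not top or total <= 0:
        return original_query  # nothing usable to add; keep the query unchanged
    parts = " ".join(f"{score / total:.3f} {term}" for term, score in top)
    return f"({weight1} {original_query} {weight2}({parts}))"


if __name__ == "__main__":
    print(build_expanded_query(
        "mad cow disease", [("prions", 3.0), ("spongiform", 1.0)], k=2
    ))
```

With the defaults used here (weight1 = weight2 = 0.5), the original query and the block of expansion terms contribute equally, matching the values given for this embodiment.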

S7, query result return step: Retrieval is performed with the expanded query and the retrieval results are returned to the user, completing the retrieval process.

Corresponding to the above method, the present invention also provides a biomedical literature retrieval system based on the word ranking algorithm. Figure 2 shows the logical structure of the system.

Search engine query extraction module: extracts, from the search engine's historical query log, multiple queries and the top N result documents of each query, and collects the queries and result documents into a query pool, where N is a natural number. The module can also retrieve the biomedical literature relevant to a user's query and return the results to the user, while the system's internal computations and operations, such as query expansion, remain transparent (invisible) to the user.

Candidate expansion term extraction module: given a user query, uses resources specific to the biomedical field to extract specialized terms from the top N result documents obtained by the search engine query extraction module, and records the number of times (frequency) each specialized term occurs in the result documents, or the weighted sum of its occurrences; sorts the specialized terms in descending order of occurrence count or weighted sum, and selects the M highest-ranked specialized terms as candidate expansion terms, where M is a natural number;

Candidate expansion term feature extraction and labeling module: extracts the associated features of the candidate expansion terms obtained by the candidate expansion term extraction module, and labels the relevance of each candidate expansion term according to its effect on retrieval performance. In offline training, the relevance labels and the features of the candidate expansion terms serve as input to the word ranking algorithm; in online querying, the module extracts the feature information associated with the candidate expansion terms.

Candidate expansion term ranking model training module: after the candidate expansion term features have been extracted and the relevance of the candidate expansion terms has been labeled, uses the word ranking algorithm to train the ranking model, which outputs a weight for each feature of the candidate expansion terms; these weights can be used to measure the importance of expansion terms for unseen queries. Specifically: select one candidate expansion term labeled relevant by the feature extraction and labeling module and several candidate expansion terms labeled irrelevant to form a word group, and select a number of such word groups as training samples; randomly assign an initial weight to each feature of the candidate expansion terms, and rank the expansion terms within each word group by their feature-weighted scores; from the ranking result of each word group, compute the global ranking loss, and dynamically adjust the weight of each feature dimension according to the gradient of the loss function. The ranking loss is: [equation not reproduced] where NumSample is the number of word groups and loss_i is the loss value of the i-th word group. This loss value is obtained from the rank position of the relevant expansion term in the group; the higher the position, the smaller the loss. The preceding process is iterated until the overall loss falls below a threshold or the specified number of iterations is reached, at which point training is complete and the finally selected feature weight values constitute the trained ranking model.
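The training data used by the offline modules above comes from two operations: selecting frequent specialized terms from the top N result documents, and labeling each candidate by its effect on retrieval performance. The Python sketch below illustrates both under simplifying assumptions: the biomedical resource is a set of lowercase single-word terms, tokenization is a plain regular expression, the optional `decay` list plays the role of per-document weighting factors for the weighted sum, and a term is labeled relevant only when it strictly improves the metric. The names `extract_candidates`, `label_term`, and `evaluate` are hypothetical; `evaluate` stands for any retrieval metric computed by running a query.

```python
import re
from collections import Counter
from typing import Callable, List, Optional, Sequence, Set


def extract_candidates(
    documents: Sequence[str],
    vocabulary: Set[str],
    m: int,
    decay: Optional[Sequence[float]] = None,
) -> List[str]:
    """Return the m specialized terms with the highest (weighted) occurrence count.

    documents: top N result documents, in rank order.
    decay: optional per-document factors; None means plain occurrence counts.
    """
    factors = list(decay) if decay is not None else [1.0] * len(documents)
    totals: Counter = Counter()
    for text, factor in zip(documents, factors):
        tokens = re.findall(r"[a-z]+", text.lower())
        counts = Counter(tok for tok in tokens if tok in vocabulary)
        for term, count in counts.items():
            totals[term] += count * factor
    return [term for term, _ in totals.most_common(m)]


def label_term(
    query_terms: List[str], term: str, evaluate: Callable[[List[str]], float]
) -> int:
    """Label 1 if adding the term improves retrieval performance, else 0.

    ASSUMPTION: strict improvement is required for the label to be 1.
    """
    return 1 if evaluate(query_terms + [term]) > evaluate(query_terms) else 0
```

Passing the metric as a callable keeps the labeling rule independent of the search engine and of the chosen evaluation index.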

The online search part includes:

Query reconstruction module: extracts specialized terms for a new query and scores the candidate expansion terms. It comprises an online search engine query extraction module and an online candidate expansion term extraction, feature extraction, and scoring module. The online search engine query extraction module retrieves the top N1 results for the new query submitted online by the user, and uses the biomedical resource to extract the specialized terms in those results together with their features, where N1 is a natural number. The online candidate expansion term extraction, feature extraction, and scoring module uses the feature weights output by the ranking model to compute the score and the corresponding weight of each candidate expansion term, and adds the terms to the original query to obtain the expanded query.

Query result return module: returns to the user the result documents retrieved with the expanded query. The results the user receives are therefore the results for the submitted query after query expansion, and the query expansion process is invisible to the user.

Following the above description of the method and system embodiments of the present invention, a specific example is now given. It is assumed in this example that the ranking model has already been trained on historical data. When the user submits a new query "mad cow disease", the system first selects candidate expansion terms according to word frequency information in the top-ranked documents of the initial retrieval. The 10 highest-ranked candidate expansion terms and their relevance labels are shown in the table below:

Ranking	Term	Relevance
1	Disease	Relevant
2	Prions	Relevant
3	Cause	Irrelevant
4	Infectious	Relevant
5	Conversion	Irrelevant
6	Cow	Relevant
7	Spongiform	Relevant
8	Fatal	Irrelevant
9	Encephalopathies	Relevant
10	Mad	Relevant

As the table shows, 3 of the 10 highest-ranked candidate expansion terms are irrelevant; adding them directly to the original query would harm retrieval performance. Next, features related to the candidate expansion terms are extracted from the documents and from the biomedical thesaurus MeSH, the weight of each feature is obtained from the ranking model, and all candidate expansion terms are scored and re-ranked.

The 10 highest-ranked expansion terms finally selected after re-ranking are shown in the table below. As the table shows, after re-ranking all 10 of the highest-ranked terms in the expanded query are relevant. Adding these terms to the original query, with their normalized ranking scores as weights, and then performing retrieval can further improve retrieval performance.

The above embodiment explains and illustrates the biomedical literature retrieval method and system based on the word ranking algorithm provided by the present invention. The method and system expand the user's original query using resources such as knowledge bases of the biomedical field, and use the word ranking algorithm to measure the importance of expansion terms during expansion. The query expansion process supplements and refines the query submitted by the user, thereby ensuring the accuracy of the query results and better meeting the user's information needs.

The above is a further detailed description of the present invention in connection with specific preferred technical solutions, and the specific implementation of the present invention shall not be regarded as limited to these descriptions. For a person of ordinary skill in the technical field of the invention, simple deductions or substitutions made without departing from the concept of the invention shall all be regarded as falling within the scope of protection of the present invention.

Claims

1. A biomedical literature retrieval method based on a word ranking algorithm, characterized in that it comprises an offline training stage and an online query stage, wherein the offline training stage comprises the following steps:

S1, search engine query extraction step: from the historical query log of the search engine, extract multiple queries and the top N result documents of each query, and collect the queries and result documents into a query pool, where N is a natural number;

S2, candidate expansion term extraction step: using the biomedical resource, extract the specialized terms in the top N result documents of each query in the query pool, and count the number of times each specialized term occurs in the result documents, or the weighted sum of its occurrences; sort the specialized terms in descending order of occurrence count or weighted sum, and select the M specialized terms with the highest occurrence count or the highest weighted sum as candidate expansion terms, where M is a natural number;

S3, candidate expansion term feature extraction and labeling step: the feature extraction and the labeling of the candidate expansion terms are carried out together; the relevance of a candidate expansion term is labeled by comparing the retrieval performance of the original query with the retrieval performance obtained when the candidate expansion term is added to the original query; the evaluation indexes of retrieval performance include precision, average precision, NDCG, and MRR; the relevance label is determined as follows:

where eval() is the evaluation function that measures retrieval performance, eval(query+term) is the score given by eval() when the candidate expansion term term is added to the query query, and eval(query) is the score given by eval() for the query query alone; label 1 indicates that the candidate expansion term is relevant to the query query, and label 0 indicates that it is irrelevant;

The feature extraction of the candidate expansion terms extracts, from the biomedical resource and from the top N result documents returned for the queries in the query pool, the distribution information of the candidate expansion terms, their distribution information in the biomedical resource, and their relevance to the original query, in preparation for training the ranking model; after the features of a candidate expansion term have been extracted, all feature values are normalized so that they lie in the interval [0,1]; the normalization is as follows:

S4, candidate expansion term ranking model training step: using the relevance labels and the features of the candidate expansion terms, train with the word ranking algorithm to obtain a weight for each feature, specifically: select one candidate expansion term labeled relevant in step S3 and several candidate expansion terms labeled irrelevant to form a word group, and select a number of such word groups as training samples; randomly assign an initial weight to each feature of the candidate expansion terms, and rank the expansion terms within each word group by their feature-weighted scores; from the ranking result of each word group, compute the global ranking loss, and dynamically adjust the weight of each feature dimension according to the gradient of the loss function, wherein the ranking loss is: [equation not reproduced] where NumSample is the number of word groups and loss_i is the loss value of the i-th word group, obtained from the rank position of the relevant expansion term in the group, the higher the position the smaller the loss; iterate the preceding process until the overall loss falls below a threshold or the specified number of iterations is reached, at which point training is complete and the finally selected feature weight values constitute the trained ranking model;

The online query stage comprises the following steps：

S5, online search engine query and extraction step: for a new query submitted online by the user, retrieve the top N1 results; using the biomedical resource, extract the specialized terms in the top N1 retrieval results together with their features, where N1 is a natural number;

S6, online candidate expansion term extraction, feature extraction, and scoring step: using the biomedical resource, apply the candidate expansion term extraction method and the feature extraction method of offline steps S2-S3 to the top N1 retrieval results of the new query, extracting the specialized terms and their features to obtain the online-stage candidate expansion terms, the extracted features measuring the importance of each candidate expansion term in the expanded query; using the feature weights trained in step S4, score the online-stage candidate expansion terms, and add the K1 highest-scoring candidate expansion terms to the new query submitted online to form the expanded query, where K1 is a natural number;

For an online-stage candidate expansion term extracted using the biomedical resource and to be scored, its score is $score(term) = \sum_{i=1}^{FeatureNum} a_i \cdot feature_i(term)$, where FeatureNum is the total number of features, a_i is the weight of the i-th feature in the ranking model, and feature_i(term) is the value of the i-th feature of the online-stage candidate expansion term term;

The online-stage candidate expansion terms are ranked by score, and the K1 highest-ranked terms are added to the new query submitted online as expansion terms; the weight of an added online-stage candidate expansion term in the expanded query can be expressed as: [equation not reproduced] where sign is an indicator function, with sign=1 when the online-stage candidate expansion term already appears in the new query submitted online and sign=0 otherwise, and weight_original is the weight of the new query submitted online within the expanded query;

2. The biomedical literature retrieval method based on a word ranking algorithm according to claim 1, characterized in that, in step S2, the weighted sum of the occurrences of a specialized term in the result documents is $\sum_{i=1}^{N} count_i \cdot d_i$, where count_i is the number of times the term occurs in the i-th document and d_i is the decay factor of the i-th document.

3. The biomedical literature retrieval method based on a word ranking algorithm according to claim 1, characterized in that, in step S3, the evaluation function eval() is the average precision function, that is:

$$eval_{MAP} = \frac{1}{RelDoc_{query}} \cdot \sum_{i=1}^{RelDoc_{query}} \frac{i}{rank(i)}$$

where RelDoc_query is the number of documents relevant to the given query query, and rank(i) is the position of the i-th relevant document in the ranked list of result documents.
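For illustration only, the average precision formula above can be computed as in the following Python sketch. It assumes that relevant documents missing from the ranked list contribute zero to the sum, which is a common convention but is not stated in the claim; the function name `average_precision` and the document identifiers are hypothetical.

```python
from typing import Sequence, Set


def average_precision(ranked_doc_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """(1 / RelDoc) * sum over relevant documents of i / rank(i).

    i counts relevant documents in the order they appear in the ranked list,
    and rank(i) is the 1-based position of the i-th relevant document.
    ASSUMPTION: relevant documents that were not retrieved contribute 0.
    """
    if not relevant_ids:
        return 0.0
    hits = 0
    total = 0.0
    for position, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / position
    return total / len(relevant_ids)


if __name__ == "__main__":
    # Relevant documents at positions 1 and 3: (1/1 + 2/3) / 2
    print(average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"}))
```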

4. The biomedical literature retrieval method based on a word ranking algorithm according to claim 1, characterized in that, in step S1, when no historical query log is available, records of queries and their results are produced manually by constructing biomedical queries and applying a retrieval method; the retrieval method uses the vector space model, the BM25 retrieval model, or a language model based on different smoothing methods.

5. The biomedical literature retrieval method based on a word ranking algorithm according to claim 1, characterized in that the loss value in step S4 is: [equation not reproduced] where rank_i is the position of the relevant candidate expansion term in the ranked list of its word group.

6. The biomedical literature retrieval method based on a word ranking algorithm according to claim 1, characterized in that the biomedical resource is a dictionary or knowledge base containing specialized biomedical terms.

7. The biomedical literature retrieval method based on a word ranking algorithm according to claim 1, characterized in that the features of a candidate expansion term include: the frequency TF with which the candidate expansion term occurs in the result documents; the TF-IDF value of the candidate expansion term; the number of documents in which the candidate expansion term co-occurs with the original query; the number of times the candidate expansion term and the original query co-occur within the same text window; the number of times the candidate expansion term occurs in the biomedical resource; and, in the biomedical resource, the number of term concepts containing the candidate expansion term and the inclusion relations between biomedical term concepts.

8. A biomedical literature retrieval system based on a word ranking algorithm, characterized in that it comprises an offline training part and an online search part; the offline training part comprises the following:

Search engine query extraction module: extracts, from the historical query log of the search engine, multiple queries and the top N result documents of each query, and collects the queries and result documents into a query pool, where N is a natural number;

Candidate expansion term extraction module: given a user query, uses resources specific to the biomedical field to extract specialized terms from the top N result documents obtained by the search engine query extraction module, and records the frequency with which each specialized term occurs in the result documents, or the weighted sum of its occurrences; sorts the specialized terms in descending order of occurrence count or weighted sum, and selects the M highest-ranked specialized terms as candidate expansion terms, where M is a natural number;

Candidate expansion term feature extraction and labeling module: extracts the associated features of the candidate expansion terms obtained by the candidate expansion term extraction module, and labels the relevance of each candidate expansion term according to its effect on retrieval performance;

Candidate expansion term ranking model training module: after the candidate expansion term features have been extracted and the relevance of the candidate expansion terms has been labeled, uses the word ranking algorithm to train the ranking model and obtain a weight for each feature of the candidate expansion terms, specifically: select one candidate expansion term labeled relevant by the feature extraction and labeling module and several candidate expansion terms labeled irrelevant to form a word group, and select a number of such word groups as training samples; randomly assign an initial weight to each feature of the candidate expansion terms, and rank the expansion terms within each word group by their feature-weighted scores; from the ranking result of each word group, compute the global ranking loss, and dynamically adjust the weight of each feature dimension according to the gradient of the loss function, wherein the ranking loss is: [equation not reproduced] where NumSample is the number of word groups and loss_i is the loss value of the i-th word group, obtained from the rank position of the relevant expansion term in the group, the higher the position the smaller the loss; iterate the preceding process until the overall loss falls below a threshold or the specified number of iterations is reached, at which point training is complete and the finally selected feature weight values constitute the trained ranking model;

The online search part includes:

Query reconstruction module: extracts specialized terms for a new query and scores the candidate expansion terms; it comprises an online search engine query extraction module and an online candidate expansion term extraction, feature extraction, and scoring module, wherein the online search engine query extraction module retrieves the top N1 results for the new query submitted online by the user, and uses the biomedical resource to extract the specialized terms in the top N1 retrieval results together with their features, where N1 is a natural number; the online candidate expansion term extraction, feature extraction, and scoring module uses the feature weights output by the ranking model to compute the score and the corresponding weight of each candidate expansion term, and adds the terms to the original query to obtain the expanded query;