CN104750819B - Biomedical literature retrieval method and system based on a word-level ranking algorithm
- Publication number: CN104750819B (application CN201510147696.5A)
- Authority: CN (China)
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A biomedical literature retrieval method and system based on a word-level ranking algorithm. The retrieval method comprises: S1, a search-engine query extraction step; S2, a candidate expansion term extraction step; S3, a feature extraction and labelling step for the candidate expansion terms; S4, a training step for the candidate expansion term ranking model; S5, an online search-engine query and extraction step; S6, an online candidate expansion term retrieval, feature extraction, and scoring step; S7, a query result return step. The retrieval system comprises a search-engine query extraction module, a candidate expansion term extraction module, a feature extraction and labelling module for candidate expansion terms, a candidate expansion term ranking model training module, a query reconstruction module, and a query result return module. Approaching retrieval from the angle of query expansion, the invention uses the word-level ranking algorithm and the dictionary resources native to the biomedical field to select, during query expansion, the specialized terms that best express the user's information need, thereby completing the retrieval task and improving retrieval performance.
Description
Technical field
The present invention relates to the fields of data mining and search engine technology, and in particular to a biomedical literature retrieval method and system based on a word-level ranking algorithm.
Background art
In recent years, with the rapid development of biomedicine, biomedical research has produced many valuable results. These results not only contribute to the treatment of diseases that once seemed incurable but, in the longer view, also advance and deepen humanity's understanding of itself.
However, as the volume of biomedical literature grows rapidly, the amount of related information is growing exponentially. This mass of documents and information makes it difficult for biomedical researchers and practitioners to obtain the information they need, and traditional manual methods of gathering information are no longer adequate. Information retrieval techniques and methods are therefore needed to help these users obtain the information they require.
Traditional information retrieval techniques rank documents or web pages by relevance to the query submitted by the user and return the ranked results to the user. However, applying traditional information retrieval methods directly to biomedical literature retrieval rarely yields good retrieval performance. The reason is that they do not take sufficient account of the inherent characteristics of the biomedical field: for example, the field has a large specialized vocabulary, and these specialized terms often have many synonyms and abbreviations. If traditional information retrieval methods took these characteristics fully into account, the performance of biomedical information retrieval would improve further.
Query expansion is one of the key techniques of traditional information retrieval. Starting from the original query submitted by the user, it supplements and refines the query according to the user's retrieval intent, producing a query that better matches that intent and improving retrieval performance. Existing query expansion methods fall into two broad classes. The first class is based on the document collection: these methods take all or part of the document collection as the object of study, extract content related to the query from it, and use that content to improve the original query. The second class is based on external expansion resources, which mainly include dictionary resources, retrieval-system query logs, anchor text, and Wikipedia. Many studies have shown that improving the original query with external expansion resources accomplishes the query expansion task better and in turn raises retrieval performance.
Because the biomedical field has many domain resources such as dictionaries, retrieval performance is very likely to improve if these resources are fully used during retrieval to supplement and refine the query submitted by the user.
To build a literature retrieval system for the biomedical field, one must first understand the characteristics and resources of the field. Biomedical documents contain a large number of specialized terms, and these terms involve complications such as many synonyms and abbreviations, which poses a major challenge for building a retrieval system. Take the drug acetaminophen as an example: its English name is paracetamol; in the international standard drug classification it is named paracetamol (acetaminophen); and in medicinal chemistry it is identified as C8H9NO2 or N02BE01. With this many names, a search on only one of them is unlikely to retrieve all the relevant documents. Fortunately, the biomedical field also has many native knowledge bases and resources, such as Medical Subject Headings (MeSH) and the Gene Ontology (GO). If these resources are fully used during retrieval, the performance of biomedical literature retrieval will improve greatly.
Learning to rank is the general name for a family of supervised learning algorithms used to rank documents in information retrieval. Its main characteristic is that it applies machine learning to the ranking problem in information retrieval and achieves good ranking performance. A ranking problem can also be viewed as the problem of selecting the best items, so learning-to-rank algorithms have in recent years been applied to a number of other tasks, for example recommending items to a user in a recommender system based on the histories of users and items.
Summary of the invention
The object of the present invention is to provide a biomedical literature retrieval method and system based on a word-level ranking algorithm that delivers more accurate biomedical literature to the user, meets the user's information need more effectively, and effectively supplements and refines the user's query.
The technical solution adopted by the present invention to solve the problems of the prior art is a biomedical literature retrieval method based on a word-level ranking algorithm, comprising an offline training stage and an online query stage. The offline training stage comprises the following steps:
S1, search-engine query extraction step: from the historical query records of the search engine, extract multiple queries and the top N result documents returned for each query, and collect the queries and their result documents into a query pool, where N is a natural number;
S2, candidate expansion term extraction step: using a biomedical resource, extract the specialized terms from the top N result documents of each query in the query pool, and count, for each specialized term, either the number of times it occurs in those result documents or the weighted sum of its occurrence counts; sort the specialized terms in descending order of occurrence count or weighted sum, and select the M terms with the highest count or weighted sum as candidate expansion terms, where M is a natural number;
S3, feature extraction and labelling step for the candidate expansion terms:
Feature extraction and labelling of the candidate expansion terms are carried out at the same time. A candidate expansion term is labelled for relevance by comparing the retrieval performance of the original query with the retrieval performance obtained when the candidate expansion term is added to the original query. The evaluation metrics for retrieval performance include precision, average precision, NDCG, and MRR. The relevance label is assigned as follows:
$$ label(term) = \begin{cases} 1, & eval(query + term) > eval(query) \\ 0, & \text{otherwise} \end{cases} $$
where eval() is the evaluation metric function that measures retrieval performance, eval(query+term) is the score of eval() when the candidate expansion term term is added to the query query, and eval(query) is the score of eval() for the query query alone. A label of 1 means the candidate expansion term is relevant to the query query; a label of 0 means it is not relevant;
Feature extraction for the candidate expansion terms extracts, from the biomedical resource and from the top N result documents returned for each query in the query pool, the distribution of each candidate expansion term in the result documents, its distribution in the biomedical resource, and its relatedness to the original query, in preparation for training the ranking model. After the features of a candidate expansion term have been extracted, every feature value is normalized so that all feature values lie in the interval [0, 1]. The normalization is:
$$ value' = \frac{value - minValue}{maxValue - minValue} $$
where value is the raw value of a feature, value' is its normalized value, and minValue and maxValue are the minimum and maximum values of that feature;
S4, training step for the candidate expansion term ranking model: from the relevance labels and features of the candidate expansion terms, train the weight of each feature using the word-level ranking algorithm. The specific steps are: select one candidate expansion term labelled relevant in step S3 and several candidate expansion terms labelled irrelevant to form a word group, and select a number of such word groups as training samples; assign random initial weights to the features of the candidate expansion terms, and rank the relevant candidate expansion term within each word group by its weighted feature score; from the ranking results of the word groups, compute the global ranking loss, and adjust the weight of each feature dimension according to the gradient of the loss function. The global ranking loss is computed from the per-group losses loss_i (its formula is not reproduced in this text), where NumSample is the number of word groups and loss_i is the loss of the i-th word group. This loss is obtained from the ranked position of the relevant expansion term: the higher the position, the smaller the loss. The preceding process is iterated until the overall loss falls below a threshold or the specified number of iterations is reached, at which point training is complete; the feature weights finally obtained constitute the trained ranking model;
The online query stage comprises the following steps:
S5, online search-engine query and extraction step: for a new query submitted online by the user, retrieve the top N1 results, and use the biomedical resource to extract the specialized terms in those top N1 results together with their features, where N1 is a natural number;
S6, online candidate expansion term retrieval, feature extraction, and scoring step: using the biomedical resource, apply the candidate expansion term extraction method and feature extraction method of offline steps S2-S3 to the top N1 results of the new query to extract the specialized terms of the online query stage and their features, giving the online candidate expansion terms; the extracted features measure the importance of each candidate expansion term in the expanded query. Using the feature weights trained in step S4, score the online candidate expansion terms, and add the K1 highest-scoring candidate expansion terms to the new query submitted online to form the expanded query, where K1 is a natural number;
For an online candidate expansion term term extracted using the biomedical resource, its score is
$$ score(term) = \sum_{i=1}^{FeatureNum} a_i \cdot feature_i(term) $$
where FeatureNum is the total number of features, a_i is the weight of the i-th feature in the ranking model, and feature_i(term) is the value of the i-th feature of the online candidate expansion term term;
The online candidate expansion terms are ranked by score, and the top K1 are added to the new query submitted online as expansion terms. The weight of each added expansion term in the expanded query is given by a formula (not reproduced in this text) in which sign is an indicator function, with sign = 1 when the online candidate expansion term already appears in the new query submitted online and sign = 0 otherwise, and weight_original is the weight of the new query submitted online within the expanded query;
S7, query result return step: retrieve with the expanded query and return the retrieval results to the user.
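The scoring and selection in step S6 can be sketched in a few lines of Python. This is an illustrative sketch only: the function name, the dictionary input, and the toy weights are assumptions, and the per-term weighting inside the expanded query is left out because its formula is not available in this text.

```python
def expand_query(query_terms, candidate_features, weights, k1):
    """Score candidates with score(term) = sum_i a_i * feature_i(term)
    and append the k1 best to the query.

    candidate_features: dict mapping term -> list of normalized feature values.
    weights:            feature weights a_i from the trained ranking model.
    """
    scored = {
        term: sum(a * f for a, f in zip(weights, feats))
        for term, feats in candidate_features.items()
    }
    # Highest score first; ties broken alphabetically so the result is deterministic.
    ranked = sorted(scored, key=lambda t: (-scored[t], t))
    expansion = ranked[:k1]
    # Assumption: terms already present in the query are not appended again.
    expanded = list(query_terms) + [t for t in expansion if t not in query_terms]
    return expanded, scored

# Toy data (hypothetical weights and feature values).
weights = [0.7, 0.3]
candidates = {
    "encephalopathy": [0.9, 0.5],
    "protein": [0.6, 1.0],
    "assay": [0.1, 0.2],
}
expanded, scores = expand_query(["prion", "disease"], candidates, weights, k1=2)
```

With these toy values the two highest-scoring candidates are appended after the original query terms, and the expanded list is what would be submitted in step S7.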
In step S2, the weighted sum of the occurrence counts of a specialized term in the query result documents is
$$ score(term) = \sum_{i=1}^{N} count_i \cdot d_i $$
where count_i is the number of times the term occurs in the i-th document and d_i is the decay factor of the i-th document.
In step S3, the evaluation metric function eval() is the average precision function, that is:
$$ eval(query) = \frac{1}{RelDoc_{query}} \sum_{i=1}^{RelDoc_{query}} \frac{i}{rank(i)} $$
where RelDoc_query is the number of documents relevant to the given query query, and rank(i) is the position of the i-th relevant document in the ranked result list.
In step S1, when no historical query records are available, biomedical queries are constructed and run through a retrieval method, and the queries and their results are recorded manually. The retrieval method uses a vector space model, the BM25 retrieval model, or a language model with one of several smoothing methods.
In step S4, the loss of each word group is given by a formula (not reproduced in this text) in rank_i, the position of the relevant candidate expansion term in the ranked list of the i-th word group.
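The training loop of step S4 can be illustrated with a small Python sketch. Because the loss formulas are not available in this text, the sketch substitutes a softmax cross-entropy loss over each word group. That loss is an assumption standing in for the patent's rank-position loss, chosen because it also shrinks as the relevant term moves toward the top of its group and because it has a simple gradient. All function names, the learning rate, and the toy data are hypothetical.

```python
import math
import random

def weighted_score(weights, vector):
    """Weighted feature score of one candidate term."""
    return sum(w * x for w, x in zip(weights, vector))

def train_feature_weights(groups, num_features, lr=0.5, max_iter=500, tol=0.05, seed=0):
    """Learn one weight per feature from word groups by gradient descent.

    groups: list of (relevant_features, [irrelevant_features, ...]) tuples.
    Loss (assumed, not the patent's formula): mean over groups of the
    negative log softmax probability of the relevant term.
    """
    rng = random.Random(seed)
    weights = [rng.random() for _ in range(num_features)]  # random initial weights
    for _ in range(max_iter):
        total_loss = 0.0
        grad = [0.0] * num_features
        for relevant, irrelevant in groups:
            vectors = [relevant] + irrelevant
            scores = [weighted_score(weights, v) for v in vectors]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
            z = sum(exps)
            probs = [e / z for e in exps]
            total_loss += -math.log(probs[0])
            for j, v in enumerate(vectors):
                coeff = probs[j] - (1.0 if j == 0 else 0.0)
                for k in range(num_features):
                    grad[k] += coeff * v[k]
        if total_loss / len(groups) < tol:  # stop once the mean loss is small
            break
        weights = [w - lr * g / len(groups) for w, g in zip(weights, grad)]
    return weights

def rank_of_relevant(weights, relevant, irrelevant):
    """1-based position of the relevant term when the group is sorted by score."""
    own = weighted_score(weights, relevant)
    return 1 + sum(1 for v in irrelevant if weighted_score(weights, v) > own)

# Toy word groups: feature 0 favours the relevant term, feature 1 does not.
groups = [
    ([0.9, 0.2], [[0.1, 0.8], [0.3, 0.5]]),
    ([0.8, 0.4], [[0.2, 0.9], [0.4, 0.6]]),
]
weights = train_feature_weights(groups, num_features=2)
ranks = [rank_of_relevant(weights, rel, irr) for rel, irr in groups]
```

The stopping rule mirrors the text: iteration ends when the mean loss drops below a threshold or the iteration limit is reached, and the learned weights are the ranking model.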
A biomedical resource is a dictionary or knowledge base that contains biomedical specialized terms.
The features of a candidate expansion term include: the frequency (TF) with which the term occurs in the result documents; the TF-IDF value of the term; the number of documents in which the term co-occurs with the original query; the number of times the term co-occurs with the original query within the same text window; the number of times the term occurs in the biomedical resource; the number of term concepts in the biomedical resource that contain the term; and the inclusion relations among biomedical term concepts.
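Four of these features can be sketched in Python as follows. The sketch makes several assumptions: documents are pre-tokenised lists, terms are single tokens, TF-IDF uses the common form tf × log(TotalDoc / df) because the patent's own formula is not available in this text, and the window size is a hypothetical parameter. The features that depend on the structure of the biomedical resource are omitted.

```python
import math

def term_features(term, query_terms, result_docs, total_docs, doc_freq, window=5):
    """Compute four illustrative features for one candidate term.

    result_docs: token lists of the top-N result documents.
    total_docs:  number of documents in the collection.
    doc_freq:    number of collection documents containing the term (assumed > 0).
    window:      hypothetical maximum token distance for window co-occurrence.
    """
    query_set = set(query_terms)
    tf = sum(doc.count(term) for doc in result_docs)
    tfidf = tf * math.log(total_docs / doc_freq)  # assumed standard TF-IDF form
    co_docs = 0    # documents containing both the term and a query word
    co_window = 0  # (term, query word) pairs within `window` tokens of each other
    for doc in result_docs:
        term_pos = [i for i, tok in enumerate(doc) if tok == term]
        query_pos = [i for i, tok in enumerate(doc) if tok in query_set]
        if term_pos and query_pos:
            co_docs += 1
        co_window += sum(1 for t in term_pos for q in query_pos if abs(t - q) <= window)
    return {"tf": tf, "tfidf": tfidf, "co_docs": co_docs, "co_window": co_window}

# Toy result documents (hypothetical).
docs = [
    ["prion", "disease", "alters", "protein", "folding"],
    ["protein", "assay", "results"],
    ["prion", "x", "x", "x", "x", "x", "x", "protein"],
]
features = term_features("protein", ["prion"], docs, total_docs=1000, doc_freq=10, window=3)
```

In the third toy document the term and the query word co-occur in the document but lie outside the window, which is why the document-level and window-level counts differ.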
A biomedical literature retrieval system based on a word-level ranking algorithm comprises an offline training part and an online retrieval part. The offline training part comprises the following modules:
Search-engine query extraction module: extracts, from the historical query records of the search engine, multiple queries and the top N result documents returned for each query, and collects the queries and their result documents into a query pool, where N is a natural number;
Candidate expansion term extraction module: given a user query, uses the resources native to the biomedical field to extract specialized terms from the top N result documents obtained by the search-engine query extraction module, and records the frequency of each specialized term in the result documents or the weighted sum of its occurrence counts; sorts the specialized terms in descending order of occurrence count or weighted sum and selects the M terms with the highest count as candidate expansion terms, where M is a natural number;
Feature extraction and labelling module for candidate expansion terms: extracts the relevant features of the candidate expansion terms obtained by the candidate expansion term extraction module, and labels the relevance of each candidate expansion term according to its effect on retrieval performance;
Candidate expansion term ranking model training module: after the features of the candidate expansion terms have been extracted and their relevance labelled, uses the word-level ranking algorithm to train the term ranking model and obtain the weight of each feature of the candidate expansion terms. Specifically: select one candidate expansion term labelled relevant by the feature extraction and labelling module and several candidate expansion terms labelled irrelevant to form a word group, and select a number of such word groups as training samples; assign random initial weights to the features of the candidate expansion terms, and rank the relevant candidate expansion term within each word group by its weighted feature score; from the ranking results of the word groups, compute the global ranking loss, and adjust the weight of each feature dimension according to the gradient of the loss function. The global ranking loss is computed from the per-group losses loss_i (its formula is not reproduced in this text), where NumSample is the number of word groups and loss_i is the loss of the i-th word group. This loss is obtained from the ranked position of the relevant expansion term: the higher the position, the smaller the loss. The preceding process is iterated until the overall loss falls below a threshold or the specified number of iterations is reached, at which point training is complete; the feature weights finally obtained constitute the trained ranking model;
The online retrieval part comprises:
Query reconstruction module: extracts the specialized terms for a new query and scores the candidate expansion terms. It comprises an online search-engine query extraction module and an online candidate expansion term retrieval, feature extraction, and scoring module. The online search-engine query extraction module retrieves the top N1 results for a new query submitted online by the user and uses the biomedical resource to extract the specialized terms in those results together with their features, where N1 is a natural number. The online candidate expansion term retrieval, feature extraction, and scoring module uses the feature weights output by the term ranking model to compute the score and corresponding weight of each candidate expansion term and adds the terms to the original query to obtain the expanded query;
Query result return module: returns to the user the result documents retrieved with the expanded query.
The beneficial effects of the present invention are as follows. Working from the angle of query expansion, the invention uses the word-level ranking algorithm together with resources such as the dictionaries native to the biomedical field to select, during query expansion, the specialized terms that best express the user's information need. It thereby completes the retrieval task more effectively and returns results that match the user's need more closely. The invention uses resources of the biomedical field to supplement and refine the original query and so improves retrieval performance. On the TREC Genomics track literature collection, with the traditional BM25 retrieval model as the baseline, document retrieval achieves a precision of 25.62%; retrieving with the method and system of the present invention on the same basis achieves a precision of 26.30%, an absolute gain of 0.68 percentage points. Retrieval performance is thus improved, and the retrieval method of the present invention can effectively retrieve the biomedical documents most relevant to the user's query, improving user satisfaction.
Brief description of the drawings
Fig. 1 is a flow chart of the retrieval method of the present invention;
Fig. 2 is a schematic diagram of the logical structure of the retrieval system of the present invention.
Detailed description
The present invention is described below with reference to the drawings and specific embodiments.
Fig. 1 is a flow chart of the biomedical literature retrieval method based on a word-level ranking algorithm according to the present invention. The method comprises an offline training stage and an online query stage. The offline training stage comprises the following steps:
S1, search-engine query extraction step: from the historical query records of the search engine, extract multiple queries and the top N result documents returned for each query, and collect the queries and their result documents into a query pool, where N is a natural number. In this embodiment, N = 10.
Here, the historical query records of the search engine are chiefly the query history, and the corresponding query results, recorded by a retrieval system for biomedical literature. These queries and results are used offline to train the ranking model.
When no relevant historical query records are available, biomedical queries can be constructed and retrieved, and the queries and their retrieval results recorded manually. The retrieval method can be any of the ranking models of traditional information retrieval, including but not limited to the vector space model, the BM25 retrieval model, and language models with different smoothing methods.
S2, candidate expansion term extraction step: using a biomedical resource, extract the specialized terms from the top N result documents of each query in the query pool, and count, for each specialized term, either the number of times it occurs in those result documents or the weighted sum of its occurrence counts; sort the specialized terms in descending order of occurrence count or weighted sum, and select the M terms with the highest count or weighted sum as candidate expansion terms, where M is a natural number;
Here, a biomedical resource is a resource such as a dictionary or knowledge base that contains biomedical specialized terms, including but not limited to Medical Subject Headings (MeSH), the Gene Ontology (GO), and the Metathesaurus, Semantic Network, and SPECIALIST Lexicon and Lexical Tools released by the Unified Medical Language System (UMLS).
Taking MeSH as the biomedical resource, the specialized terms in the top N result documents of a query are extracted, and each extracted term is associated with its occurrence count, or the weighted sum of its occurrence counts, in those documents. For example, the weighted sum of the occurrence counts of a specialized term term in the top N documents is computed as
$$ score(term) = \sum_{i=1}^{N} count_i \cdot d_i $$
where count_i is the number of times the term occurs in the i-th document and d_i is the decay factor of the i-th document. The weighted sum weights the term frequencies from different documents so that frequencies in higher-ranked documents carry more weight and terms found in lower-ranked documents receive lower scores. The specialized terms are sorted from high to low either by the total occurrence count count(term) or by score(term), and the top M terms are selected as candidate expansion terms. In this embodiment M is 150.
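The counting and selection in S2 can be sketched as follows. This is a minimal Python sketch rather than the patented implementation: the function name, the tokenised-document input, and the reciprocal-rank decay d_i = 1/i are assumptions made for illustration (the text says only that d_i is a decay factor), and single-token matching against a term set stands in for lookup in a resource such as MeSH.

```python
from collections import defaultdict

def select_candidate_terms(ranked_docs, vocabulary, m, decay=lambda i: 1.0 / i):
    """Return the m specialized terms with the highest decayed count.

    ranked_docs: list of token lists, best-ranked document first.
    vocabulary:  set of specialized terms (stand-in for a resource such as MeSH).
    decay:       hypothetical decay factor d_i for the document at rank i (1-based).
    """
    scores = defaultdict(float)
    for i, tokens in enumerate(ranked_docs, start=1):
        d_i = decay(i)
        for token in tokens:
            if token in vocabulary:
                scores[token] += d_i  # accumulates count_i * d_i one occurrence at a time
    # Sort by score descending; break ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:m]

# Toy ranked result list (hypothetical).
docs = [
    ["prion", "protein", "disease", "prion"],
    ["protein", "gene", "assay"],
    ["gene", "gene", "prion"],
]
mesh_like = {"prion", "protein", "gene"}
top_terms = select_candidate_terms(docs, mesh_like, m=2)
```

Passing `decay=lambda i: 1.0` reduces the score to the plain occurrence count, which is the other ordering the step allows.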
S3, feature extraction and relevance labelling step for the candidate expansion terms:
Feature extraction and labelling of the candidate expansion terms are carried out at the same time. A candidate expansion term is labelled for relevance by comparing the retrieval performance of the original query with the retrieval performance obtained when the term is added to the original query. The idea is to add a single candidate expansion term to the original query and run the retrieval; if retrieval performance improves, the term is labelled as relevant to the original query. The evaluation metrics for retrieval performance include but are not limited to precision, mean average precision (MAP), NDCG, and MRR. The label is assigned as follows:
$$ label(term) = \begin{cases} 1, & eval(query + term) > eval(query) \\ 0, & \text{otherwise} \end{cases} $$
where eval() is the evaluation metric function that measures retrieval performance, eval(query+term) is the score of eval() when the candidate expansion term term is added to the given query query, and eval(query) is the score of eval() for the given query query alone. When the evaluation score of retrieval with the original query plus a candidate term exceeds the evaluation score of retrieval with the original query alone, the candidate expansion term is labelled 1, meaning it is relevant to the original query. When the score with the candidate term does not exceed the score of the original query alone, the candidate expansion term is labelled 0, meaning it is not relevant to the original query.
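The labelling rule can be expressed as a short Python sketch. The `evaluate` callable is a hypothetical stand-in for running a retrieval and scoring it with a metric such as average precision; the toy evaluator below simply sums made-up per-term contributions so that the example runs on its own.

```python
def label_candidates(query, candidates, evaluate):
    """Label each candidate term 1 if adding it to the query raises the
    evaluation score, otherwise 0.

    evaluate: callable taking a list of query terms and returning a score
              (hypothetical; e.g. average precision of the retrieval run).
    """
    baseline = evaluate(query)
    labels = {}
    for term in candidates:
        expanded = query + [term]  # add a single candidate term to the original query
        labels[term] = 1 if evaluate(expanded) > baseline else 0
    return labels

# Toy evaluator: pretend each term contributes a fixed amount to the score.
toy_gain = {"prion": 0.20, "disease": 0.10, "protein": 0.05, "assay": -0.02, "gene": 0.0}

def toy_evaluate(terms):
    return sum(toy_gain.get(t, 0.0) for t in terms)

labels = label_candidates(["prion", "disease"], ["protein", "assay", "gene"], toy_evaluate)
```

A term that leaves the score unchanged is labelled 0, matching the "does not exceed" condition in the text.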
In this embodiment, the evaluation function eval() is average precision, that is:
$$ eval(query) = \frac{1}{RelDoc_{query}} \sum_{i=1}^{RelDoc_{query}} \frac{i}{rank(i)} $$
where RelDoc_query is the number of documents relevant to the given query query, and rank(i) is the position of the i-th relevant document in the ranked result list. For example, rank(3) = 5 means that the third relevant document in the ranked result list appears at position 5 of the list.
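Average precision can be computed with a few lines of Python. The function and variable names are illustrative assumptions. The example list is built so that the third relevant document sits at position 5, as in the rank(3) = 5 example.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision of one ranked list.

    ranked_doc_ids: document ids in ranked order, best first.
    relevant_ids:   set of ids judged relevant to the query (RelDoc_query = its size).
    Relevant documents missing from the list contribute 0.
    """
    if not relevant_ids:
        return 0.0
    hits = 0
    total = 0.0
    for position, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1                 # this is the hits-th relevant document
            total += hits / position  # i / rank(i)
    return total / len(relevant_ids)

# Relevant documents appear at positions 1, 3, and 5.
ap = average_precision(["d1", "d7", "d3", "d9", "d4"], {"d1", "d3", "d4"})
```

Here the three relevant documents contribute 1/1, 2/3, and 3/5, so the result is their mean.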
Feature extraction for the candidate expansion terms draws on the biomedical resource and on the top N query result documents returned for the queries in the query pool. It extracts the distribution of each candidate expansion term in the result documents, the distribution of the candidate term in the biomedical resource, and the relatedness of the candidate expansion term to the original query, in preparation for training the ranking model. After the features of a candidate expansion term have been extracted, all feature values are normalized so that they lie in the interval [0, 1]. The normalization is:

$$
newFeatureValue = \frac{oldFeatureValue - minValue}{maxValue - minValue}
$$

where minValue and maxValue are respectively the minimum and maximum values of the given feature.
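A minimal Python sketch of this min-max normalization follows. The function name is hypothetical, and the handling of a constant feature (maxValue equal to minValue), which the text does not address, is an assumption.

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Scale the values of one feature into the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a feature with a single distinct value maps to 0.0,
        # which avoids division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

For example, min_max_normalize([2, 4, 6]) returns [0.0, 0.5, 1.0].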
The features of an expansion term specifically include:
1. The frequency (TF) with which the candidate expansion term occurs in the result documents. This feature is obtained by counting the occurrences of the specialized term in the result documents.
2. The TF-IDF value of the candidate expansion term. TF-IDF is one of the classical models of information retrieval and measures the relative importance of a term. It is computed from the following quantities: count(term) is the number of times the candidate expansion term occurs in the i-th result document, TotalDoc is the total number of documents in the training data, and df(term) is the number of documents in which the candidate expansion term occurs.
3. The number of documents in which the candidate expansion term and the original query occur together. This feature is used to compute the relatedness between the original query and the candidate expansion term.
4. The number of times the candidate expansion term and the original query occur together within one text window. This feature is used to compute the relatedness between the query words of the original query and the candidate expansion term within a limited range; the text window is the number of words separating the expansion term from an original query word within a document that contains both.
5. The number of times the candidate expansion term occurs in a biomedical resource such as MeSH. This feature is used to measure the distribution of the candidate expansion term in the biomedical resource.
6. The number of term concepts in a biomedical resource such as MeSH that contain the candidate expansion term. Biomedical term concepts often contain one another, so this feature likewise measures the importance of a candidate term in the biomedical resource.
Among the candidate expansion term features extracted above, features 1 and 2 measure the distribution of the candidate expansion term in the document collection; features 3 and 4 measure the relatedness of the candidate expansion term to the original query; and features 5 and 6 measure the distribution of the candidate expansion term in the biomedical resource. The expansion term features of the present invention include, but are not limited to, the features above. The extracted features serve as the input of the word ranking algorithm so that the importance of candidate expansion terms can be measured more accurately.
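Features 1 to 4 can be sketched in Python as shown below. This is only an illustration under stated assumptions: documents are already tokenized into lists of strings, the function and key names are hypothetical, and the TF-IDF line uses the standard form tf * log(TotalDoc / df), since the exact formula is not reproduced in this text. Features 5 and 6 depend on the structure of the biomedical resource and are not shown.

```python
import math
from typing import Dict, List, Sequence


def term_features(
    term: str,
    query_terms: Sequence[str],
    result_docs: List[List[str]],
    total_docs: int,
    df: int,
    window: int = 5,
) -> Dict[str, float]:
    """Compute TF, TF-IDF, co-occurring documents and window co-occurrences."""
    query_set = set(query_terms)
    # Feature 1: total occurrences of the term in the result documents.
    tf = sum(doc.count(term) for doc in result_docs)
    # Feature 2 (assumed standard TF-IDF form).
    tfidf = tf * math.log(total_docs / df) if df > 0 else 0.0
    co_docs = 0
    co_window = 0
    for doc in result_docs:
        term_pos = [i for i, tok in enumerate(doc) if tok == term]
        query_pos = [i for i, tok in enumerate(doc) if tok in query_set]
        # Feature 3: documents containing both the term and a query word.
        if term_pos and query_pos:
            co_docs += 1
        # Feature 4: (term, query word) pairs at most `window` tokens apart.
        co_window += sum(
            1 for t in term_pos for q in query_pos if t != q and abs(t - q) <= window
        )
    return {
        "tf": float(tf),
        "tfidf": tfidf,
        "co_docs": float(co_docs),
        "co_window": float(co_window),
    }
```

The raw values returned here would still need to be normalized into [0, 1] per feature before training.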
S4, candidate expansion term ranking model training step: taking as input the relevance labels and features of the candidate expansion terms obtained in step S3, the ranking model of the word ranking algorithm is trained to obtain a weight value for each feature. Specifically: select one candidate expansion term labeled relevant in step S3 (that is, with label 1) and several candidate expansion terms labeled not relevant (that is, with label 0) to form a word group, and select a number of such word groups as training samples; randomly assign initial weights to the features of the candidate expansion terms, and rank the expansion terms within each word group by their feature-weighted score; from the ranking result of each word group, compute the global ranking loss, and dynamically adjust the weight of each feature dimension according to the gradient of the loss function. The global ranking loss is computed over the word groups, where NumSample is the number of word groups of candidate expansion terms and loss_i is the loss of the i-th word group; this loss is obtained from the rank position of the relevant expansion term, and the nearer the top the term is ranked, the smaller the loss. The above process is iterated until the overall loss falls below a threshold or the specified number of iterations is reached, at which point training is complete; the finally selected feature weight values constitute the trained ranking model. In this embodiment, training stops after 100 iterations.
In this embodiment, the loss of a word group is computed from rank_i, the position of the relevant candidate expansion term in the ranked list of its word group: the loss is 0 when the term is ranked first and reaches its maximum when the term is ranked last. The formula for the loss includes, but is not limited to, this one.
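The training loop of step S4 can be sketched in Python as follows. This is an illustration under stated assumptions, not the loss of the embodiment: because a loss defined directly on the rank position is not differentiable and its exact formula is not reproduced in this text, the sketch substitutes a softmax cross-entropy surrogate, which also decreases as the relevant term moves toward the top of its word group. All names, the learning rate and the seed are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

# A word group: (features of the relevant term, features of the irrelevant terms).
Group = Tuple[Sequence[float], List[Sequence[float]]]


def score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Feature-weighted score of one term."""
    return sum(w * x for w, x in zip(weights, features))


def rank_of_relevant(weights: Sequence[float], group: Group) -> int:
    """1-based position of the relevant term inside its word group."""
    relevant, irrelevant = group
    s = score(weights, relevant)
    return 1 + sum(1 for f in irrelevant if score(weights, f) > s)


def train(
    groups: List[Group],
    num_features: int,
    iterations: int = 100,
    lr: float = 0.5,
    seed: int = 0,
) -> List[float]:
    """Learn one weight per feature by gradient descent on a surrogate loss."""
    rng = random.Random(seed)
    weights = [rng.random() for _ in range(num_features)]  # random initial weights
    for _ in range(iterations):
        grad = [0.0] * num_features
        for relevant, irrelevant in groups:
            items = [relevant] + list(irrelevant)
            scores = [score(weights, f) for f in items]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
            z = sum(exps)
            probs = [e / z for e in exps]
            # Gradient of -log p(relevant): sum_j p_j * x_j - x_relevant.
            for k in range(num_features):
                grad[k] += sum(p * f[k] for p, f in zip(probs, items)) - relevant[k]
        weights = [w - lr * g / len(groups) for w, g in zip(weights, grad)]
    return weights
```

The helper rank_of_relevant can be used to monitor how far up each relevant term moves during training, or to implement a threshold-based stopping rule.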
In the ranking model, the final score of an expansion term is computed as:

$$
score(term) = \sum_{i=1}^{FeatureNum} a_i \cdot feature_i(term)
$$

where FeatureNum is the total number of features, a_i is the weight value of the i-th feature, and feature_i(term) is the value of the i-th feature of the candidate term term. The ranking model trained here can then be used to select relevant expansion terms for test queries. All of the above steps are carried out offline.
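The scoring of expansion terms as a weighted sum of feature values, followed by selection of the highest-scoring terms, can be sketched in Python. The function names are hypothetical, and the feature values are assumed to be already normalized.

```python
from typing import Dict, List, Sequence, Tuple


def score_term(weights: Sequence[float], features: Sequence[float]) -> float:
    """Weighted sum of feature values: sum_i a_i * feature_i(term)."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(a * f for a, f in zip(weights, features))


def select_top_k(
    candidates: Dict[str, Sequence[float]],
    weights: Sequence[float],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (term, score) pairs, best first."""
    scored = [(term, score_term(weights, feats)) for term, feats in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The same two functions apply unchanged in the online stage, where the weights come from the trained model and the candidates from the new query's result documents.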
The online query stage comprises the following steps:
S5, online search engine query and extraction step: for a new query submitted online by the user, retrieve the top N1 query results; extract the specialized terms in the top N1 retrieval results and their features according to the biomedical resource, where N1 is a natural number;
Note that this step is performed online: after the user submits a query to the biomedical literature search engine, the method automatically obtains the N1 highest-ranked results of the preliminary retrieval for use in processing such as expansion of the user's query; this processing is transparent to the user.
S6, online candidate expansion term retrieval, feature extraction and scoring step: for the new query, the candidate expansion term extraction method and the feature extraction method of offline steps S2-S3 are applied, using the biomedical resource, to extract the online-stage specialized terms in the top N1 retrieval results and their features, giving the online-stage candidate expansion terms; the extracted features are used to measure the importance of a candidate expansion term in the expanded query. The online-stage candidate expansion terms are scored with the feature weights trained in step S4, and the K1 highest-scoring online-stage candidate expansion terms are added to the new query submitted online to form the online-stage expanded query, where K1 is a natural number;
An online-stage candidate expansion term extracted using the biomedical resource is scored as

$$
score(term) = \sum_{i=1}^{FeatureNum} a_i \cdot feature_i(term)
$$

where FeatureNum is the total number of features, a_i is the weight value of the i-th feature in the ranking model, and feature_i(term) is the value of the i-th feature of the online-stage candidate expansion term term;
The online-stage candidate expansion terms are ranked by score, and the K1 highest-ranked terms are added to the new query. The weight of an added online-stage candidate expansion term in the expanded query is expressed in terms of sign and weight_original, where sign is a sign function that takes the value 1 when the online-stage candidate expansion term already appears in the new query submitted online and the value 0 otherwise, and weight_original is the weight of the new query submitted online within the expanded query;
The final expanded query has the following form:
(weight_1 query_original weight_2(w_1 term_1 w_2 term_2 … w_K term_K))
where weight_1 is the weight of the new query submitted online within the expanded query, weight_2 is the overall weight of the newly added expansion terms within the expanded query, w_1, w_2, …, w_K are the score weights of the expansion terms term_1, term_2, …, term_K, and K is the number of expansion terms finally selected. In this embodiment, weight_1 is 0.5, weight_2 is 0.5, and K is 50.
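The assembly of the expanded query string can be sketched in Python as shown below. The function name is hypothetical, and the per-term weights w_i are obtained here by dividing each score by the sum of the selected scores, which is an assumption; the sign-based weighting of the embodiment is not reproduced. The defaults mirror the values of this embodiment (0.5, 0.5 and K = 50).

```python
from typing import Dict


def build_expanded_query(
    original_query: str,
    scored_terms: Dict[str, float],
    k: int = 50,
    weight_original: float = 0.5,
    weight_expansion: float = 0.5,
) -> str:
    """Format (weight_1 query weight_2(w_1 term_1 ... w_K term_K))."""
    top = sorted(scored_terms.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(s for _, s in top)
    parts = []
    for term, s in top:
        # Assumption: normalize scores so the expansion weights sum to 1.
        w = s / total if total > 0 else 1.0 / len(top)
        parts.append(f"{w:.3f} {term}")
    return f"({weight_original} {original_query} {weight_expansion}({' '.join(parts)}))"
```

The resulting string would still have to be translated into the weighted-query syntax of whichever retrieval engine is used.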
S7, query result return step: retrieval is run with the expanded query and the retrieval results are returned to the user, completing the retrieval process.
Corresponding to the above method, the present invention also provides a biomedical literature retrieval system based on the word ranking algorithm. Figure 2 shows the logical structure of the system.
A biomedical literature retrieval system based on the word ranking algorithm comprises an offline training part and an online retrieval part; the offline training part comprises the following:
Search engine query extraction module: used to extract, from the historical query records of the search engine, multiple queries and the top N query result documents obtained for each query, and to collect the queries and query result documents into a query pool, where N is a natural number. The search engine query extraction module retrieves the biomedical literature associated with a user's query and returns the retrieval results to the user; internal operations of the system, such as query expansion, are transparent to the user and cannot be seen.
Candidate expansion term extraction module: used, given a user query, to extract specialized terms from the top N query result documents obtained by the search engine query extraction module, using the resources of the biomedical domain, and to record the number of occurrences (frequency), or the weighted sum of the numbers of occurrences, of each specialized term in the query result documents; the specialized terms are sorted in descending order of occurrence count or weighted sum, and the M specialized terms with the highest occurrence counts are selected as candidate expansion terms, where M is a natural number;
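The work of this module can be sketched in Python as follows. The sketch assumes tokenized result documents and a set of specialized terms taken from the biomedical resource; the function name and the optional per-document decay list (one factor d_i per result document, as in claim 2) are illustrative. When no decay is given, plain occurrence counts are used.

```python
from collections import Counter
from typing import List, Optional, Sequence, Set


def extract_candidates(
    result_docs: List[List[str]],
    vocabulary: Set[str],
    m: int,
    decay: Optional[Sequence[float]] = None,
) -> List[str]:
    """Return the M specialized terms with the highest (weighted) counts."""
    if decay is None:
        decay = [1.0] * len(result_docs)  # unweighted occurrence counts
    weighted: Counter = Counter()
    for doc, d in zip(result_docs, decay):
        for tok in doc:
            if tok in vocabulary:
                weighted[tok] += d  # each occurrence in document i adds d_i
    # Descending by weighted count; ties broken alphabetically for determinism.
    ranked = sorted(weighted.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:m]]
```

A decreasing decay list gives more influence to terms found in higher-ranked result documents.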
Candidate expansion term feature extraction and labeling module: used to extract the associated features of the candidate expansion terms obtained by the candidate expansion term extraction module, and to label the relevance of each candidate expansion term according to its effect on retrieval performance. During offline training, the relevance labels and features of the candidate expansion terms serve as the input of the word ranking algorithm; during online querying, the module is used to extract the feature information associated with the candidate expansion terms.
Candidate expansion term ranking model training module: used, after the features of the candidate expansion terms have been extracted and their relevance labeled, to train the term ranking model with the word ranking algorithm and to output a weight value for each feature of the candidate expansion terms; these weight values can be used to measure the importance of expansion terms for unseen queries. Specifically: select one candidate expansion term labeled relevant by the feature extraction and labeling module and several candidate expansion terms labeled not relevant to form a word group, and select a number of such word groups as training samples; randomly assign initial weights to the features of each candidate expansion term, and rank the candidate expansion terms within each word group by their feature-weighted score; from the ranking result of each word group, compute the global ranking loss, and dynamically adjust the weight of each feature dimension according to the gradient of the loss function. The global ranking loss is computed over the word groups, where NumSample is the number of word groups of candidate expansion terms and loss_i is the loss of the i-th word group; this loss is obtained from the rank position of the relevant expansion term, and the nearer the top the term is ranked, the smaller the loss. The above process is iterated until the overall loss falls below a threshold or the specified number of iterations is reached, at which point training is complete; the finally selected feature weight values constitute the trained ranking model.
The online retrieval part comprises:
Query reconstruction module: used to extract the specialized terms for a new query and to score the candidate expansion terms. It includes an online search engine query extraction module and an online candidate expansion term retrieval, feature extraction and scoring module. The online search engine query extraction module is used, for a new query submitted online by the user, to retrieve the top N1 query results and to extract the specialized terms in the top N1 retrieval results and their features according to the biomedical resource, where N1 is a natural number. The online candidate expansion term retrieval, feature extraction and scoring module uses the candidate expansion term weight values output by the term ranking model to compute scores and the corresponding weights, and adds the terms to the original query to obtain the expanded query.
Query result return module: used to return to the user the result documents retrieved with the expanded query. The results the user receives are in fact the results of the submitted query after query expansion, and the query expansion process is invisible to the user.
Following the above description of the method and system embodiments of the present invention, a specific example is now given. In this example it is assumed that the ranking model has already been trained on historical data. When the user submits a new query "mad cow disease", the system first selects candidate expansion terms according to the term frequency information in the documents returned by the preliminary retrieval. The 10 top-ranked candidate expansion terms and their relevance labels are shown in the table below:
Rank | Term | Relevance |
1 | Disease | Relevant |
2 | Prions | Relevant |
3 | Cause | Not relevant |
4 | Infectious | Relevant |
5 | Conversion | Not relevant |
6 | Cow | Relevant |
7 | Spongiform | Relevant |
8 | Fatal | Not relevant |
9 | Encephalopathies | Relevant |
10 | Mad | Relevant |
As the table above shows, 3 of the 10 top-ranked candidate expansion terms are not relevant; adding them directly to the original query would harm retrieval performance. Next, features related to the candidate expansion terms are extracted from the documents and from the biomedical thesaurus MeSH, the ranking model supplies the weight of each feature, and all candidate expansion terms are scored and re-ranked.
The 10 top-ranked expansion terms finally selected after re-ranking are shown in the table below. As the table shows, after re-ranking, the 10 highest-ranked terms in the expanded query are all relevant terms. Adding these terms to the original query, with their normalized ranking scores as weights, and retrieving with the result can further improve retrieval performance.
The description of the above embodiments explains and illustrates the biomedical literature retrieval method and system based on the word ranking algorithm provided by the present invention. The method and system use resources such as the knowledge bases of the biomedical domain to expand the original query submitted by the user, and use the word ranking algorithm to measure the importance of expansion terms during expansion. The query expansion process supplements and refines the query submitted by the user, ensuring the accuracy of the query results and further meeting the user's information needs.
The above is a further detailed description of the present invention in connection with specific preferred technical solutions, and the specific implementation of the invention shall not be regarded as limited to these descriptions. A person of ordinary skill in the art to which the invention belongs may make simple deductions or substitutions without departing from the inventive concept, and all of these shall be regarded as falling within the protection scope of the invention.
Claims (8)
1. A biomedical literature retrieval method based on a word ranking algorithm, characterized in that it comprises an offline training stage and an online query stage, wherein the offline training stage comprises the following steps:
S1, search engine query extraction step: from the historical query records of the search engine, extract multiple queries and the top N query result documents obtained for each query, and collect the queries and query result documents into a query pool, where N is a natural number;
S2, candidate expansion term extraction step: according to the biomedical resource, extract the specialized terms in the top N query result documents of each query in the query pool, and count the number of occurrences, or the weighted sum of the numbers of occurrences, of each specialized term in the query result documents; sort the specialized terms in descending order of occurrence count or weighted sum, and select the M specialized terms with the highest occurrence count or the highest weighted sum as candidate expansion terms, where M is a natural number;
S3, candidate expansion term feature extraction and labeling step:
Feature extraction and labeling of the candidate expansion terms are carried out simultaneously; the relevance of a candidate expansion term is labeled by comparing the retrieval performance of the original query with the retrieval performance obtained when the candidate expansion term is added to the original query; the evaluation metrics for retrieval performance include precision, mean average precision, NDCG and MRR; relevance is labeled as follows:
$$
\mathrm{label} = \begin{cases} 1, & \mathrm{eval}(query + term) > \mathrm{eval}(query) \\ 0, & \mathrm{eval}(query + term) \le \mathrm{eval}(query) \end{cases}
$$
where eval() is the evaluation metric function that measures retrieval performance, eval(query + term) is the score of eval() when the candidate expansion term term is added to the query query, and eval(query) is the score of the evaluation metric function for the query query alone; a label of 1 indicates that the candidate expansion term is relevant to the query query; a label of 0 indicates that the candidate expansion term is not relevant to the query query;
Feature extraction for the candidate expansion terms extracts, from the biomedical resource and from the top N query result documents returned for the queries in the query pool, the distribution of the candidate expansion term in the result documents, the distribution of the candidate term in the biomedical resource, and the relatedness of the candidate expansion term to the original query, in preparation for training the ranking model; after the features of a candidate expansion term have been extracted, all feature values are normalized so that they lie in the interval [0, 1]; the normalization is as follows:
$$
newFeatureValue = \frac{oldFeatureValue - minValue}{maxValue - minValue}
$$
where minValue and maxValue are respectively the minimum and maximum values of the given feature;
S4, candidate expansion term ranking model training step: from the relevance labels and features of the candidate expansion terms, train with the word ranking algorithm to obtain a weight value for each feature, specifically: select one candidate expansion term labeled relevant in step S3 and several candidate expansion terms labeled not relevant to form a word group, and select a number of such word groups as training samples; randomly assign initial weights to the features of each candidate expansion term, and rank the candidate expansion terms within each word group by their feature-weighted score; from the ranking result of each word group, compute the global ranking loss, and dynamically adjust the weight of each feature dimension according to the gradient of the loss function, wherein the global ranking loss is computed over the word groups, NumSample is the number of word groups of candidate expansion terms, and loss_i is the loss of the i-th word group, this loss being obtained from the rank position of the relevant expansion term, and the nearer the top the term is ranked, the smaller the loss; the above process is iterated until the overall loss falls below a threshold or the specified number of iterations is reached, at which point training is complete, and the finally selected feature weight values constitute the trained ranking model;
The online query stage comprises the following steps:
S5, online search engine query and extraction step: for a new query submitted online by the user, retrieve the top N1 query results; extract the specialized terms in the top N1 retrieval results and their features according to the biomedical resource, where N1 is a natural number;
S6, online candidate expansion term retrieval, feature extraction and scoring step: for the new query, apply the candidate expansion term extraction method and the feature extraction method of offline steps S2-S3, using the biomedical resource, to extract the online-stage specialized terms in the top N1 retrieval results and their features, giving the online-stage candidate expansion terms, the extracted features being used to measure the importance of a candidate expansion term in the expanded query; score the online-stage candidate expansion terms with the feature weights trained in step S4, and add the K1 highest-scoring candidate expansion terms to the new query submitted online to form the expanded query, where K1 is a natural number;
An online-stage candidate expansion term extracted using the biomedical resource is scored as

$$
score(term) = \sum_{i=1}^{FeatureNum} a_i \cdot feature_i(term)
$$

where FeatureNum is the total number of features, a_i is the weight value of the i-th feature in the ranking model, and feature_i(term) is the value of the i-th feature of the online-stage candidate expansion term term;
The online-stage candidate expansion terms are ranked by score, and the K1 highest-ranked terms are added as expansion terms to the new query submitted online; the weight of an added online-stage candidate expansion term in the expanded query is expressed in terms of sign and weight_original, where sign is a sign function that takes the value 1 when the online-stage candidate expansion term already appears in the new query submitted online and the value 0 otherwise, and weight_original is the weight of the new query submitted online within the expanded query;
S7, query result return step: run retrieval with the expanded query and return the retrieval results to the user.
2. The biomedical literature retrieval method based on a word ranking algorithm according to claim 1, characterized in that, in step S2, the weighted sum of the numbers of occurrences of a specialized term in the query result documents is $\sum_{i=1}^{N} count_i \cdot d_i$, where count_i is the number of times the term occurs in the i-th document and d_i is the decay factor of the i-th document.
3. The biomedical literature retrieval method based on a word ranking algorithm according to claim 1, characterized in that, in step S3, the evaluation metric function eval() is the average precision function, i.e.:
$$
\mathrm{eval}_{MAP} = \frac{1}{RelDoc_{query}} \cdot \sum_{i=1}^{RelDoc_{query}} \frac{i}{rank(i)}
$$
where RelDoc_query is the number of documents relevant to the given query query, and rank(i) is the position of the i-th relevant document in the ranked list of result documents.
4. The biomedical literature retrieval method based on a word ranking algorithm according to claim 1, characterized in that, in step S1, when no historical query records are available, queries and their result records are produced manually by constructing biomedical queries and applying a retrieval method; the retrieval method uses a vector space model, the BM25 retrieval model, or a language model based on different smoothing methods.
5. The biomedical literature retrieval method based on a word ranking algorithm according to claim 1, characterized in that the loss in step S4 is computed from rank_i, where rank_i is the position of the relevant candidate expansion term in the ranked list of its word group.
6. The biomedical literature retrieval method based on a word ranking algorithm according to claim 1, characterized in that the biomedical resource refers to a dictionary or knowledge base containing biomedical specialized terms.
7. The biomedical literature retrieval method based on a word ranking algorithm according to claim 1, characterized in that the features of a candidate expansion term include: the frequency (TF) with which the candidate expansion term occurs in the result documents; the TF-IDF value of the candidate expansion term; the number of documents in which the candidate expansion term and the original query occur together; the number of times the candidate expansion term and the original query occur together within one text window; the number of times the candidate expansion term occurs in the biomedical resource; the number of term concepts in the biomedical resource that contain the candidate expansion term; and the containment relations between biomedical term concepts.
8. a kind of Biomedical literature searching system of word-based grading sorting algorithm, it is characterised in that including off-line training portion
Point and on-line search part;The off-line training part is included with lower part:
Search engine inquiry extraction module:For being recorded according to the historical query of search engine, more group pollings and each are extracted
The preceding N bars Query Result document obtained in inquiry;And by inquiry and Query Result document collection into an inquiry pond, wherein N
For natural number;
Candidate extends vocabulary extraction module:For when given user inquires about, using the intrinsic resource of biomedical sector, searching
In the top n Query Result document that rope engine queries extraction module obtains, extraction obtains specialized vocabulary, and the specialized vocabulary is existed
The frequency or the weighted sum of occurrence number occurred in Query Result document is recorded;Tied according to each specialized vocabulary in inquiry
The number occurred in fruit document or the weighted sum descending arrangement of occurrence number, select occurrence number M specialized vocabulary of highest
Vocabulary is extended as candidate, wherein M is natural number;
Candidate extends feature extraction and the labeling module of vocabulary:For the candidate obtained by being extended in candidate in vocabulary extraction module
Associated feature is extracted in extension vocabulary, and influence of the vocabulary for retrieval performance is extended according to candidate, mark candidate expands
Open up the degree of correlation of vocabulary;
Candidate expansion word ranking model training module: used to train a word ranking model with the word grading sorting algorithm, after the features of the candidate expansion words have been extracted and their relevance has been labeled, so as to obtain a weight value for each feature of the candidate expansion words: one candidate expansion word labeled as relevant in the feature extraction and labeling module and several candidate expansion words labeled as irrelevant are selected to form a word group, and a number of such word groups are selected as training samples; the features of each candidate expansion word are randomly assigned initial weights, and the candidate expansion words within each word group are ranked by their feature-weighted scores; from the ranking result of each word group a global ranking loss is calculated, and the weight of each feature dimension is adjusted dynamically according to the gradient of the loss function, wherein the ranking loss is computed over the word groups from the per-group loss values, NumSample being the number of candidate expansion word groups and loss_i the loss value of the i-th word group; each loss value is obtained from the ranking position of the relevant expansion word, and the higher that position, the smaller the corresponding loss value; the above process is iterated until the overall loss value falls below a given threshold or the specified number of iterations is reached, at which point training is complete, and the finally selected feature weight values serve as the trained ranking model;
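The training loop described by the ranking model training module can be illustrated with a minimal Python sketch. It is not the patented implementation: the text states only that each group's loss depends on the ranking position of the relevant word and that the weights follow the gradient of the loss. The softmax cross-entropy used here as a smooth stand-in for a position-based loss, the averaging over groups, the learning rate, and all function names are assumptions made for illustration.

```python
import math
import random

def score(weights, features):
    """Feature-weighted score of one candidate expansion word."""
    return sum(w * x for w, x in zip(weights, features))

def group_loss_and_grad(weights, group):
    """group = (relevant_features, [irrelevant_features, ...]).

    Assumed surrogate loss: softmax cross-entropy of the relevant word,
    which shrinks as the relevant word moves toward the top of its group.
    """
    relevant, irrelevant = group
    candidates = [relevant] + irrelevant
    scores = [score(weights, f) for f in candidates]
    top = max(scores)                      # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    loss = math.log(total) - (scores[0] - top)
    grad = [
        sum(p * f[d] for p, f in zip(probs, candidates)) - relevant[d]
        for d in range(len(weights))
    ]
    return loss, grad

def train(groups, num_features, lr=0.5, threshold=0.05, max_iters=500, seed=0):
    """Gradient descent on the mean group loss (averaging is an assumption)."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.1, 0.1) for _ in range(num_features)]  # random init
    loss = float("inf")
    for _ in range(max_iters):
        total_loss = 0.0
        total_grad = [0.0] * num_features
        for group in groups:
            g_loss, g_grad = group_loss_and_grad(weights, group)
            total_loss += g_loss
            total_grad = [a + b for a, b in zip(total_grad, g_grad)]
        loss = total_loss / len(groups)
        if loss < threshold:               # stop once the overall loss is small
            break
        weights = [w - lr * g / len(groups) for w, g in zip(weights, total_grad)]
    # If the iteration limit is hit, loss was measured before the last update.
    return weights, loss

def rank_position(weights, group):
    """1-based position of the relevant word within its group."""
    relevant, irrelevant = group
    s_rel = score(weights, relevant)
    return 1 + sum(1 for f in irrelevant if score(weights, f) > s_rel)
```

Each group here is a pair of one relevant feature vector and a list of irrelevant ones, mirroring the word groups of the text; calling `train(groups, num_features=2)` on a small hypothetical set of groups returns the learned weights and the last measured loss, and `rank_position` can then be used to check where the relevant word lands.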
The online search part comprises:
Query reconstruction module: used to extract the specialized terms for a new query and to score the candidate expansion words; it comprises an online search engine query extraction module and an online candidate expansion word retrieval, feature extraction and scoring module, wherein the online search engine query extraction module retrieves the top N1 query results for a new query submitted online by the user, and extracts the specialized terms in the top N1 retrieval results, together with their features, according to biomedical resources, where N1 is a natural number; the online candidate expansion word retrieval, feature extraction and scoring module uses the candidate expansion word weight scores output by the word ranking model to compute the corresponding weights, and adds the weighted words to the original query to obtain the expanded query;
Query result return module: used to return to the user the result documents retrieved with the expanded query.
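The candidate selection and query expansion steps of the system above can be sketched in Python as follows. This is a hedged illustration only: the whitespace tokenizer, the `lexicon` set standing in for the biomedical resources, the cut-off `k`, the weight of 1.0 given to original query terms, and both function names are assumptions that do not come from the patent text.

```python
from collections import Counter

def select_candidates(result_docs, lexicon, m):
    """Count lexicon terms in the top-N result documents, keep the M most frequent."""
    counts = Counter()
    for doc in result_docs:
        for token in doc.lower().split():   # naive tokenizer (assumption)
            if token in lexicon:
                counts[token] += 1
    return [term for term, _ in counts.most_common(m)]

def expand_query(query_terms, candidate_features, weights, k):
    """Score candidates with the trained feature weights and append the best k.

    candidate_features maps each candidate term to its feature vector.
    Original terms get weight 1.0 (assumption); added terms carry their score.
    """
    scored = {
        term: sum(w * x for w, x in zip(weights, feats))
        for term, feats in candidate_features.items()
        if term not in query_terms
    }
    best = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:k]
    expanded = [(term, 1.0) for term in query_terms]
    expanded += [(term, s) for term, s in best]
    return expanded
```

In this sketch the expanded query is a list of (term, weight) pairs; how those weights are passed to the underlying search engine depends on the engine and is left open.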
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510147696.5A CN104750819B (en) | 2015-03-31 | 2015-03-31 | The Biomedical literature search method and system of a kind of word-based grading sorting algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510147696.5A CN104750819B (en) | 2015-03-31 | 2015-03-31 | The Biomedical literature search method and system of a kind of word-based grading sorting algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104750819A (en) | 2015-07-01 |
CN104750819B (en) | 2018-01-23 |
Family
ID=53590503
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510147696.5A Active CN104750819B (en) | 2015-03-31 | 2015-03-31 | The Biomedical literature search method and system of a kind of word-based grading sorting algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104750819B (en) |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106095838A (en) * | 2016-06-01 | 2016-11-09 | 比美特医护在线(北京)科技有限公司 | A kind of data processing method and device |
US20180025121A1 (en) * | 2016-07-20 | 2018-01-25 | Baidu Usa Llc | Systems and methods for finer-grained medical entity extraction |
CN106294654B (en) * | 2016-08-04 | 2018-01-19 | 首都师范大学 | A kind of body sort method and system |
CN106919649B (en) * | 2017-01-19 | 2020-06-26 | 北京奇艺世纪科技有限公司 | Entry weight calculation method and device |
CN108509461A (en) * | 2017-02-28 | 2018-09-07 | 华为技术有限公司 | A kind of sequence learning method and server based on intensified learning |
CN110019888A (en) * | 2017-12-01 | 2019-07-16 | 北京搜狗科技发展有限公司 | A kind of searching method and device |
CN108520038B (en) * | 2018-03-31 | 2020-11-10 | 大连理工大学 | Biomedical literature retrieval method based on sequencing learning algorithm |
CN109508392A (en) * | 2018-09-28 | 2019-03-22 | 中国标准化研究院 | A kind of technical literature index announcement search method |
CN109857731A (en) * | 2019-01-11 | 2019-06-07 | 吉林大学 | A kind of peek-a-boo and search method of biomedicine entity relationship |
CN113434767A (en) * | 2021-07-07 | 2021-09-24 | 携程旅游信息技术(上海)有限公司 | UGC text content mining method, system, device and storage medium |
CN113486156A (en) * | 2021-07-30 | 2021-10-08 | 北京鼎普科技股份有限公司 | ES-based associated document retrieval method |
CN113742459B (en) * | 2021-11-05 | 2022-03-04 | 北京世纪好未来教育科技有限公司 | Vocabulary display method and device, electronic equipment and storage medium |
CN115016873B (en) * | 2022-05-05 | 2024-07-12 | 上海乾臻信息科技有限公司 | Front-end data interaction method, system, electronic equipment and readable storage medium |
CN115659047B (en) * | 2022-11-11 | 2023-07-28 | 南京汇宁桀信息科技有限公司 | Medical document retrieval method based on hybrid algorithm |
CN117076658B (en) * | 2023-08-22 | 2024-05-03 | 南京朗拓科技投资有限公司 | Quotation recommendation method, device and terminal based on information entropy |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103942302A (en) * | 2014-04-16 | 2014-07-23 | 苏州大学 | Method for establishment and application of inter-relevance-feedback relational network |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7287025B2 (en) * | 2003-02-12 | 2007-10-23 | Microsoft Corporation | Systems and methods for query expansion |
Non-Patent Citations (3)
Title |
---|
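A learning-to-rank method based on position optimization; Lin Yuan et al.; Journal of Shandong University (Engineering Science); 2012-02-29; full text * |
Research on query expansion techniques in personalized intelligent search engines; Zhu Yujiao; Wanfang Data; 2012-12-25; full text * |
Drug name dictionary generation based on template extraction and rich features; Xu Bo et al.; Proceedings of the 5th National Conference on Information Retrieval; 2009-11-14; full text * |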
Also Published As
Publication number | Publication date |
---|---|
CN104750819A (en) | 2015-07-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104750819B (en) | The Biomedical literature search method and system of a kind of word-based grading sorting algorithm | |
CN104699730B (en) | For identifying the method and system of the relation between candidate answers | |
CN104699763B (en) | The text similarity gauging system of multiple features fusion | |
CN105404632B (en) | System and method for carrying out serialized annotation on biomedical text based on deep neural network | |
CN107133213A (en) | A kind of text snippet extraction method and system based on algorithm | |
CN104331449B (en) | Query statement and determination method, device, terminal and the server of webpage similarity | |
CN106484675A (en) | Fusion distributed semantic and the character relation abstracting method of sentence justice feature | |
CN109344236A (en) | One kind being based on the problem of various features similarity calculating method | |
CN108520038B (en) | Biomedical literature retrieval method based on sequencing learning algorithm | |
CN109948143A (en) | The answer extracting method of community's question answering system | |
CN102662931A (en) | Semantic role labeling method based on synergetic neural network | |
CN110879831A (en) | Chinese medicine sentence word segmentation method based on entity recognition technology | |
CN109492105B (en) | Text emotion classification method based on multi-feature ensemble learning | |
CN109635083A (en) | It is a kind of for search for TED speech in topic formula inquiry document retrieval method | |
CN110298036A (en) | A kind of online medical text symptom identification method based on part of speech increment iterative | |
CN111008215B (en) | Expert recommendation method combining label construction and community relation avoidance | |
CN110851593B (en) | Complex value word vector construction method based on position and semantics | |
CN108090223A (en) | A kind of opening scholar portrait method based on internet information | |
CN113064999B (en) | Knowledge graph construction algorithm, system, equipment and medium based on IT equipment operation and maintenance | |
Pandiaraj et al. | Effective heart disease prediction using hybridmachine learning | |
CN106611016B (en) | A kind of image search method based on decomposable word packet model | |
CN104537280B (en) | Protein interactive relation recognition methods based on text relation similitude | |
CN113360647A (en) | 5G mobile service complaint source-tracing analysis method based on clustering | |
CN107679121B (en) | Mapping method and device of classification system, storage medium and computing equipment | |
CN101533398A (en) | Method for searching pattern matching index |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |