CN106682136A - Traditional-Chinese-medicine medical literature classification and storage method based on data mining - Google Patents

Info

Publication number
CN106682136A
Authority
CN
China
Prior art keywords
data
key word
traditional
traditional chinese
chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611174644.8A
Other languages
Chinese (zh)
Other versions
CN106682136B (en)
Inventor
谭红春
孟庆全
谷宗运
耿英保
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Yiyuan Intelligent Technology Co Ltd
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201611174644.8A
Publication of CN106682136A
Application granted
Publication of CN106682136B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a traditional-Chinese-medicine medical literature classification and storage method based on data mining. The method includes the following steps: acquiring a basic database of traditional-Chinese-medicine medical literature; saving the literature in a specific encoding format in the order in which it was downloaded; extracting key information from the downloaded unstructured text data and establishing a coding table of the key information of the traditional-Chinese-medicine medical literature; dividing all keywords corresponding to core data into a number of core data groups and, likewise, dividing all keywords corresponding to non-core data into a number of non-core data groups; obtaining keyword combinations of traditional-Chinese-medicine literature, using the keyword combinations as search keywords to obtain the corresponding traditional-Chinese-medicine literature as the traditional-Chinese-medicine medical literature, storing that literature, and using the keyword combinations as its storage identifiers. The method achieves efficient and accurate acquisition of traditional-Chinese-medicine medical literature data and completes accurate classification and storage of the literature.

Description

A traditional Chinese medicine (TCM) medical literature classification and storage method based on data mining
Technical field
The present invention relates to the technical field of literature data processing, and in particular to a TCM medical literature classification and storage method based on data mining.
Background art
Traditional Chinese medicine is an important and distinctive component of China's medical and health care system today, and it has played a positive role in enriching the world's medical knowledge and protecting human health. Under current conditions, research in the Chinese medicine field emphasizes inheriting the classics while equally emphasizing cross-fertilization with fields such as Western medicine, pharmacy, informatics, and biology, which forms new growth points for the discipline. In academic research this appears as the publication of papers whose topics span two or more disciplines. In medical research, domain experts and scholars typically rely on qualitative methods: after reading a large body of literature, they subjectively identify the research hotspots of a field or discipline from their own experience and accumulated knowledge, and produce review articles or reports for reference.
In the prior art, TCM literature is analyzed and organized manually; there is as yet no systematic literature classification approach based on big data analysis.
Summary of the invention
To solve the above technical problem, the present invention provides a TCM medical literature classification and storage method based on data mining, comprising the following steps:
Searching a specific knowledge database with TCM search keywords, and obtaining a number of TCM documents that match the TCM search keywords as a basic database of TCM medical literature;
Merging the documents into a single flat file in the order in which they were downloaded, and saving the file in a specific encoding format;
Then extracting key information from the downloaded unstructured text data, saving it in a specific data format, and establishing a coding table of the key information of the TCM medical literature, in which each item of key information corresponds to a binary code;
The key information comprises two types, core data and non-core data. The extracted data are first stored in a corresponding database as the basic data for subsequent processing, and are then imported into SQL for the next step of mining analysis. The core data are keywords whose frequency in the downloaded text data exceeds a set threshold; the non-core data are keywords that occur at least once in the downloaded text data but with a frequency below the set threshold. The data schema of a keyword i corresponding to the core data can be expressed as a relation table Hi(B1, B2, ..., Bmi), where B denotes the attribute values of the keyword, and the attribute values of each keyword are preset according to the specific TCM field to which the keyword belongs; the data schemas of the databases {D1, D2, ..., Dni} corresponding to the keyword can be mapped onto Hi. Likewise, the data schema of a keyword j corresponding to the non-core data can be expressed as a relation table Hj(B1, B2, ..., Bmj), where B denotes the attributes of the keyword, and the data schemas of the databases {D1, D2, ..., Dnj} corresponding to the keyword can be mapped onto Hj;
Specifically, the association quantization value of each keyword is $K(B_{mi}, D_{ni}) = \rho^{\mathrm{dist}(B_{mi}, D_{ni})} - 1$ and $K(B_{mj}, D_{nj}) = \rho^{\mathrm{dist}(B_{mj}, D_{nj})} - 1$, where $\mathrm{dist}(B_{mi}, D_{ni})$ is the Euclidean distance between $B_{mi}$ and $D_{ni}$, $\mathrm{dist}(B_{mj}, D_{nj})$ is the Euclidean distance between $B_{mj}$ and $D_{nj}$, and $\rho > 1$ is a scaling factor. $K(B_{mi}, D_{ni})$ and $K(B_{mj}, D_{nj})$ denote the association quantization values of keyword i (core data) and keyword j (non-core data), respectively. All keywords whose association quantization values fall within a given threshold range are taken as one keyword data group, so that all keywords corresponding to the core data are divided into a number of core data groups and, likewise, all keywords corresponding to the non-core data are divided into a number of non-core data groups;
Randomly pairing keywords in the core data groups with keywords in the non-core data groups to obtain TCM literature keyword combinations; using each keyword combination as a search keyword to obtain the corresponding TCM documents as TCM medical literature; storing the TCM medical literature; and using the TCM literature keyword combination as the storage identifier of the TCM medical literature.
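The keyword classification and coding-table steps above can be sketched in Python as follows. This is only an illustrative sketch: the function names, the example threshold, the treatment of a frequency exactly equal to the threshold, and the fixed-width binary code scheme are all assumptions, since the patent text does not specify them.

```python
from collections import Counter


def split_keywords(keywords, threshold):
    """Split keywords into core and non-core sets by frequency.

    Core: frequency strictly greater than the threshold.
    Non-core: appears at least once, frequency not above the threshold
    (placing the boundary case here is an assumption).
    """
    counts = Counter(keywords)
    core = sorted(k for k, c in counts.items() if c > threshold)
    non_core = sorted(k for k, c in counts.items() if c <= threshold)
    return core, non_core


def build_code_table(keywords):
    """Give every distinct keyword a fixed-width binary code (assumed scheme)."""
    distinct = sorted(set(keywords))
    width = max(1, (len(distinct) - 1).bit_length())
    return {k: format(i, f"0{width}b") for i, k in enumerate(distinct)}


# Hypothetical keywords extracted from downloaded documents.
extracted = ["acupuncture", "acupuncture", "acupuncture",
             "ginseng", "moxibustion", "ginseng"]
core, non_core = split_keywords(extracted, threshold=1)
code_table = build_code_table(extracted)
```

A fixed-width code keeps every entry of the table the same length, which makes the codes easy to store in a single database column; any other one-to-one binary encoding would serve equally well.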
The invention has the following beneficial effects:
The TCM medical literature classification and storage method based on data mining provided by the present invention starts from keyword retrieval in a Chinese medicine database, extracts data from the retrieved documents, performs data mining analysis on the extracted key information to obtain keyword data groups, combines the keywords in those groups and searches with the combinations to obtain the related TCM medical literature, and uses each TCM literature keyword combination as the storage identifier of that literature. The invention thereby achieves efficient and accurate acquisition of Chinese medicine literature data and completes the precise classification and storage of TCM medical literature.
Of course, a product implementing the present invention need not achieve all of the above advantages at the same time.
Detailed description of the embodiments
The technical solutions of the present invention are described clearly and completely below with reference to the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
An embodiment of the present invention provides a TCM medical literature classification and storage method based on data mining, comprising the following steps:
Searching a specific knowledge database with TCM search keywords, and obtaining a number of TCM documents that match the TCM search keywords as a basic database of TCM medical literature;
Merging the documents into a single flat file in the order in which they were downloaded, and saving the file in a specific encoding format;
Then extracting key information from the downloaded unstructured text data, saving it in a specific data format, and establishing a coding table of the key information of the TCM medical literature, in which each item of key information corresponds to a binary code;
The key information comprises two types, core data and non-core data. The extracted data are first stored in a corresponding database as the basic data for subsequent processing, and are then imported into SQL for the next step of mining analysis. The core data are keywords whose frequency in the downloaded text data exceeds a set threshold; the non-core data are keywords that occur at least once in the downloaded text data but with a frequency below the set threshold. The data schema of a keyword i corresponding to the core data can be expressed as a relation table Hi(B1, B2, ..., Bmi), where B denotes the attribute values of the keyword, and the attribute values of each keyword are preset according to the specific TCM field to which the keyword belongs; the data schemas of the databases {D1, D2, ..., Dni} corresponding to the keyword can be mapped onto Hi. Likewise, the data schema of a keyword j corresponding to the non-core data can be expressed as a relation table Hj(B1, B2, ..., Bmj), where B denotes the attributes of the keyword, and the data schemas of the databases {D1, D2, ..., Dnj} corresponding to the keyword can be mapped onto Hj;
Specifically, the association quantization value of each keyword is $K(B_{mi}, D_{ni}) = \rho^{\mathrm{dist}(B_{mi}, D_{ni})} - 1$ and $K(B_{mj}, D_{nj}) = \rho^{\mathrm{dist}(B_{mj}, D_{nj})} - 1$, where $\mathrm{dist}(B_{mi}, D_{ni})$ is the Euclidean distance between $B_{mi}$ and $D_{ni}$, $\mathrm{dist}(B_{mj}, D_{nj})$ is the Euclidean distance between $B_{mj}$ and $D_{nj}$, and $\rho > 1$ is a scaling factor. $K(B_{mi}, D_{ni})$ and $K(B_{mj}, D_{nj})$ denote the association quantization values of keyword i (core data) and keyword j (non-core data), respectively. All keywords whose association quantization values fall within a given threshold range are taken as one keyword data group, so that all keywords corresponding to the core data are divided into a number of core data groups and, likewise, all keywords corresponding to the non-core data are divided into a number of non-core data groups;
Randomly pairing keywords in the core data groups with keywords in the non-core data groups to obtain TCM literature keyword combinations; using each keyword combination as a search keyword to obtain the corresponding TCM documents as TCM medical literature; storing the TCM medical literature; and using the TCM literature keyword combination as the storage identifier of the TCM medical literature.
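The association quantization and grouping step can be illustrated with a short Python sketch. It reads the formula as K = ρ^dist − 1 with dist the Euclidean distance, and it assumes that a keyword's attribute values and the corresponding database schema have already been encoded as numeric vectors of equal length; the patent does not say how that encoding is done, so the vectors, the value of ρ, and the threshold range below are hypothetical.

```python
import math


def association_value(b, d, rho=2.0):
    """Return K = rho ** dist(b, d) - 1, where dist is the Euclidean distance.

    b and d are numeric vectors of equal length (an assumed encoding of the
    keyword attributes and the database schema). rho must exceed 1.
    """
    if rho <= 1:
        raise ValueError("rho must be greater than 1")
    return rho ** math.dist(b, d) - 1


def group_by_association(values, lower, upper):
    """Collect the keywords whose association value lies in [lower, upper]."""
    return sorted(k for k, v in values.items() if lower <= v <= upper)


# dist((0, 0), (3, 4)) = 5, so K = 2 ** 5 - 1 = 31.0
k_example = association_value((0, 0), (3, 4), rho=2.0)

# Hypothetical association values for three keywords.
values = {"acupuncture": 0.5, "ginseng": 3.0, "moxibustion": 31.0}
group = group_by_association(values, lower=0.0, upper=5.0)
```

Because ρ > 1 and the distance is never negative, K is zero for identical vectors and grows exponentially with distance, so a narrow threshold range near zero selects only closely related keywords.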
The TCM medical literature acquisition method provided by the present invention starts from keyword retrieval in a Chinese medicine database, extracts data from the retrieved documents, performs data mining analysis on the extracted key information to obtain keyword data groups, combines the keywords in those groups and searches with the combinations to obtain the related TCM medical literature, and uses each TCM literature keyword combination as the storage identifier of that literature. The invention thereby achieves efficient and accurate acquisition of Chinese medicine literature data and completes the precise classification and storage of TCM medical literature.
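The final pairing and storage step can be sketched as follows. The in-memory corpus, the `toy_search` function, and the `core+non_core` identifier format are hypothetical stand-ins for the knowledge database, its search interface, and the storage identifier, none of which the patent text specifies.

```python
import random


def pair_keywords(core_group, non_core_group, n_pairs, seed=None):
    """Randomly pick up to n_pairs distinct (core, non-core) keyword pairs."""
    all_pairs = [(c, n) for c in core_group for n in non_core_group]
    rng = random.Random(seed)
    return rng.sample(all_pairs, min(n_pairs, len(all_pairs)))


def store_by_combination(pairs, search):
    """Search with each pair and key the results by the keyword combination."""
    store = {}
    for core_kw, non_core_kw in pairs:
        identifier = f"{core_kw}+{non_core_kw}"  # assumed identifier format
        store[identifier] = search(core_kw, non_core_kw)
    return store


# Hypothetical stand-in for the knowledge database.
corpus = {
    "doc1": "acupuncture and moxibustion for insomnia",
    "doc2": "ginseng decoction study",
    "doc3": "acupuncture combined with ginseng",
}


def toy_search(a, b):
    """Return the ids of documents that mention both keywords."""
    return sorted(d for d, text in corpus.items() if a in text and b in text)


pairs = pair_keywords(["acupuncture"], ["moxibustion", "ginseng"], 2, seed=0)
store = store_by_combination(pairs, toy_search)
```

Keying the store by the keyword combination means the same combination that retrieved a document can later be used to look it up again, which is the role the storage identifier plays in the method.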
The preferred embodiments disclosed above are intended only to help illustrate the invention. They do not describe every detail exhaustively, nor do they limit the invention to the specific embodiments described. Obviously, many modifications and variations are possible in light of this specification. These embodiments were chosen and described in detail in order to better explain the principles and practical application of the invention, so that those skilled in the art can best understand and use it. The invention is limited only by the claims and their full scope and equivalents.

Claims (1)

1. A traditional Chinese medicine (TCM) medical literature classification and storage method based on data mining, characterized by comprising the following steps:
Searching a specific knowledge database with TCM search keywords, and obtaining a number of TCM documents that match the TCM search keywords as a basic database of TCM medical literature;
Merging the documents into a single flat file in the order in which they were downloaded, and saving the file in a specific encoding format;
Then extracting key information from the downloaded unstructured text data, saving it in a specific data format, and establishing a coding table of the key information of the TCM medical literature, in which each item of key information corresponds to a binary code;
The key information comprises two types, core data and non-core data. The extracted data are first stored in a corresponding database as the basic data for subsequent processing, and are then imported into SQL for the next step of mining analysis. The core data are keywords whose frequency in the downloaded text data exceeds a set threshold; the non-core data are keywords that occur at least once in the downloaded text data but with a frequency below the set threshold. The data schema of a keyword i corresponding to the core data can be expressed as a relation table Hi(B1, B2, ..., Bmi), where B denotes the attribute values of the keyword, and the attribute values of each keyword are preset according to the specific TCM field to which the keyword belongs; the data schemas of the databases {D1, D2, ..., Dni} corresponding to the keyword can be mapped onto Hi. Likewise, the data schema of a keyword j corresponding to the non-core data can be expressed as a relation table Hj(B1, B2, ..., Bmj), where B denotes the attributes of the keyword, and the data schemas of the databases {D1, D2, ..., Dnj} corresponding to the keyword can be mapped onto Hj;
Specifically, the association quantization value of each keyword is $K(B_{mi}, D_{ni}) = \rho^{\mathrm{dist}(B_{mi}, D_{ni})} - 1$ and $K(B_{mj}, D_{nj}) = \rho^{\mathrm{dist}(B_{mj}, D_{nj})} - 1$, where $\mathrm{dist}(B_{mi}, D_{ni})$ is the Euclidean distance between $B_{mi}$ and $D_{ni}$, $\mathrm{dist}(B_{mj}, D_{nj})$ is the Euclidean distance between $B_{mj}$ and $D_{nj}$, and $\rho > 1$ is a scaling factor. $K(B_{mi}, D_{ni})$ and $K(B_{mj}, D_{nj})$ denote the association quantization values of keyword i (core data) and keyword j (non-core data), respectively. All keywords whose association quantization values fall within a given threshold range are taken as one keyword data group, so that all keywords corresponding to the core data are divided into a number of core data groups and, likewise, all keywords corresponding to the non-core data are divided into a number of non-core data groups;
Randomly pairing keywords in the core data groups with keywords in the non-core data groups to obtain TCM literature keyword combinations; using each keyword combination as a search keyword to obtain the corresponding TCM documents as TCM medical literature; storing the TCM medical literature; and using the TCM literature keyword combination as the storage identifier of the TCM medical literature.
CN201611174644.8A 2016-12-19 2016-12-19 Traditional-Chinese-medicine medical literature classification and storage method based on data mining Expired - Fee Related CN106682136B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611174644.8A CN106682136B (en) 2016-12-19 2016-12-19 Traditional-Chinese-medicine medical literature classification and storage method based on data mining

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611174644.8A CN106682136B (en) 2016-12-19 2016-12-19 Traditional-Chinese-medicine medical literature classification and storage method based on data mining

Publications (2)

Publication Number Publication Date
CN106682136A true CN106682136A (en) 2017-05-17
CN106682136B CN106682136B (en) 2018-03-16

Family

ID=58869635

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611174644.8A Expired - Fee Related CN106682136B (en) 2016-12-19 2016-12-19 Traditional-Chinese-medicine medical literature classification and storage method based on data mining

Country Status (1)

Country Link
CN (1) CN106682136B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111314030A (en) * 2020-03-11 2020-06-19 重庆邮电大学 SCMA (sparse code multiple access) multi-user detection method based on spherical decoding optimization
WO2022160539A1 (en) * 2021-01-26 2022-08-04 浪达网络科技(浙江)有限公司 Data processing system and data mining method
CN116431838A (en) * 2023-06-15 2023-07-14 北京墨丘科技有限公司 Document retrieval method, device, system and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030236785A1 (en) * 2002-06-21 2003-12-25 Takahiko Shintani Method of extracting item patterns across a plurality of databases, a network system and a processing apparatus
US20080063264A1 (en) * 2006-09-08 2008-03-13 Porikli Fatih M Method for classifying data using an analytic manifold
CN101149751A (en) * 2007-10-29 2008-03-26 浙江大学 Generalized relating rule digging method for analyzing traditional Chinese medicine recipe drug matching rule
CN101599088A (en) * 2008-11-18 2009-12-09 北京美智医疗科技有限公司 The mining multi-dimensional data system and method for medical information system
CN104978347A (en) * 2014-04-11 2015-10-14 中国中医科学院中医临床基础医学研究所 Data mining method and data mining system for sensitive keywords in Chinese biomedical literature database

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
古求知 等: "中医医案类文献的分析挖掘研究", 《辽宁中医杂志》 *
金力 等: "数据挖掘在中医诊疗规则提取中的应用研究", 《时珍国医国药》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111314030A (en) * 2020-03-11 2020-06-19 重庆邮电大学 SCMA (sparse code multiple access) multi-user detection method based on spherical decoding optimization
WO2022160539A1 (en) * 2021-01-26 2022-08-04 浪达网络科技(浙江)有限公司 Data processing system and data mining method
CN116431838A (en) * 2023-06-15 2023-07-14 北京墨丘科技有限公司 Document retrieval method, device, system and storage medium
CN116431838B (en) * 2023-06-15 2024-01-30 北京墨丘科技有限公司 Document retrieval method, device, system and storage medium

Also Published As

Publication number Publication date
CN106682136B (en) 2018-03-16

Similar Documents

Publication Publication Date Title
CN105095195B (en) Man-machine question answering method and system based on knowledge graph
CN103123618B (en) Text similarity acquisition methods and device
CN107784102A (en) A kind of data difference comparative approach based on oracle database
CN102789464B (en) Natural language processing methods, devices and systems based on semantics identity
CN104008106A (en) Method and apparatus for obtaining hot topic
RU2010107150A (en) IDENTIFICATION OF SEMANTIC RELATIONS IN INDIRECT SPEECH
CN106682209A (en) Cross-language scientific and technical literature retrieval method and cross-language scientific and technical literature retrieval system
CN106682136A (en) Traditional-Chinese-medicine medical literature classification and storage method based on data mining
WO2020155749A1 (en) Method and apparatus for constructing personal knowledge graph, computer device, and storage medium
CN106227788A (en) Database query method based on Lucene
CN107291858A (en) Data indexing method based on character string suffix
WO2019228015A1 (en) Index creating method and apparatus based on nosql database of mobile terminal
CN104252542A (en) Dynamic-planning Chinese words segmentation method based on lexicons
CN103838876A (en) Method for retrieving document through pinyin and document retrieval system
CN111061828A (en) Digital library knowledge retrieval method and device
CN106802937A (en) The conversion method and system of Word document
CN103150409B (en) Method and system for recommending user search word
CN106777137B (en) A kind of traditional Chinese medicine document analysis method
CN105630822A (en) Method for marking similar contents in patent retrieval in red color
CN103927339A (en) System and method for reorganizing knowledge
CN104850559A (en) Slide independent storage, retrieval and recombination method and equipment based on presentation document
CN106295252A (en) Search method for gene prod
CN107870935A (en) A kind of searching method and device
Yuan et al. A mathematical information retrieval system based on RankBoost
CN103853832B (en) Customizable data grasping means in a kind of text retrieval system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Liu Kai

Inventor after: Wu Zhili

Inventor before: Tan Hongchun

Inventor before: Meng Qingquan

Inventor before: Gu Zongyun

Inventor before: Geng Yingbao

TA01 Transfer of patent application right

Effective date of registration: 20180208

Address after: 518000 Nanshan District, Guangdong Tsinghua Science and technology information research building, Shenzhen

Applicant after: Liu Kai

Address before: 230000 Mei Shan Road, Shushan District, Hefei, Anhui Province, No. 70

Applicant before: Tan Hongchun

GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20180913

Address after: 518000 Beek, science and technology building, room 9, No. 9, research road, high tech Zone, Nanshan District, Shenzhen, Guangdong.

Patentee after: Shenzhen Yiyuan Intelligent Technology Co., Ltd.

Address before: 518000 scientific research building, Tsinghua information port, Nanshan District, Shenzhen, Guangdong 107

Patentee before: Liu Kai

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180316

Termination date: 20181219