Implementing the NLP Infrastructure for Greek Biomedical Data Mining

iatrolexi.gr

Implementing the NLP infrastructure for Greek Biomedical Data Mining Aristides Vagelatos RACTI 10 Davaki str. GR-11526 Athens, Greece vagelat@cti.gr Elena Mantzari Giorgos Orphanos Christos Tsalidis Chryssoula Kalamara Christos Diolis Neurosoft S.A. Neurosoft S.A. Neurosoft S.A. Athens’ Euroclinic RACTI 32 Kifisias Ave. 32 Kifisias Ave. 32 Kifisias Ave. 9 Athanasiadou str. 10 Davaki str. GR-15125 Athens GR-15125 Athens GR-15125 Athens GR-15121 Athens GR-11526 Athens, Greece Greece Greece Greece Greece emantz@tee.gr orphan@neurosoft.gr tsalidis@neurosoft.gr kalamach@otenet.gr diolis@cti.gr Abstract This paper presents the design and implementation of terminological and specialized textual resources that are produced in the framework of the Greek national R&D project “IATROLEXI”. The aim of the project is to create the critical infrastructure for the Greek language, i.e. linguistic resources and tools, to be used in high level Natural Language Processing (NLP) applications in the domain of Biomedicine. The project builds upon existing resources that have been developed by the project partners, i.e. a Greek morphological lexicon of about 100.000 words, and language processing tools such as a lemmatiser and a morphosyntactic tagger, and it will further develop new resources such as a specialised corpus of biomedical texts and an ontology of medical terminology. Keywords Ontologies, data mining, biomedical terminology. 1. Introduction The amount of biomedical information which is contemporarily produced by the medical society, i.e. health institutions, educational organisms and research institutes, has been enormously increased. This information which is mainly available in digital form and mostly accessible through Internet has been characterized by Eysenbach [4] as “information jungle” of narrative form, due to its enormous size and its unstructured form. However, information is only valuable to the extent that it is accessible, easily retrieved and relevant to the users' interests. The growing volume of data, the lack of structured information, and the information diversity have made information and knowledge management a real challenge towards the effort to support the medical society. It has been realised that added value is not gained merely through larger quantities of data, but through structuring of the data into knowledge for more sophisticated access to the required information. In order to access information, medical practitioners, researchers, patients, or other interesting parts in the medical market are usually provided with unsophisticated tools, such as simple search engines which are seriously limited by their reliance on keyword-matching. These search mechanisms are unable to find information described by different terms and they often return information results that use the same words with a different meaning, while they are unable to combine information from diverse sources. These problems can be alleviated if search engines no longer search for matching keywords but for matching semantic concepts that underlie the information in web pages. The lack of high level language tools to facilitate accuracy and precision in accessing and retrieving the relevant information is harder in a less-used language like Greek, due to the limited research funding and the restricted interest by the medical industry, and also due to the intrinsic particularities of the Greek language morphology. The project IATROLEXI1 (http://www.iatrolexi.gr) aims at the creation of the critical infrastructure for the Greek language which will constitute the groundwork for advanced NLP applications in the domain of biomedicine: i.e. text indexing, information extraction and retrieval, data mining, question answering systems, etc. To accomplish this, a number of essential tools and resources for the Greek language are under construction, which will allow better management and processing of the digitally encoded information in the biomedical field. More specifically, the expected output of the project are tools that will address directly the final user of the biomedical information, such as a spelling checker of Greek medical terms as well as a specialized search engine, and also tools that will mainly assist processing of the Greek biomedical texts and improve search and retrieval of biomedical data, such as a tagger for morphosyntactic annotation appropriately tuned to the particularities of the biomedical sublanguage and an ontology of the Greek biomedical terminology. This paper is structured as follows: Section 2 presents some background information on Natural Language Processing in the biomedical domain towards data mining. 1 "IATROLEXI" project is being partially funded by General Secretariat of Research & Development (project code: 9) within Measure 3.3 of "Information Society" Operational Program. Section 3 gives a description of IATROLEXI project’s goals and presents the environment and their main components. Finally in section 4 the conclusions are given. 2. Background Natural Language Processing (NLP) has been applied to biomedical text for decades, in fact, soon after computerized clinical record systems were introduced in the mid 1960s [2]. The computerization of clinical records increased the tension in the field of medical reporting and recording. In [12] a broad overview of NLP in medicine can be found, with special attention to milestone projects and systems such as the Linguistic String Project, Specialist, Recit, MedLEE, and Menelas. The overview of [6] also concentrates on NLP with clinical narrative, giving a short summary of earlier projects and the state of the art at that point in time. In recent years, research has continued to focus on text indexing and document coding to allow powerful, meaningful retrieval of documents. Document indexing uses terms from a glossary or ontology (MeSH, Gene Ontology, Galen4) or text features such as words or phrases. Most NLP systems in clinical medicine work with text from patient records such as discharge summaries and diagnosis reports. NLP systems in bioinformatics use mostly articles or abstracts from the scientific medical literature. Differences between these two types of text affect the choice of techniques for NLP. Biomedical literature is carefully constructed and meticulously proofread, so spelling errors and incomplete parses are less of a problem. On the other hand, new concepts may be introduced, such as a newly unraveled molecule. concepts, the task of adding entries to the lexicon is considerable demanding [11]. The National Library of Medicine has undertaken a large-scale effort to facilitate access to biomedical information. The development of the UMLS (http://umlsinfo.nlm.nih.gov/) and the release of the SPECIALIST lexicon will substantially benefit NLP systems. A UMLS concept is given a unique identifier, and all synonymous concepts have the same identifier. This feature provides a substantial body of knowledge that NLP systems need: link words in text to a controlled vocabulary (the UMLS or to one of the other source vocabularies). The UMLS also has a semantic network and assigns semantic categories to all concepts. For example, “fever” is assigned the category SIGN/SYMPTOM. The categorization provides the semantic knowledge needed by NLP systems to identify relevant units of information. The SPECIALIST Lexicon, which has over 250,000 entries, assigns syntactic categories to words and phrases in biomedical text. The lexicon is not only useful for NLP extraction tasks, but also for indexing and vocabulary development. Other nomenclatures are also important knowledge sources. Some work has been published investigating the use of SNOMED (http://www.snomed.com/) and ICD10 (http://www.who.int/classifications/icd/en/) as knowledge sources for lexical work. Like the UMLS, these nomenclatures are also effective for identifying relevant clinical terms and semantic categorization. Both SNOMED and ICD10 are particularly useful to groups involved in multilingual work because they are available in other languages and because the codes provide a way to link a concept to a similar concept in other languages. Ontologies are considered to be a fundamental prerequisite for advanced language processing, knowledge management and the Semantic Web, since they offer the mechanisms for the formal representation and the description of the concepts in a given domain[1], [13]. Typically, an ontology identifies classes of objects that are important in a domain and organises these classes in a hierarchy. Each class is characterised by some properties and is related to other classes or to elements of other classes through a number of significant relations. The predominance of ontologies as knowledge sources in information processing lies on their power to represent knowledge in a model that is comprehensible equally by either humans or machines, thus assisting communication between human agents, achieving interoperability among computer systems, and advancing the systems' quality performance on indexing, processing, retrieval and extraction of required information. Other types of knowledge sources needed by NLP systems, such as grammars, and domain models, are not available to NLP researchers. These are usually developed by each individual research group, and are more complex and interrelated than nomenclatures. They are also typically very difficult to adapt to different systems. A significant amount of work in developing an NLP system concerns extending lexical knowledge. Since there is a very large number of words and phrases associated with clinical To the best of our knowledge there are no Greek electronic medical corpora exist, structurally or linguistically annotated. Thus within the projects' framework, a medical 3. Project’s main goals: Resources and tools In order to apply data mining techniques Greek biomedical texts, it is inevitable that a number of text analysis tools and linguistic resources need to be developed. These tools constitute the basis for any application regarding data mining, NLP, indexing, etc. In this chapter the main goals of the project are discussed as well as the environment and its constituent parts are presented. 3.1 Corpus of biomedical texts corpus is under construction, mainly from the literature that is already published on the web. Balance and representativeness are the main requirements for corpus design. According to these requirements, the scope was to develop a Greek corpus of written texts, coming from all different domains of biomedicine. The corpus should contain documents from as many biomedical text fields as possible. Recent research makes clear that full-text articles are preferable from abstracts, if we want to build high-recall text mining systems [3]. Therefore, it seems clear that a corpus that is to be used for biomedical text mining systems should include full text and not samples, which we seriously took under consideration in the development of the IATROLEXI corpus. Corpus annotation is the distillation procedure adding (or extracting the) value to the texts. The annotation process of the IATROLEXI corpus involves almost all NLP components adopted, constructed or under construction in the framework of IATROLEXI: a tokeniser, a sentence splitter, a morphosyntactic tagger, a biomedical gazetteer, a multi-word term recogniser, and an ontology-based semantic tagger. Due to time limitations we considered only documents from Internet sites, thus we recorded portals or other websites that included directories of health-related information. We started our investigation from websites of research and academic institutions, e.g.: MedNet Hellas – http://www.mednet.gr (a Greek Medical Network), Greek National http://www.ekt.gr, Documentation Library of University http://www.lib.uom.gr of Center – Macedonia – The above sites proved to be very helpful, since they contained a rather exhaustive list of directories of Greek biomedical journals. Next, we utilised popular search engines in order to identify additional websites that might contain interesting texts, e.g.: Google – http://www.google.com Yahoo – http://www.yahoo.gr Live Search – http://search.live.com Through these search engines, we mainly acquired the web addresses of Greek medical conferences that were not listed in the directories mentioned above. Overall, forty websites were identified to contain appropriate medical documents for IATROLEXI. So far, the total number of documents is touching 6,250 (about 11.5 million words). 3.2 Creation, enhancement and/or adaptation of existing resources and tools A number of resources have been created, enhanced and/or adapted in order to constitute an environment supporting a) the discovery of syntactic patterns that can be candidate multiword terms, b) the construction of the ontology, c) the detection of medicine terms in the documents of the corpus d) semantic indexing of the documents. The core mechanism for the most of the software components working on the documents of the corpus is annotation. The software implementation platform of all NLP components is Java v 1.5. The operational environment integrating and orchestrating the software components working with annotations is the Apache UIMA platform. UIMA stands for Unstructured Information Management Architecture; it was developed by teams from IBM Research and IBM Software Group and is now released to the open-source community as an Apache project. The main components constructed or are under construction, participating in the analysis, annotation and indexing of the documents, along with the resources they use, are presented in the following sections. 3.2.1 Document conversion The documents collected from the internet are either in html or in pdf format. On the other side all the tools process documents in a common format which is pure text decorated with annotations. The UIMA terminology for this common format is CAS (Common Annotation Structure). To satisfy the requirement of feeding the annotation process with documents of a common format, we decided this format to be plain text, for the reason that only the textual content of the documents is of interest; scripting, styling, formatting and page rendering information had to be filtered out. Therefore, we developed two document converters: an html-to-txt converter and a pdf-to-txt converter. The html-to-txt converter incorporates the functionality of the CyberNeco HTML Parser along with the xpath facilities provided by Apache Xalan. To convert an html document to plain text, it is first parsed by the HTML parser and an HTML DOM (Document Object Model) is constructed into memory; noisy elements, such as <style>, <script> and <applet>, are filtered out during parsing. Then, the textual content is selected from the DOM with the help of xpath queries. The pdf-to-txt converter is based on the PDFBox library. The main problems we faced during pdf-to-txt conversion were: a) the incorrect interpretation of Greek characters, especially for pdf documents produced on Mac systems and b) the injection of newline (‘\n’) characters in unwanted positions, even in the middle of words. The output of document conversion is one CAS per input document, which contains the plain text extracted from the document along with global annotations 3.2.2 Tokenisation and sentence splitting Content analysis starts with tokenization, i.e. conversion of the character stream to a token stream. Tokenisation is carried out in two steps. In the first step, a text stream is roughly converted into a token stream based on white space delimiters and some symbol characters. At the same time, the orthography of each token is recorded. By “token orthography” we mean the classes of the constituent characters, e.g. is a Greek-letter-lower-case token, Disease is an English-letter-first-capital token, H.I.V. is an English-letter-all-capital + middle-dots + ending-dot token. In the second step, the token stream passes through a refinement module. Tokens of a specific orthography may further split into two or three tokens. For example, a token that ends with a comma or question mark or exclamation mark or colon or semi-colon will split into two tokens; a token that starts with a quote and ends with a quote will split into three tokens. Special care is taken for tokens that end with a dot, so as to decide whether this dot is part of the token (e.g. the token is an abbreviation) or the dot is a punctuation mark (i.e. a full stop). Among the various tests performed towards the disambiguation of the ending dot, the one worth-mentioning (because it covers the ninety percent of the cases) refers to tokens where all the characters before the dot are Greek letters. If these letters are more than two and constitute a valid Greek word, then the token splits into two tokens: a Greek-word token and a full-stop token. The validity of a Greek word is examined through lookup in Neurosoft’s Morphological Lexicon, a broad-coverage lexicon of Modern Greek (~90.000 words, ~1.200.000 word-forms). Sentence splitting examines the token stream produced from the second step of tokenization and locates tokens that traditionally play the role of sentence delimiters, i.e. full stops, question marks, exclamation marks and dot-ending tokens. It then examines the local context of the candidate sentence delimiters and sets the sentence boundaries on tokens that are proved to be real sentence delimiters. for verbs. The first word-form of a morphological lemma, the headword, plays the role of lemma representative; referring to the headword is the same as referring to the lemma. As the morphological lexicon is monolingual, morphosyntactic annotations are assigned only to Greek words. Each Greek-letter token identified during tokenization is assumed to be a Greek word-form. Every word-form is looked-up in the morphological lexicon. The possible outcomes are three: a) the word-form is found in one morphological lemma, b) the word-form is found in two or more morphological lemmas and c) the word-form is not found. Since the goal of morphosyntactic analysis is to assign unambiguous morphosyntactic annotations to wordforms, outcomes (b) and (c) are problematic; outcome (b) introduces ambiguity while outcome (c) introduces failure. If the morphological lemmas of outcome (b) have different part-of-speech values (which is the most frequent), the selection of the appropriate lemma can be interpreted as the selection of the appropriate part-of-speech value. Also, to overpass the failure of outcome (c), the only way is to guess the values of as many morphosyntactic attributes as possible – at least the part-of-speech. Part-of-speech disambiguation and guessing is carried out with the help of decision trees through examination of the local context (see [10]), achieving an accuracy of ninety-seven percent in part-ofspeech disambiguation and eighty-nine percent in part-ofspeech guessing. 3.2.4 Biomedical word identification The next step was to mark words that belong to the biomedical domain. This marking was crucial for the next processing steps. Every single biomedical word may be a biomedical term by itself (which can be certified through look-up in a biomedical dictionary or ontology) or may be part of a multi-word biomedical term. Biomedical words are identified with the help of a gazetteer that currently contains ~52,000 biomedical word-forms (that correspond to ~9,000 biomedical words). The contents of the gazetteer partly come from the Morphological Lexicon and partly were collected through a process described in section 3.3. 3.2.3 Morphosyntactic tagging 3.2.5 Multi-word term recognition Morphosyntactic tagging is based on the Morphological Lexicon. The contents of the lexicon are organised into morphological lemmas. Each lemma contains all the wordforms of a Greek word accompanied by the values of their morphosyntactic attributes. The basic morphosyntactic attribute of a word-form is its part-of-speech. The value of part-of-speech determines what other morphosyntactic attributes characterise a word-form: gender, number and case for nouns, adjectives, articles, pronouns and present perfect participles; voice, tense, mood, number and person The multiword recognition mechanism is one of the advanced outcomes of the project. It is based on a rule description system where every rule recognizes a syntactic pattern in the input text. Rules can be applied in a consecutive and aggregative manner. Consecutive means that rules are applied in the same sequence of annotated text spans repeatedly i.e. as far as we can apply rules and the size of the text span’s sequence is decreased, the processing continues. Aggregative means that a set of rules can be applied after another set of rules. The format of the rules resembles the context free BNF rules where every symbol is presented as a set of feature value pairs. The grammar is strongly typed in the sense that every feature has a type which specifies the values of its instances in the rules. The syntax of the rules is depicted in the following sample grammar consisting of two rules: options: grammar = "Article"; maxdepth = "8"; types: ATTRS is set of external "com.neurolingo.NLP3.morphology.IMorphology"; features: MORPHO is object; ONTO is object; functions: Contains in module Morpho of file internal is object of (ATTRS) as object; GNC_Agreement in module Agreement of file internal is predicate of (number, ATTRS) as object; GNC_Reduction in module Reduction of file internal is rule of (number, ATTRS, ATTRS, text) as object; //GNC_Reduction is called in order to create the // reduced predicate. The arguments are: // pivot: number is the pivot predicate // (in our case the second). // select_attrs: The attributes of the pivot // element (in case that the pivot // predicate has more than one alts) // common_attrs: The result attributes (or the // common one) The Gender, Case & Number // attributes are taken from pivot // predicate. The remaining attributes // are these attributes // lemma_frmt: Is an format string describing // how the headword (lemma) of the // multiword text span will be computed rules: /* A_R1 */ [MORPHO=GNC_Reduction(2,[N],[N],"%2")] => \ [MORPHO=Contains([ART]), name “maxdepth” specifies the number of levels that operators like * (Kleene star) and + will be expanded. types presents new derived types that features can use. Our formalism uses the primitive types number and text and the derived types of set and value. In our sample we can see another important characteristic of the formalism, the ability to communicate with the implementation Java environment. The members of set ATTRS is defined in the interface class identified with the full path name "com.neurolingo.NLP3. morphology.IMorphology". This way we can use the morphological attributes of our lexical resources in grammar rules and in the software components we develop without the need to have duplicate definitions. features defines the names and types of the features we are going to use in the grammar rules. All features that are going to define grammar symbols in the following rules must be defined in this section. There are no untyped features, as we already mentioned, and the system accomplishes a strong type checking of how values and types are used in the rules. In our sample we define two features with names MORPHO and ONTO which both are of type object. This is another extension characteristic of our formalism permitting incomplete or generic types that are defined in the Java environment. The way these types are instantiated and used in the rules will be shown in the following paragraphs. functions section defines functions that can appear in expressions specifying the values of features in the rules. There are four types of functions 1. Object functions can appear in the body symbols (predicates) of a rule. There are object (instance) methods (in Java parlance) that can take a list of parameters and return a value assigned in a feature. Function Contains in the sample grammar takes as input parameter a set of attributes and its return type is the superclass type object. The module Morpho must be known to the environment executing these rules. This module contains the definition of the object’s actual type where this function is encapsulated. The system can accept external modules placed in jar files and loaded dynamically where needed permitting the extension or incorporation of the rules component with external systems. 2. Predicate functions are static methods of a Java class. They can be presented only in the body predicates. Except from the defined parameter list, these functions enriched with an extra parameter. This parameter is the table of all feature value pairs assigned to the predicate they appear in. The [MORPHO=Contains([N]), / ; /* A_R2 */ [MORPHO=GNC_Reduction(2,[ADJ], [ADJ],"%2")] => \ [MORPHO=Contains([ART]), [MORPHO=Contains([ADJ]), / ; Figure 1 Sample grammar The parts presented in a rules file are: options affects the way the rules will be processed and used. In the sample grammar of Figure 1 we set two options. The name of the grammar rules “Article” which specifies the name of the annotator that will apply these rules to text spans. The other option with function GNC_Agreement checks the agreement of Gender, Number and Case of the neighbor symbols found in input. The first parameter appearing in its definition is a number denoting the way this agreement must be checked. We can specify if we want full agreement in Gender, Number and Case or partial agreement in Gender and Number, in Number and Case, only in Case, etc. The second parameter specifies a set of attributes that the symbol must possess as an extra matching condition. 3. 4. Rule functions are used in the head symbol (predicate) of a rule and are mapped to static methods of a class. They take as extra input parameter, a representation of the reduction i.e. the predicates recognized accompanied with the values of the features they contain. Function GNC_Reduction is used in order to compute the morphological attributes and the headword of the multiword reduced text span of the rule. The interpretation of the parameters, appearing in line comments, follows the definition. Feature functions are the fourth type of functions. They appear in the head predicates and mapped also to static functions of a class. They take as input parameters a list of feature names. These feature names must appear in the body predicates and when called by the system all values of these body features have been evaluated. rules section contains the actual grammar rules. Every rule contains a head predicate and one or more body predicates. Head is defined in terms of the body predicates and this means that if a sequence of symbols (text spans) matches the body predicates then we can reduce these predicates to the one of the body. Rules are independent of each other. Their order does not matter the way they are evaluated. The system can use different heuristics about which rule to choose for reduction in case that multiple rules match an input sequence of symbols. The current applied technique chooses the longest (in terms of size of predicates in the body of a rule) rule. The symbols ‘\’ and ‘/’ specify the left and right context of a reduction. We can have a list of predicates at the left of the ‘\’ symbol denoting the left context of the reduction. The meaning of the left context is that we expect to match all the predicates presented in the left context but we will not use them in the reduction. The same holds for the right context. Only the predicates presented between the ‘\’ and ‘/’ symbols will be reduced. Parentheses can also be used to group sequence of predicates. A body predicate or group can be right followed by a repeating operator of the ‘*’, ‘+’, {m,n}. The meaning of ‘*’ is zero or more instances of the predicate or group existing in the left of the operator must be matched. The ‘+’ operator is interpreted as one or more instances while the expression {m,n} means that we expect to match at least m and an most n instances. We constructed a parser based on ANTLR. The parser takes as input a unification grammar (written according to the already specified formalism) and produces a compiled representation of the rules. The actual application of the rules is performed by an execution engine, which loads the compiled rules at start-up (i.e. the parser is the execution engine plus the parsing model). The execution engine incorporates a prototype unification algorithm for the efficient handling of multi-valued features, which facilitates the treatment of the inherent morphosyntactic ambiguity (for more on unification, see [8]). 3.2.6 Ontology-based semantic tagging According to Kiryakov et al. [7], there are a number of basic prerequisites for the representation of semantic annotations: an ontology (or taxonomy, at the least), defining the entity classes; entity identifiers, which allow those to be distinguished and linked to their semantic descriptions; a knowledge base with entity descriptions. As the aim of IATROLEXI is to build a generic and application independent infrastructure for the language processing of the Greek biomedical data, the project team opted for the adoption of the UMLS knowledge resources, namely UMLS Metathesaurus (MT) and UMLS Semantic Network (SN). Adopting UMLS semantic network as an initial top-level ontology, and mapping it into Greek, we gain access to the conceptual information for some thousands of biomedical terms. Up to now, the whole number of the SN semantic types and semantic relations have been translated into Greek, while both English and Greek versions of the SN have been fed into Protégé for further processing and evaluation. By semantic tagging in the context of IATROLEXI we mean providing automatic annotations with references to the semantic types of the Greek version of the UMLS Semantic Network. 3.3 A methodology for the development of a biomedical ontology The methodology will combine bottom-up and top-down approaches for the determination of the semantic/conceptual framework to be used for the knowledge representation of the biomedical domain (i.e. selection of a conceptual hierarchy, semantic classes, relations between concepts, etc.) and the selection of the relevant biomedical terms that designate and instantiate the concepts of those hierarchy nodes. The UMLS semantic network will be used as a frame basis for expressing the IATROLEXI's ontology. The construction and the gradual enrichment of the ontology will be accomplished through the following steps: 1. determination of an initial up-level taxonomy which will be gradually enriched with lower level information on concepts and terms, 2. collection of specialized texts in the biomedical domain, 3. semi-automatic excerption of the texts' terminology, 4. determination of the morpho-syntactic rules that describe the structures in which the relevant terms are realised, 5. extraction of candidate terms, 6. enrichment of ontology with selected terms and relations, and 7. a loop of steps 4, 5 and 6, for as many times as needed. 4. Conclusions NLP infrastructure is a key element in the further development of informatics applications in several areas, such as data mining, knowledge-based decision support, terminology management, and systems interoperability and integration. A significant body of work now exists that reports on experiences with various approaches in important problem areas of research. On the contrary in the biomedical field and especially for the Greek language, there is not much work implemented. Currently, a part of our efforts focuses on the completion of the multi-word term recogniser. In subsection 3.2.5 we presented the extraction of candidate multi-word terms from the corpus, based on linguistic knowledge. To automatically decide upon real multi-word terms, we have to exploit some type of statistical evidence which will help us to compute a term-validity metric (e.g. the C/NC-value metric, see [5]). Project IATROLEXI aims to cover this certain gap by developing a number of NLP resources as well as application for the scientific community. On the one hand the scientist may use the outcomes of the project in his/her own way towards his/her special research needs. On the other hand, the user may look for information in texts or make searches with specific terms or combination of terms or relations that relate terms to each other. We envisage (at least) three applications of the bilingual biomedical dictionary: 1. Semantic tagging. Any term found in the dictionary can receive an annotation that encodes its semantic type and thus links the term with the UMLS Semantic Network. 2. Bilingual term searching. A Greek term can be translated to its American equivalent(s) and then searched in American texts, and vice-versa. 3. Ontology-based query expansion. A query that contains a term of a specific semantic type can be enriched with other terms of the same semantic type or with terms of narrower semantic types. References [1] Alexander, U. (2006). Methods in Biomedical Ontology. Journal of Biomedical Informatics, Vol. 39 (2006) 252--266 [2] Bruijn, B., Martin J: Getting to the (c)ore of knowledge: mining biomedical literature. Int. Journal of Medical Informatics, Vol. 67. (2002) 7—18 [3] Cohen B., L. Fox, P. Ogren and L. Hunter (2005) Corpus Design for biomedical natural language processing. ACLISMB Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics, Detroit. [4] Eysenbach, G.: The Semantic Web and healthcare consumers: a new challenge and opportunity on the horizon?. International Journal of Healthcare Technology and Management, Vol. 5, No.3/4/5 (2003) 194 – 212 [5] Frantzi, K. T. and S. Ananiadou (1999) ‘The C-value/NCvalue domain-independent method for multi-word term extraction’. Journal of Natural Language Processing, Vol. 6, No. 3, 145-179. [6] Friedman C., Gripcsak G.: Natural language Processing and its future in medicine. Acad. Med., Vol. 74, No.8 (1999) 890 – 895 [7] Kiryakov, A., B. Popov, I. Terziev, D. Manov and D. Ognyanoff (2003) Semantic Annotation, Indexing and Retrieval. ISWC’ 2003, Florida. [8] Knight, K. (1989) ‘Unification: A multidisciplinary survey’. ACM Computing Surveys, 21(1), pp. 93-124. [9] Kokkinakis, D.: Developing resources for Swedish BioMedical text mining. Proceedings of the 2nd Int. Symposium on Semantic Mining in Biomedicine (2006), Jena, Germany. [10] Orphanos G. and D. Christodoulakis (1999) Part-of-speech Disambiguation and Unknown Word Guessing with Decision Trees. 9th EACL Conference, Bergen, Norway. [11] Rosse, M.: A Reference Ontology for Biomedical Informatics: The Foundational Model of Anatomy. Journal of Biomedical Informatics, Vol. 36 (2003) 478--500 [12] Spyns, P.: Natural Language Processing in medicine: an overview. Methods Inf. Med. Vol. 35 (1996) 285--301 [13] Vickery, C.: Ontologies. Journal of Information Science. Vol. 23 (1997) 277--28

Log In

Implementing the NLP Infrastructure for Greek Biomedical Data Mining

Related papers

Related papers

Related topics