Evolutionary learning of syntax patterns for genic interaction extraction
Proceedings of the 2015 Annual Conference on Genetic and Evolutionary …, 2015•dl.acm.org
There is an increasing interest in the development of techniques for automatic relation
extraction from unstructured text. The biomedical domain, in particular, is a sector that may
greatly benefit from those techniques due to the huge and ever increasing amount of
scientific publications describing observed phenomena of potential clinical interest. In this
paper, we consider the problem of automatically identifying sentences that contain
interactions between genes and proteins, based solely on a dictionary of genes and proteins …
extraction from unstructured text. The biomedical domain, in particular, is a sector that may
greatly benefit from those techniques due to the huge and ever increasing amount of
scientific publications describing observed phenomena of potential clinical interest. In this
paper, we consider the problem of automatically identifying sentences that contain
interactions between genes and proteins, based solely on a dictionary of genes and proteins …
There is an increasing interest in the development of techniques for automatic relation extraction from unstructured text. The biomedical domain, in particular, is a sector that may greatly benefit from those techniques due to the huge and ever increasing amount of scientific publications describing observed phenomena of potential clinical interest. In this paper, we consider the problem of automatically identifying sentences that contain interactions between genes and proteins, based solely on a dictionary of genes and proteins and a small set of sample sentences in natural language. We propose an evolutionary technique for learning a classifier that is capable of detecting the desired sentences within scientific publications with high accuracy. The key feature of our proposal, that is internally based on Genetic Programming, is the construction of a model of the relevant syntax patterns in terms of standard part-of-speech annotations. The model consists of a set of regular expressions that are learned automatically despite the large alphabet size involved. We assess our approach on two realistic datasets and obtain 74% accuracy, a value sufficiently high to be of practical interest and that is in line with significant baseline methods.