Abstract
Currently, the best algorithms for predicting transcription factor binding sites in DNA sequences are severely limited in accuracy. There is good reason to believe that predictions from different classes of algorithms could be combined to improve prediction quality. In this paper, we apply single-layer networks, rule sets, support vector machines and the AdaBoost algorithm to the predictions of 12 key real-valued algorithms. Furthermore, we use a ‘window’ of consecutive results as the input vector, so that each prediction is placed in the context of its neighbours. We improve the classification result with the aid of under- and over-sampling techniques. We find that support vector machines and the AdaBoost algorithm outperform both the original individual algorithms and the other classifiers employed in this work. In particular, they give a better trade-off between recall and precision.
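The windowing idea and the recall/precision measures mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the zero padding at the sequence edges and the window half-width are assumptions chosen for illustration, and the abstract does not state the window size actually used.

```python
def window_features(scores, half_width):
    """Build one input vector per sequence position.

    scores: list of per-position lists, one real value per base algorithm
            (12 values per position in the paper; any number works here).
    half_width: hypothetical parameter; the window covers positions
                i - half_width .. i + half_width.
    Positions outside the sequence are zero-padded (an assumption).
    """
    n_alg = len(scores[0])
    pad = [0.0] * n_alg
    vectors = []
    for i in range(len(scores)):
        row = []
        for j in range(i - half_width, i + half_width + 1):
            row.extend(scores[j] if 0 <= j < len(scores) else pad)
        vectors.append(row)
    return vectors


def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = binding site)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

With two base algorithms and `half_width=1`, each position yields a vector of 3 × 2 = 6 values. Such vectors would then be passed to a meta classifier, for example a support vector machine, whose output is scored with `precision_recall`.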
Cite this article
Sun, Y., Robinson, M., Adams, R. et al. Integrating genomic binding site predictions using real-valued meta classifiers. Neural Comput & Applic 18, 577–590 (2009). https://doi.org/10.1007/s00521-008-0204-4
DOI: https://doi.org/10.1007/s00521-008-0204-4