Abstract
The encoding method has a direct effect on the quality and the representation of the knowledge discovered by data mining systems. Biological macromolecules are encoded as strings of characters, called primary structures. Because data mining systems usually use relational tables to encode data, these strings must be re-encoded and transformed into relational tables. In this paper, we present a comparative study of the existing static encoding methods, which are based on biologists' know-how, and our new dynamic encoding method, which is based on the construction of Discriminant and Minimal Substrings (DMS). Different classification methods are used to carry out this study. The experimental results show that our dynamic encoding method is more efficient than the static ones for encoding biological macromolecules within a data mining perspective.
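As a rough illustration of what re-encoding strings into a relational table can look like, the following Python sketch builds one 0/1 column per substring feature. It is a simplified toy, not the authors' DMS construction algorithm. It assumes that a substring is discriminant when it occurs in some string of one family and in no string of the other, and minimal when it contains no shorter discriminant substring. The function names and the `max_len` parameter are hypothetical.

```python
def substrings(s, max_len):
    """Return the set of all substrings of s with length 1..max_len."""
    return {s[i:i + k]
            for k in range(1, max_len + 1)
            for i in range(len(s) - k + 1)}


def discriminant_minimal(family, others, max_len=3):
    """Toy stand-in for DMS construction (assumed, simplified definition).

    Keep substrings seen in `family` but in no string of `others`,
    then drop any that contain a different kept substring.
    """
    in_family = set().union(*(substrings(s, max_len) for s in family))
    in_others = set().union(*(substrings(s, max_len) for s in others))
    discriminant = in_family - in_others
    minimal = {
        d for d in discriminant
        if not any(e != d and e in d for e in discriminant)
    }
    return sorted(minimal)


def encode(sequences, features):
    """Build a relational table: one row per sequence, one 0/1 column per feature."""
    return [[int(f in s) for f in features] for s in sequences]


# Made-up example sequences, used only to exercise the functions.
family_a = ["ACGT", "ACGA"]
family_b = ["TTGA", "TGAT"]
features = discriminant_minimal(family_a, family_b)
table = encode(family_a + family_b, features)
```

Each row of `table` can then be handed to any classifier that expects fixed-length attribute vectors, which is the role the relational table plays in the abstract.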
Author information
Mondher Maddouri received a B.S. degree in mathematics and physics in 1990, an M.S. degree in computer engineering in 1994, and a Ph.D. degree in computer science in 2000, all from the Faculty of Sciences of Tunis, Tunisia. He is currently an associate professor in the Computer Science Department at the National Institute of Applied Sciences and Technologies, Tunis, Tunisia. His research interests are machine learning, knowledge discovery and data mining, and computational molecular biology.
Mourad Elloumi received a B.S. degree in mathematics and physics in 1984 and an M.S. degree in computer engineering in 1988, both from the Faculty of Sciences of Tunis, Tunisia. He also received an M.S. degree in computer science in 1989 and a Ph.D. degree in computer science in 1994, both from the University of Aix-Marseilles III, France. He is currently an associate professor in the Computer Science Department at the Faculty of Economic Sciences and Management of Tunis, Tunisia. His research interests are computational molecular biology, algorithmics, and knowledge discovery and data mining.
Cite this article
Maddouri, M., Elloumi, M. Encoding of primary structures of biological macromolecules within a data mining perspective. J. Comput. Sci. & Technol. 19, 78–88 (2004). https://doi.org/10.1007/BF02944786