Abstract
In many domains, data objects are described by a large number of features (e.g., microarray experiments or spectral characterizations of organic and inorganic samples). A pipelined approach that combines two clustering algorithms with rough sets is investigated for discovering important combinations of attributes in high-dimensional data. The Leader algorithm and several k-means algorithms are used as fast procedures for simplifying the attribute sets of the information systems presented to the rough set algorithms. The data, described in terms of these fewer features, are then discretized with respect to the decision attribute according to different rough-set-based schemes. From the discretized data, reducts and their derived rules are extracted and applied to test data in order to evaluate the resulting classification accuracy in cross-validation experiments. The data mining process is implemented within a high-throughput distributed computing environment. Nonlinear transformations of attribute subsets that preserve the similarity structure of the data were also investigated. Their classification ability, and that of the attribute subsets obtained after the mining process, were described in terms of analytic functions obtained by genetic programming (gene expression programming) and simplified using computer algebra systems. Visual data mining techniques using virtual reality were used to inspect the results. This approach was explored in a series of experiments using leukemia, colon cancer, and breast cancer gene expression data, which led to small subsets of genes with high discrimination power.
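The attribute-simplification step mentioned above can be illustrated with a minimal sketch of a Leader-style single pass over attributes (columns) rather than objects. The similarity measure (absolute Pearson correlation), the threshold value, the first-match assignment rule, and all function names are assumptions made for illustration; the abstract does not specify them, so this should not be read as the chapter's actual implementation.

```python
import math


def abs_pearson(a, b):
    """Absolute Pearson correlation of two columns (an assumed similarity measure)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant column is treated as similar to nothing
    return abs(cov) / math.sqrt(var_a * var_b)


def leader_attributes(columns, threshold, similarity=abs_pearson):
    """One pass over the attribute columns.

    Each attribute joins the first existing leader whose similarity reaches
    the threshold; otherwise it becomes a new leader. Returns the leader
    indices and, for every attribute, the index of its leader.
    """
    leaders = []
    assignment = []
    for j, col in enumerate(columns):
        chosen = None
        for leader in leaders:
            if similarity(col, columns[leader]) >= threshold:
                chosen = leader
                break
        if chosen is None:
            leaders.append(j)
            chosen = j
        assignment.append(chosen)
    return leaders, assignment


# Toy data: four attributes measured on four objects (hypothetical values).
columns = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],  # proportional to attribute 0
    [4.0, 1.0, 3.0, 2.0],
    [8.0, 2.0, 6.0, 4.0],  # proportional to attribute 2
]
leaders, assignment = leader_attributes(columns, threshold=0.9)
print(leaders, assignment)
```

In a pipeline of the kind described, only the leader attributes would be passed on to discretization and reduct computation, which is what keeps the later rough set stages tractable. The result of a single pass depends on the order in which attributes are visited and on the chosen threshold.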
Copyright information
© 2007 Springer Berlin Heidelberg
About this chapter
Cite this chapter
Valdés, J.J., Barton, A.J. (2007). Finding Relevant Attributes in High Dimensional Data: A Distributed Computing Hybrid Data Mining Strategy. In: Peters, J.F., Skowron, A., Düntsch, I., Grzymała-Busse, J., Orłowska, E., Polkowski, L. (eds) Transactions on Rough Sets VI. Lecture Notes in Computer Science, vol 4374. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71200-8_20
DOI: https://doi.org/10.1007/978-3-540-71200-8_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71198-8
Online ISBN: 978-3-540-71200-8
eBook Packages: Computer Science (R0)