Abstract
Searching for good discriminative gene sets (DGSs) in microarray data is important for many problems, such as precise cancer diagnosis, correct treatment selection, and drug discovery. Small, good DGSs can help researchers eliminate “irrelevant” genes and focus on “critical” genes that may be used as biomarkers or that are related to the development of cancers. In addition, small DGSs do not impose demanding requirements on classifiers, e.g., high-speed CPUs or large memory. Furthermore, if DGSs are used as diagnostic measures in the future, small DGSs will simplify the test and therefore reduce its cost. Here, we propose an algorithm for searching for DGSs, which we call active mining discriminative gene sets (AM-DGS). The search scheme of AM-DGS is as follows. A gene with a large t-statistic is assigned as the seed, i.e., the first feature of the DGS. We then classify the samples in the data set using a support vector machine (SVM). Next, we add to the DGS the gene with the greatest power to correct the misclassified samples, that is, the gene with the largest t-statistic evaluated on the misclassified samples only. We keep adding genes to the DGS according to the SVM’s misclassified samples until no error remains or overfitting occurs. We tested the proposed method on the well-known leukemia data set. On this data set, our method obtained two 2-gene DGSs that achieved 94.1% testing accuracy and a 4-gene DGS that achieved 97.1% testing accuracy. These results show that our method obtained better accuracy with much smaller DGSs than three widely used methods, i.e., T-statistics, F-statistics, and SVM-based recursive feature elimination (SVM-RFE).
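The search scheme described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a nearest-centroid classifier stands in for the SVM so that the example needs only the Python standard library, the t-statistic is taken to be the absolute Welch form, the top-ranked gene is used as the default seed, a `max_genes` cap stands in for the unspecified overfitting check, and the search simply stops when the misclassified samples do not cover both classes. The names `t_stat`, `best_gene`, `centroid_predict`, and `am_dgs` are hypothetical, as is the toy data.

```python
from math import sqrt
from statistics import mean, variance


def t_stat(a, b):
    """Absolute Welch t-statistic between two lists of expression values."""
    va = variance(a) if len(a) > 1 else 0.0  # assumption: zero variance for one sample
    vb = variance(b) if len(b) > 1 else 0.0
    denom = sqrt(va / len(a) + vb / len(b)) or 1e-12  # guard against division by zero
    return abs(mean(a) - mean(b)) / denom


def best_gene(X, y, rows, exclude):
    """Gene with the largest t-statistic over the given sample rows, or None."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]
    candidates = [g for g in range(len(X[0])) if g not in exclude]
    if not pos or not neg or not candidates:
        return None  # assumption: give up if one class is absent
    return max(candidates,
               key=lambda g: t_stat([X[i][g] for i in pos], [X[i][g] for i in neg]))


def centroid_predict(X, y, genes):
    """Stand-in for the SVM: nearest class centroid on the selected genes."""
    cents = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        cents[label] = [mean(r[g] for r in rows) for g in genes]

    def dist(x, c):
        return sum((x[g] - cj) ** 2 for g, cj in zip(genes, c))

    return [min((0, 1), key=lambda lab: dist(x, cents[lab])) for x in X]


def am_dgs(X, y, fit_predict=centroid_predict, seed=None, max_genes=10):
    """Greedy search: grow the gene set from the samples the classifier gets wrong."""
    if seed is None:
        seed = best_gene(X, y, range(len(X)), exclude=[])
    genes = [seed]
    while True:
        pred = fit_predict(X, y, genes)
        wrong = [i for i, (p, t) in enumerate(zip(pred, y)) if p != t]
        if not wrong or len(genes) >= max_genes:
            break  # no training error left, or size cap reached
        nxt = best_gene(X, y, wrong, exclude=genes)
        if nxt is None:
            break
        genes.append(nxt)
    return genes, wrong


# Toy data (made up): columns are [noise gene, rescue gene, main gene].
X = [
    [0.5, 0.0, 0.0], [0.4, 0.0, 0.2], [0.6, 0.0, 0.1], [0.5, -3.0, 1.6],
    [0.5, 0.0, 2.0], [0.6, 0.0, 2.1], [0.4, 0.0, 1.9], [0.5, 3.0, 0.4],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

print(am_dgs(X, y))  # ([2, 1], [])
```

On the toy data, gene 2 is chosen as the seed, samples 3 and 7 are misclassified, and gene 1 is added because it best separates those two samples. Because `fit_predict` is a parameter, an SVM-based function with the same signature could be passed in instead of the stand-in classifier.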
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chu, F., Wang, L. (2006). Active Mining Discriminative Gene Sets. In: Rutkowski, L., Tadeusiewicz, R., Zadeh, L.A., Żurada, J.M. (eds) Artificial Intelligence and Soft Computing – ICAISC 2006. Lecture Notes in Computer Science, vol 4029. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11785231_92
DOI: https://doi.org/10.1007/11785231_92
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-35748-3
Online ISBN: 978-3-540-35750-6
eBook Packages: Computer Science (R0)