Abstract
We study ensembles of simple threshold classifiers for the categorization of high-dimensional data of low cardinality and give a compression bound on their prediction risk. Two approaches are utilized to produce such classifiers. One is based on univariate feature selection employing the area under the ROC curve as ranking criterion. The other approach uses a greedy selection strategy. The methods are applied to artificial data, published microarray expression profiles, and highly imbalanced data.
Similar content being viewed by others
References
Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci 96(12): 6745–6750
Anthony M, Biggs N (1992) Computational learning theory. Cambridge University Press, Cambridge
Bittner M, Meltzer P, Chen Y, Jiang Y, Seftor E, Hendrix M, Radmacher M, Simon R, Yakhini Z, Ben-Dor A, Sampas N, Dougherty E, Wang E, Marincola F, Gooden C, Lueders J, Glatfelter A, Pollock P, Carpten J, Gillanders E, Leja D, Dietrich K, Beaudry C, Berens M, Alberts D, Sondak V (2000) Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 406(6795): 536–540
Blum A, Langford J (2003) PAC-MDL bounds. In: Schölkopf Bernhard, Warmuth Manfred K (eds) COLT. Springer, Berlin, pp 344–357
Breiman L (1996) Bagging predictors. Mach Learn 24(2): 123–140
Breiman L (1998) Arcing classifiers. Annal Stat 26(3): 801–824
Breiman L (2001) Random forests. Mach Learn 45(1): 5–32
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth Publishing Company, Belmont
Buchholz M, Kestler HA, Bauer A, Böck W, Rau B, Leder G, Kratzer W, Bommer M, Scarpa A, Schilling MK, Adler G, Hoheisel JD, Gress TM (2005) Specialized DNA arrays for the differentiation of pancreatic tumors. Clin Cancer Res 11(22): 8048–8054
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3): 273–297
Cover TM (1965) Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans Electron Comput 14(3): 326–334
Duch W (2004) Filter methods. In: Guyon I, Gunn S, Nikravesh M, Zadeh LA (eds) Feature extraction, foundations and applications. Springer, Berlin, pp 89–118
Floyd S, Warmuth MK (1995) Sample compression, learnability, and the Vapnik-Chervonenkis dimension. Mach Learn 21(3): 269–304
Freund Y (1995) Boosting a weak learning algorithm by majority. Inf Comput 121(2): 256–285
Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16(10): 906–914
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439): 531–537
Graepel T, Herbrich R, Shawe-Taylor J (2005) PAC-Bayesian compression bounds on the prediction error of learning algorithms for classification. Mach Learn 59(1–2): 55–76
Hanley JA, Mcneil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1): 29–36
Haussler D (1988) Quantifying inductive bias: AI learning algorithms and Valiant’s learning framework. Artif Intell 36(2): 177–221
Kearns MJ, Vazirani UV (1994) An introduction to computational learning theory. MIT, Cambridge
Klivans AR, Servedio RA (2006) Toward attribute efficient learning of decision lists and parities. J Mach Learn Res 7: 587–602
Kuncheva LI, Whitaker CJ, Shipp CA (2003) Limits on the majority vote accuracy in classifier fusion. Pattern Anal Appl 6(1): 22–31
Lai C, Reinders MJT, van’t Veer LJ, Wessels LFA (2006) A comparison of univariate and multivariate gene selection techniques for classification of cancer datasets. BMC Bioinf 7: 235
Langford J (2005) Tutorial on practical prediction theory for classification. J Mach Learn Res 6: 273–306
Laviolette F, Marchand M, Shah M (2005) Margin-sparsity trade-off for the set covering machine. In: Proceedings of the 16th European conference on machine learning. Springer, pp 206–217
Laviolette F, Marchand M, Shah M, Shanian S (2010) Learning the set covering machine by bound minimization and margin-sparsity trade-off. Mach Learn 78(1–2): 175–201
Littlestone N, Warmuth M (1986) Relating data compression and learnability. Unpublished manuscript
Marchand M, Shah M (2004) PAC-Bayes learning of conjunctions and classification of gene-expression data. In: Proceedings of the 18th annual conference on neural information processing systems. MIT, Cambridge, pp 881–888
Marchand M, Shawe-Taylor J (2002) The set covering machine. J Mach Learn Res 3(4–5): 723–746
McAllester DA (1999) PAC-Bayesian model averaging. In: COLT ’99: Proceedings of the twelfth annual conference on computational learning theory. ACM, pp 164–170
Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M, McLaughlin ME, Kim JYH, Goumnerova LC, Black PM, Lau C, Allen JC, Zagzag D, Olson JM, Curran T, Wetmore C, Biegel JA, Poggio T, Mukherjee S, Rifkin R, Califano A, Stolovitzky G, Louis DN, Mesirov JP, Lander ES, Golub TR (2002) Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 415(6870): 436–442
Ruschhaupt M, Huber W, Poustka A, Mansmann U (2004) A compendium to ensure computational reproducibility in high-dimensional classification tasks. Stat Appl Genet Mol Biol 3(1): 37
Vapnik V (1998) Statistical learning theory. Wiley, New York
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Kestler, H.A., Lausser, L., Lindner, W. et al. On the fusion of threshold classifiers for categorization and dimensionality reduction. Comput Stat 26, 321–340 (2011). https://doi.org/10.1007/s00180-011-0243-7
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00180-011-0243-7