Improved Contraction-Expansion Subspace Ensemble for High-Dimensional Imbalanced Data Classification

Published: 02 April 2024

Abstract

Imbalanced data biases a classifier towards the majority class, and when combined with high dimensionality, classification performance degrades further. Existing research on skewed data mainly involves resampling, cost-sensitive learning, and classifier ensembles. However, these approaches have limitations: 1) resampling suffers from the noisy and redundant features of high-dimensional skewed data; 2) it is difficult for cost-sensitive learning to construct an optimal misclassification cost matrix; 3) ensembles built on random feature subspaces easily lose information; 4) ensembles built on sample subspaces of small data sets tend to describe the sample space insufficiently and suffer from the negative impact of high dimensionality. This paper proposes an improved contraction-expansion subspace ensemble (ICESE) for high-dimensional imbalanced data classification. First, a contraction-expansion subspace optimization (CESO) is designed to perform subspace selection and transformation, enhancing the discrimination and diversity of the subspaces. Then, to strengthen classification capability, a CESO-based multilayer optimization structure is developed to construct the improved subspaces. Finally, to mitigate the effects of skewed data, ICESE applies a resampling scheme to each improved subspace, constructing a rebalanced subset for each base classifier. Experimental results on 24 high-dimensional imbalanced data sets demonstrate that ICESE outperforms mainstream ensemble systems in terms of F-score and G-mean.
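The abstract outlines a general pipeline: build discriminative feature subspaces, refine them, rebalance each subspace-projected subset, and train one base classifier per subset. The CESO optimization and multilayer refinement are not specified in the abstract, so the following is only a minimal sketch of that generic pipeline, assuming plain random feature subspaces and random undersampling as stand-ins; the `SubspaceResampleEnsemble` class and `f_score_g_mean` helper are illustrative names, not the authors' code.

```python
# Minimal sketch of a subspace + resampling ensemble for high-dimensional
# imbalanced binary data (labels: 1 = minority, 0 = majority). Random
# subspaces and random undersampling stand in for the paper's CESO
# subspace optimization and multilayer refinement, which the abstract
# does not specify.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class SubspaceResampleEnsemble:
    def __init__(self, n_estimators=20, subspace_frac=0.3, seed=0):
        self.n_estimators = n_estimators
        self.subspace_frac = subspace_frac
        self.rng = np.random.default_rng(seed)
        self.members = []  # list of (feature_indices, fitted_classifier)

    def fit(self, X, y):
        n_features = X.shape[1]
        k = max(1, int(self.subspace_frac * n_features))
        minority = np.flatnonzero(y == 1)
        majority = np.flatnonzero(y == 0)  # assumed larger than minority
        for _ in range(self.n_estimators):
            # Feature subspace (stand-in for a CESO-optimized subspace).
            feats = self.rng.choice(n_features, size=k, replace=False)
            # Rebalanced subset: undersample the majority class to the
            # minority-class size, so each base learner sees balanced data.
            maj = self.rng.choice(majority, size=len(minority), replace=False)
            idx = np.concatenate([minority, maj])
            clf = DecisionTreeClassifier(random_state=0)
            clf.fit(X[np.ix_(idx, feats)], y[idx])
            self.members.append((feats, clf))
        return self

    def predict(self, X):
        # Simple majority vote over the ensemble members.
        votes = np.stack([clf.predict(X[:, feats])
                          for feats, clf in self.members])
        return (votes.mean(axis=0) >= 0.5).astype(int)

def f_score_g_mean(y_true, y_pred):
    """F-score and G-mean, the two metrics named in the evaluation."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0          # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    g_mean = float(np.sqrt(recall * specificity))
    return f1, g_mean
```

A toy run would call `SubspaceResampleEnsemble().fit(X_train, y_train)` and then `f_score_g_mean(y_test, model.predict(X_test))`. G-mean, the geometric mean of sensitivity and specificity, stays low whenever either class is poorly recognized, which plain accuracy can hide on skewed data.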



Published In

IEEE Transactions on Knowledge and Data Engineering, Volume 36, Issue 10, Oct. 2024, 446 pages

Publisher

IEEE Educational Activities Department, United States
