Improved Contraction-Expansion Subspace Ensemble for High-Dimensional Imbalanced Data Classification

Published: 02 April 2024

Abstract

Imbalanced data biases a classifier towards the majority class, and when combined with high dimensionality, classification performance degrades further. Existing research on skewed data mainly involves resampling, cost-sensitive learning, and classifier ensembles. However, these approaches have limitations: 1) resampling suffers from the noisy and redundant features of high-dimensional skewed data; 2) it is difficult for cost-sensitive learning to construct an optimal misclassification cost matrix; 3) ensembles built on random feature subspaces easily lose information; 4) ensembles built on sample subspaces of small data sets tend to describe the sample space insufficiently and suffer from the negative impact of high dimensionality. This paper proposes an improved contraction-expansion subspace ensemble (ICESE) for high-dimensional imbalanced data classification. First, a contraction-expansion subspace optimization (CESO) is designed to perform subspace selection and transformation, enhancing the discrimination and diversity of the subspaces. Then, to strengthen classification capability, a CESO-based multilayer optimization structure is developed to construct the improved subspaces. Finally, to mitigate the effects of skewed data, ICESE applies a resampling scheme to each improved subspace, constructing a rebalanced subset for each base classifier. Experimental results on 24 high-dimensional imbalanced data sets demonstrate that ICESE outperforms mainstream ensemble systems in terms of F-score and G-mean.
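The abstract outlines a general pipeline: build discriminative feature subspaces, refine them, rebalance each subspace-projected subset, and train one base classifier per subset. The CESO optimization and multilayer refinement are not specified in the abstract, so the following is only a minimal sketch of that generic pipeline, assuming plain random feature subspaces and random undersampling as stand-ins; the `SubspaceResampleEnsemble` class and `f_score_g_mean` helper are illustrative names, not the authors' code.

```python
# Minimal sketch of a subspace + resampling ensemble for high-dimensional
# imbalanced binary data (labels: 1 = minority, 0 = majority). Random
# subspaces and random undersampling stand in for the paper's CESO
# subspace optimization and multilayer refinement, which the abstract
# does not specify.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class SubspaceResampleEnsemble:
    def __init__(self, n_estimators=20, subspace_frac=0.3, seed=0):
        self.n_estimators = n_estimators
        self.subspace_frac = subspace_frac
        self.rng = np.random.default_rng(seed)
        self.members = []  # list of (feature_indices, fitted_classifier)

    def fit(self, X, y):
        n_features = X.shape[1]
        k = max(1, int(self.subspace_frac * n_features))
        minority = np.flatnonzero(y == 1)
        majority = np.flatnonzero(y == 0)  # assumed larger than minority
        for _ in range(self.n_estimators):
            # Feature subspace (stand-in for a CESO-optimized subspace).
            feats = self.rng.choice(n_features, size=k, replace=False)
            # Rebalanced subset: undersample the majority class to the
            # minority-class size, so each base learner sees balanced data.
            maj = self.rng.choice(majority, size=len(minority), replace=False)
            idx = np.concatenate([minority, maj])
            clf = DecisionTreeClassifier(random_state=0)
            clf.fit(X[np.ix_(idx, feats)], y[idx])
            self.members.append((feats, clf))
        return self

    def predict(self, X):
        # Simple majority vote over the ensemble members.
        votes = np.stack([clf.predict(X[:, feats])
                          for feats, clf in self.members])
        return (votes.mean(axis=0) >= 0.5).astype(int)

def f_score_g_mean(y_true, y_pred):
    """F-score and G-mean, the two metrics named in the evaluation."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0          # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    g_mean = float(np.sqrt(recall * specificity))
    return f1, g_mean
```

A toy run would call `SubspaceResampleEnsemble().fit(X_train, y_train)` and then `f_score_g_mean(y_test, model.predict(X_test))`. G-mean, the geometric mean of sensitivity and specificity, stays low whenever either class is poorly recognized, which plain accuracy can hide on skewed data.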



Published In

IEEE Transactions on Knowledge and Data Engineering, Volume 36, Issue 10, Oct. 2024, 446 pages

Publisher

IEEE Educational Activities Department, United States
