Resampling Multilabel Datasets by Decoupling Highly Imbalanced Labels

Francisco Charte, Antonio Rivera, María José del Jesus & Francisco Herrera

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9121)

Included in the following conference series:

International Conference on Hybrid Artificial Intelligence Systems

Abstract

Multilabel classification has been broadly studied in recent years. However, learning from imbalanced multilabel datasets (MLDs) has only recently been addressed. A few proposals can be found in the literature, most of them based on resampling techniques adapted from traditional classification. The success of these methods varies widely depending on the traits of the chosen MLDs.

One characteristic that significantly influences the behavior of multilabel resampling algorithms is the joint appearance of minority and majority labels in the same instances. It has been demonstrated that MLDs with a high level of concurrence among imbalanced labels can hardly benefit from resampling methods. This paper proposes an original resampling algorithm, called REMEDIAL, which is based neither on removing majority instances nor on creating minority ones, but on a procedure to decouple highly imbalanced labels. As will be experimentally demonstrated, this is an interesting approach for certain MLDs.
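
The decoupling idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how instances are selected, so the rule used here (split any instance that carries both a label rarer than average and a label at least as common as average) is an assumption made for illustration, as are the function names `imbalance_ratios` and `decouple`. Each selected instance is replaced by two copies with the same features, one keeping only its minority labels and one keeping only its majority labels.

```python
from typing import List, Tuple

# An instance is (features, binary label vector). Hypothetical layout.
Instance = Tuple[List[float], List[int]]


def imbalance_ratios(labelsets: List[List[int]]) -> List[float]:
    """Per-label ratio: count of the most frequent label / count of this label."""
    n_labels = len(labelsets[0])
    counts = [sum(row[j] for row in labelsets) for j in range(n_labels)]
    top = max(counts)
    # Labels that never appear get an infinite ratio.
    return [top / c if c > 0 else float("inf") for c in counts]


def decouple(data: List[Instance]) -> List[Instance]:
    """Split instances in which minority and majority labels appear together."""
    ratios = imbalance_ratios([y for _, y in data])
    finite = [r for r in ratios if r != float("inf")]
    if not finite:
        return [(list(x), list(y)) for x, y in data]
    mean_ir = sum(finite) / len(finite)
    # Assumption: a label is "minority" if its ratio exceeds the mean ratio.
    minority = [r > mean_ir for r in ratios]

    out: List[Instance] = []
    for x, y in data:
        has_min = any(a and m for a, m in zip(y, minority))
        has_maj = any(a and not m for a, m in zip(y, minority))
        if has_min and has_maj:
            # First copy keeps only the minority labels...
            out.append((list(x), [a if m else 0 for a, m in zip(y, minority)]))
            # ...second copy keeps only the majority labels.
            out.append((list(x), [0 if m else a for a, m in zip(y, minority)]))
        else:
            out.append((list(x), list(y)))
    return out


if __name__ == "__main__":
    toy = [
        ([0.1], [1, 0, 0]),
        ([0.2], [1, 1, 0]),
        ([0.3], [1, 1, 0]),
        ([0.4], [1, 0, 1]),  # frequent label 0 together with rare label 2
    ]
    for features, labels in decouple(toy):
        print(features, labels)
```

In this toy data only the last instance is split, so no instance is removed and no synthetic feature vector is created; the dataset grows by one row whose labels are separated.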

Notes

1. Visualizing all label interactions in an MLD is, in some cases, almost impossible due to the large number of labels. For that reason, only the most frequent labels and the rarest ones for each MLD are represented in these plots. High-resolution versions of these plots can be found at http://simidat.ujaen.es/remedial, and they can be generated using the mldr R package [32].

References

Tsoumakas, G., Katakis, I., Vlahavas, I.: Mining multi-label data. In: Maimon, O., Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook, Ch. 34, pp. 667–685. Springer, Boston (2010). doi:10.1007/978-0-387-09823-4_34
Klimt, B., Yang, Y.: The enron corpus: a new dataset for email classification research. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) ECML 2004. LNCS (LNAI), vol. 3201, pp. 217–226. Springer, Heidelberg (2004). doi:10.1007/978-3-540-30115-8_22
Turnbull, D., Barrington, L., Torres, D., Lanckriet, G.: Semantic annotation and retrieval of music and sound effects. IEEE Audio Speech Lang. Process. 16(2), 467–476 (2008). doi:10.1109/TASL.2007.913750
Duygulu, P., Barnard, K., de Freitas, J.F.G., Forsyth, D.: Object recognition as machine translation: learning a lexicon for a fixed image vocabulary. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002, Part IV. LNCS, vol. 2353, pp. 97–112. Springer, Heidelberg (2002). doi:10.1007/3-540-47979-1_7
Chawla, N.V., Japkowicz, N., Kotcz, A.: Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor. Newsl. 6(1), 1–6 (2004). doi:10.1145/1007730.1007733
García, V., Sánchez, J., Mollineda, R.: On the effectiveness of preprocessing methods when dealing with different levels of class imbalance. Knowl. Based Syst. 25(1), 13–21 (2012). http://dx.doi.org/10.1016/j.knosys.2011.06.013
Charte, F., Rivera, A., del Jesus, M.J., Herrera, F.: A first approach to deal with imbalance in multi-label datasets. In: Pan, J.-S., Polycarpou, M.M., Woźniak, M., de Carvalho, A.C.P.L.F., Quintián, H., Corchado, E. (eds.) HAIS 2013. LNCS, vol. 8073, pp. 150–160. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40846-5_16
Giraldo-Forero, A.F., Jaramillo-Garzón, J.A., Ruiz-Muñoz, J.F., Castellanos-Domínguez, C.G.: Managing imbalanced data sets in multi-label problems: a case study with the SMOTE algorithm. In: Ruiz-Shulcloper, J., Sanniti di Baja, G. (eds.) CIARP 2013, Part I. LNCS, vol. 8258, pp. 334–342. Springer, Heidelberg (2013). doi:10.1007/978-3-642-41822-8_42
Charte, F., Rivera, A.J., del Jesus, M.J., Herrera, F.: Addressing imbalance in multilabel classification: measures and random resampling algorithms. Neurocomputing (to be published)
Charte, F., Rivera, A.J., del Jesus, M.J., Herrera, F.: MLeNN: a first approach to heuristic multilabel undersampling. In: Corchado, E., Lozano, J.A., Quintián, H., Yin, H. (eds.) IDEAL 2014. LNCS, vol. 8669, pp. 1–9. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10840-7_1
Tahir, M.A., Kittler, J., Yan, F.: Inverse random under sampling for class imbalance problem and its application to multi-label classification. Pattern Recogn. 45(10), 3738–3750 (2012). doi:10.1016/j.patcog.2012.03.014
Tahir, M.A., Kittler, J., Bouridane, A.: Multilabel classification using heterogeneous ensemble of multi-label classifiers. Pattern Recogn. Lett. 33(5), 513–523 (2012). doi:10.1016/j.patrec.2011.10.019
Charte, F., Rivera, A., del Jesus, M.J., Herrera, F.: Concurrence among imbalanced labels and its influence on multilabel resampling algorithms. In: Polycarpou, M., de Carvalho, A.C.P.L.F., Pan, J.-S., Woźniak, M., Quintian, H., Corchado, E. (eds.) HAIS 2014. LNCS, vol. 8480, pp. 110–121. Springer, Heidelberg (2014)
Diplaris, S., Tsoumakas, G., Mitkas, P.A., Vlahavas, I.P.: Protein classification with multiple algorithms. In: Bozanis, P., Houstis, E.N. (eds.) PCI 2005. LNCS, vol. 3746, pp. 448–456. Springer, Heidelberg (2005). doi:10.1007/11573036_42
Elisseeff, A., Weston, J.: A kernel method for multi-labelled classification. In: Dietterich, G., Becker, S., Ghahramani, Z. (eds.) Advances in Neural Information Processing Systems 14, vol. 14, pp. 681–687. MIT Press, Cambridge (2001)
Crammer, K., Dredze, M., Ganchev, K., Talukdar, P.P., Carroll, S.: Automatic code assignment to medical text. In: Proceedings of the Workshop on Biological, Translational, and Clinical Language Processing, BioNLP 2007. Prague, Czech Republic, pp. 129–136 (2007)
Godbole, S., Sarawagi, S.: Discriminative methods for multi-labeled classification. In: Dai, H., Srikant, R., Zhang, C. (eds.) PAKDD 2004. LNCS (LNAI), vol. 3056, pp. 22–30. Springer, Heidelberg (2004). doi:10.1007/978-3-540-24775-3_5
Boutell, M., Luo, J., Shen, X., Brown, C.: Learning multi-label scene classification. Pattern Recogn. 37(9), 1757–1771 (2004). doi:10.1016/j.patcog.2004.03.009
Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label classification. Mach. Learn. 85, 333–359 (2011). doi:10.1007/s10994-011-5256-5
Tsoumakas, G., Katakis, I., Vlahavas, I.: Effective and efficient multilabel classification in domains with large number of labels. In: Proceedings of the ECML/PKDD Workshop on Mining Multidimensional Data, MMD 2008. Antwerp, Belgium, pp. 30–44 (2008)
Zhang, M., Zhou, Z.: ML-KNN: a lazy learning approach to multi-label learning. Pattern Recogn. 40(7), 2038–2048 (2007). doi:10.1016/j.patcog.2006.12.019
Clare, A.J., King, R.D.: Knowledge discovery in multi-label phenotype data. In: Siebes, A., De Raedt, L. (eds.) PKDD 2001. LNCS (LNAI), vol. 2168, p. 42. Springer, Heidelberg (2001). doi:10.1007/3-540-44794-6_4
Zhang, M.-L.: Multilabel neural networks with applications to functional genomics and text categorization. IEEE Trans. Knowl. Data Eng. 18(10), 1338–1351 (2006). doi:10.1109/TKDE.2006.162
Zhang, M., Zhou, Z.: A review on multi-label learning algorithms. IEEE Trans. Knowl. Data Eng. 26(8), 1819–1837 (2014). doi:10.1109/TKDE.2013.39
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002). doi:10.1613/jair.953
Kotsiantis, S.B., Pintelas, P.E.: Mixture of expert agents for handling imbalanced data sets. Ann. Math. Comput. Teleinformatics 1, 46–55 (2003)
López, V., Fernández, A., García, S., Palade, V., Herrera, F.: An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf. Sci. 250, 113–141 (2013). doi:10.1016/j.ins.2013.07.007
Provost, F., Fawcett, T.: Robust classification for imprecise environments. Mach. Learn. 42, 203–231 (2001). doi:10.1023/A:1007601015854
He, J., Gu, H., Liu, W.: Imbalanced multi-modal multi-label learning for subcellular localization prediction of human proteins with both single and multiple sites. PloS one 7(6), 7155 (2012). doi:10.1371/journal.pone.0037155
Li, C., Shi, G.: Improvement of learning algorithm for the multi-instance multi-label rbf neural networks trained with imbalanced samples. J. Inf. Sci. Eng. 29(4), 765–776 (2013)
Tepvorachai, G., Papachristou, C.: Multi-label imbalanced data enrichment process in neural net classifier training. In: IEEE International Joint Conference on Neural Networks, IJCNN 2008, pp. 1301–1307 (2008). doi:10.1109/IJCNN.2008.4633966
Charte, F., Charte, F.D.: How to work with multilabel datasets in R using the mldr package. doi:10.6084/m9.figshare.1356035
Cheng, W., Hüllermeier, E.: Combining instance-based learning and logistic regression for multilabel classification. Mach. Learn. 76(2–3), 211–225 (2009). doi:10.1007/s10994-009-5127-5

Acknowledgments

F. Charte is supported by the Spanish Ministry of Education under the FPU National Program (Ref. AP2010-0068). This work was partially supported by the Spanish Ministry of Science and Technology under projects TIN2011-28488 and TIN2012-33856, and the Andalusian regional projects P10-TIC-06858 and P11-TIC-7765.

Author information

Authors and Affiliations

Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
Francisco Charte & Francisco Herrera
Department of Computer Science, University of Jaén, Jaén, Spain
Antonio Rivera & María José del Jesus

Corresponding author

Correspondence to Francisco Charte.

Editor information

Editors and Affiliations

University of Deusto, Bilbao, Spain
Enrique Onieva
University of Deusto, Bilbao, Spain
Igor Santos
University of Deusto, Bilbao, Spain
Eneko Osaba
Universidad de Salamanca, Salamanca, Spain
Héctor Quintián
Universidad de Salamanca, Salamanca, Spain
Emilio Corchado

About this paper

Cite this paper

Charte, F., Rivera, A., del Jesus, M.J., Herrera, F. (2015). Resampling Multilabel Datasets by Decoupling Highly Imbalanced Labels. In: Onieva, E., Santos, I., Osaba, E., Quintián, H., Corchado, E. (eds) Hybrid Artificial Intelligent Systems. HAIS 2015. Lecture Notes in Computer Science, vol 9121. Springer, Cham. https://doi.org/10.1007/978-3-319-19644-2_41

DOI: https://doi.org/10.1007/978-3-319-19644-2_41
Published: 29 May 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19643-5
Online ISBN: 978-3-319-19644-2
eBook Packages: Computer Science, Computer Science (R0)
