Machine Learning for Imbalanced Data: Tackle imbalanced datasets using machine learning and deep learning techniques
As machine learning practitioners, we often encounter imbalanced datasets in which one class has considerably fewer instances than the other. Many machine learning algorithms assume an equilibrium between majority and minority classes, leading to suboptimal performance on imbalanced data. This comprehensive guide helps you address this class imbalance to significantly improve model performance.

Machine Learning for Imbalanced Data begins by introducing you to the challenges posed by imbalanced datasets and the importance of addressing these issues. It then guides you through techniques that enhance the performance of classical machine learning models when using imbalanced data, including various sampling and cost-sensitive learning methods.

As you progress, you’ll delve into similar and more advanced techniques for deep learning models, employing PyTorch as the primary framework. Throughout the book, hands-on examples will provide working and reproducible code that’ll demonstrate the practical implementation of each technique.

By the end of this book, you’ll be adept at identifying and addressing class imbalances and confidently applying various techniques, including sampling, cost-sensitive techniques, and threshold adjustment, while using traditional machine learning or deep learning models.

Release date: Nov 30, 2023

Abhishek Kumar

Dr. Abhishek Kumar is a post-doctorate fellow in computer science at Ingenium Research Group, based at Universidad De Castilla-La Mancha in Spain. He has been teaching in academia for more than 8 years, and published more than 50 articles in reputed, peer reviewed national and international journals, books, and conferences. His research area includes artificial intelligence, image processing, computer vision, data mining, and machine learning.

    Machine Learning for Imbalanced Data - Abhishek Kumar


    Machine Learning for Imbalanced Data

    About the authors

    Kumar Abhishek is a seasoned senior machine learning engineer at Expedia Group, US, specializing in risk analysis and fraud detection. With over a decade of machine learning and software engineering experience, Kumar has worked for companies such as Microsoft, Amazon, and a Bay Area start-up. Kumar holds a master’s degree in computer science from the University of Florida, Gainesville.

    To my incredible wife who has been my rock and constant source of inspiration, our adorable son who fills our lives with joy, my wonderful parents for their unwavering support, and my close friends. Immense thanks to Christian, who has been a pivotal mentor and guide, for his meticulous reviews. My deepest gratitude to my co-author, Mounir, and contributor, Anshul; their dedication and solid contributions were essential in shaping this book. Lastly, I extend my sincere appreciation to Abhiram and the Packt team for their unwavering support.

    Dr. Mounir Abdelaziz is a deep learning researcher specializing in computer vision applications. He holds a Ph.D. in computer science and technology from Central South University, China. During his Ph.D. journey, he developed innovative algorithms to address practical computer vision challenges. He has also authored numerous research articles in the field of few-shot learning for image classification.

    I would like to thank my family, especially my parents, for their support and encouragement. I also want to thank all the fantastic people I collaborated with, including my co-author, Packt editors, and reviewers. Without their help, writing this book wouldn’t have been possible.

    Other contributor

    Anshul Yadav is a software developer and trainer with a keen interest in machine learning, web development, and theoretical computer science. He likes to solve technical problems: the slinkier, the better. He has a B.Tech. degree in computer science and engineering from IIT Kanpur. Anshul loves to share the joy of learning with his audience.

    About the reviewers

    Christian Monson has nine years of industry experience working as a machine learning scientist specializing in Natural Language Processing (NLP) and speech recognition. For five of those years, he worked at Amazon improving the Alexa personal assistant. During the 2000s, he was a graduate student at Carnegie Mellon University and a postdoc at Oregon Health and Science University working on NLP. Christian completed his bachelor’s degree in computer science, with minors in math and physics, at Brigham Young University in 2000. In his free time, Christian creates video games and plays with his kids. Currently, he is a full-time tutor and mentor in machine learning. You can find Christian at or watch his videos at .

    Abhiram Jagarlapudi is a principal software engineer with 10 years of experience in cloud computing and Artificial Intelligence (AI). At Amazon Web Services and Oracle Cloud, Abhiram was part of launching several public cloud services, later specializing in cloud AI services. He was part of a small team that built the software delivery infrastructure of Oracle Cloud, which started in 2016 and has since grown into a multi-billion-dollar business. He also designed and developed AI services for the Oracle Cloud and is passionate about applying that experience to improve and accelerate the delivery of machine learning.

    Table of Contents



    Introduction to Data Imbalance in Machine Learning

    Technical requirements

    Introduction to imbalanced datasets

    Machine learning 101

    What happens during model training?

    Types of datasets and splits


    Common evaluation metrics

    Confusion matrix


    Precision-Recall curve

    Relation between the ROC curve and PR curve

    Challenges and considerations when dealing with imbalanced data

    When can we have an imbalance in datasets?

    Why can imbalanced data be a challenge?

    When to not worry about data imbalance

    Introduction to the imbalanced-learn library

    General rules to follow





    Oversampling Methods

    Technical requirements

    What is oversampling?

    Random oversampling

    Problems with random oversampling


    How SMOTE works

    Problems with SMOTE

    SMOTE variants



    Working of ADASYN

    Categorical features and SMOTE variants (SMOTE-NC and SMOTEN)

    Model performance comparison of various oversampling methods

    Guidance for using various oversampling techniques

    When to avoid oversampling

    Oversampling in multi-class classification





    Undersampling Methods

    Technical requirements

    Introducing undersampling

    When to avoid undersampling the majority class

    Fixed versus cleaning undersampling

    Undersampling approaches

    Removing examples uniformly

    Random UnderSampling


    Strategies for removing noisy observations

    ENN, RENN, and AllKNN

    Tomek links

    Neighborhood Cleaning Rule

    Instance hardness threshold

    Strategies for removing easy observations

    Condensed Nearest Neighbors

    One-sided selection

    Combining undersampling and oversampling

    Model performance comparison





    Ensemble Methods

    Technical requirements

    Bagging techniques for imbalanced data




    Comparative performance of bagging methods

    Boosting techniques for imbalanced data


    RUSBoost, SMOTEBoost, and RAMOBoost

    Ensemble of ensembles


    Comparative performance of boosting methods

    Model performance comparison





    Cost-Sensitive Learning

    Technical requirements

    The concept of Cost-Sensitive Learning

    Costs and cost functions

    Types of cost-sensitive learning

    Difference between CSL and resampling

    Problems with rebalancing techniques

    Understanding costs in practice

    Cost-Sensitive Learning for logistic regression

    Cost-Sensitive Learning for decision trees

    Cost-Sensitive Learning using scikit-learn and XGBoost models

    MetaCost – making any classification model cost-sensitive

    Threshold adjustment

    Methods for threshold tuning





    Data Imbalance in Deep Learning

    Technical requirements

    A brief introduction to deep learning

    Neural networks


    Activation functions


    Feedforward neural networks

    Training neural networks

    The effect of the learning rate on data imbalance

    Image processing using Convolutional Neural Networks

    Text analysis using Natural Language Processing

    Data imbalance in deep learning

    The impact of data imbalance on deep learning models

    Overview of deep learning techniques to handle data imbalance

    Multi-label classification





    Data-Level Deep Learning Methods

    Technical requirements

    Preparing the data

    Creating the training loop

    Sampling techniques for deep learning models

    Random oversampling

    Dynamic sampling

    Data augmentation techniques for vision

    Data-level techniques for text classification

    Dataset and baseline model

    Document-level augmentation

    Character and word-level augmentation

    Discussion of other data-level deep learning methods and their key ideas

    Two-phase learning

    Expansive Over-Sampling

    Using generative models for oversampling


    Neural style transfer





    Algorithm-Level Deep Learning Techniques

    Technical requirements

    Motivation for algorithm-level techniques

    Weighting techniques

    Using PyTorch’s weight parameter

    Handling textual data

    Deferred re-weighting – a minor variant of the class weighting technique

    Explicit loss function modification

    Focal loss

    Class-balanced loss

    Class-dependent temperature Loss

    Class-wise difficulty-balanced loss

    Discussing other algorithm-based techniques

    Regularization techniques

    Siamese networks

    Deeper neural networks

    Threshold adjustment





    Hybrid Deep Learning Methods

    Technical requirements

    Using graph machine learning for imbalanced data

    Understanding graphs

    Graph machine learning

    Dealing with imbalanced data

    Case study – the performance of XGBoost, MLP, and a GCN on an imbalanced dataset

    Hard example mining

    Online Hard Example Mining

    Minority class incremental rectification

    Utilizing the hard sample mining technique in minority class incremental rectification





    Model Calibration

    Technical requirements

    Introduction to model calibration

    Why bother with model calibration

    Models with and without well-calibrated probabilities

    Calibration curves or reliability plot

    Brier score

    Expected Calibration Error

    The influence of data balancing techniques on model calibration

    Plotting calibration curves for a model trained on a real-world dataset

    Model calibration techniques

    The calibration of model scores to account for sampling

    Platt’s scaling

    Isotonic regression

    Choosing between Platt’s scaling and Isotonic regression

    Temperature scaling

    Label smoothing

    The impact of calibration on a model’s performance





    Machine Learning Pipeline in Production

    Machine learning training pipeline

    Inferencing (online or batch)


    Chapter 1 – Introduction to Data Imbalance in Machine Learning

    Chapter 2 – Oversampling Methods

    Chapter 3 – Undersampling Methods

    Chapter 4 – Ensemble Methods

    Chapter 5 – Cost-Sensitive Learning

    Chapter 6 – Data Imbalance in Deep Learning

    Chapter 7 – Data-Level Deep Learning Methods

    Chapter 8 – Algorithm-Level Deep Learning Techniques

    Chapter 9 – Hybrid Deep Learning Methods

    Chapter 10 – Model Calibration


    Hello and welcome! Machine Learning (ML) enables computers to learn from data using algorithms to make informed decisions, automate tasks, and extract valuable insights. One particular aspect that often garners attention is imbalanced data, where certain classes may have considerably fewer samples than others.

    This book provides an in-depth guide to understanding and navigating the intricacies of skewed data. You will gain insights into best practices for managing imbalanced datasets in ML contexts.

    While imbalanced data can present challenges, it’s important to understand that the techniques to address this imbalance are not universally applicable. Their relevance and necessity depend on various factors such as the domain, the data distribution, the performance metrics you’re optimizing, and the business objectives. Before adopting any techniques, it’s essential to establish a baseline. Even if you don’t currently face issues with imbalanced data, it can be beneficial to be aware of the challenges and solutions discussed in this book. Familiarizing yourself with these techniques will provide you with a comprehensive toolkit, preparing you for scenarios that you may not yet know you’ll encounter. If you do find that model performance is lacking, especially for underrepresented (minority) classes, the insights and strategies covered in the book can be instrumental in guiding effective improvements.

    As the domains of ML and artificial intelligence continue to grow, there will be an increasing demand for professionals who can adeptly handle various data challenges, including imbalance. This book aims to equip you with the knowledge and tools to be one of those sought-after experts.

    Who this book is for

    This comprehensive book is thoughtfully tailored to meet the needs of a variety of professionals, including the following:

    ML researchers, ML scientists, ML engineers, and students: Professionals and learners in the fields of ML and deep learning who seek to gain valuable insights and practical knowledge for tackling the challenges posed by data imbalance

    Data scientists and analysts: Experienced data experts eager to expand their knowledge of handling skewed data with practical, real-world solutions

    Software engineers: Software engineers who want to effectively integrate ML and deep learning solutions into their applications when dealing with imbalanced data

    Practical insight seekers: Professionals and enthusiasts from various backgrounds who want to use hands-on, industry-relevant approaches for efficiently dealing with data imbalance in ML and deep learning, enabling them to excel in their respective roles

    What this book covers

    Chapter 1, Introduction to Data Imbalance in Machine Learning, explores data imbalance within the context of ML. This chapter explains the nature of imbalanced data and distinguishes it from other dataset types. It also introduces the essential components of ML and the model performance metrics most relevant when data is imbalanced. The chapter looks into the issues and concerns involved in dealing with imbalanced data, explaining when it can occur and why it can sometimes be a challenge. More importantly, we will go over the cases in which data imbalance is not worth worrying about at all. Finally, the chapter introduces the imbalanced-learn library and offers general guidelines for dealing with imbalanced datasets effectively.

    Chapter 2, Oversampling Methods, introduces the concept of oversampling, outlining when to employ it and when not to, and various techniques to augment imbalanced datasets. It guides you through the practical application of these techniques using the imbalanced-learn library and compares their performance across classical ML models. Practical advice on the effectiveness of these techniques in real-world scenarios concludes the chapter.

    Chapter 3, Undersampling Methods, presents the concept of undersampling as an effective approach for data balancing when standard oversampling isn’t an option. This chapter covers strategies to effectively remove examples from imbalanced data, different ways of addressing noisy observations, and procedures for handling easily categorized instances. We will also discuss when to avoid undersampling of the majority class.

    Chapter 4, Ensemble Methods, explores the application of ensemble techniques, including bagging and boosting, to enhance the performance of ML models. Moreover, it tackles the challenge of imbalanced datasets, where traditional ensemble methods may be ineffective, by combining the ensemble methods with the techniques introduced in previous chapters.

    Chapter 5, Cost-Sensitive Learning, explores alternatives to sampling techniques such as oversampling and undersampling. This chapter highlights the significance of cost-sensitive learning as an effective strategy to overcome the problem of imbalanced datasets. We also discuss threshold-tuning techniques, which can be very relevant in the context of data imbalance.

    Chapter 6, Data Imbalance in Deep Learning, presents the core concepts of deep learning and walks through the issues posed by imbalanced datasets. You will investigate typical types of imbalanced data challenges in various deep learning applications and develop an understanding of their impact.

    Chapter 7, Data-Level Deep Learning Methods, marks a transition from classical ML to deep learning, exploring the adaptation of familiar data-level sampling techniques and unveiling opportunities for enhancing these methods in the context of deep learning models. It dives into combining deep learning with oversampling and undersampling techniques, covering dynamic sampling and data augmentation for images and text. It emphasizes the fundamental differences between deep learning and classical ML, particularly the nature of the data they handle: classical ML typically works with structured, tabular data, whereas deep learning deals with unstructured data such as images, text, audio, and video. The chapter also explores techniques to address class imbalance in computer vision and their applicability to Natural Language Processing (NLP) problems.

    Chapter 8, Algorithm-Level Deep Learning Techniques, expands on the concepts from Chapter 5, Cost-Sensitive Learning, and applies them to deep learning models. We adapt deep learning models through loss function modifications using the PyTorch deep learning framework, ultimately enhancing model performance and enabling more effective predictions.

    Chapter 9, Hybrid Deep Learning Methods, explores innovative techniques that bridge the gap between data-level and algorithm-level methods from the previous two chapters. This chapter introduces the concept of graph ML and employs a real-world Facebook social network dataset to provide valuable insights and practical applications for addressing data imbalance in deep learning. We will also introduce the concept of hard mining loss and build upon it to explore a specialized technique called minority class incremental rectification, which combines hard mining with cross-entropy loss.

    Chapter 10, Model Calibration, takes a different angle on addressing data imbalance. Rather than focusing on data preprocessing or model building, this chapter highlights the post-processing of prediction scores obtained from trained models. Such post-processing can be valuable for both real-time predictions and offline model evaluation. The chapter offers insights into measuring the calibration of a model and explains why this aspect can be indispensable when dealing with imbalanced data. This is particularly important since data balancing techniques can often lead to model miscalibration.

    Appendix, Machine Learning Pipeline in Production, offers a foundational guide to constructing ML pipelines in production environments that encounter imbalanced data. This appendix provides a brief roadmap, going over the sequence and stage at which techniques for addressing data imbalance should be integrated.

    📌 Usage of techniques – In production tips

    Throughout this book, you will come across In production tip boxes like the following one, highlighting real-world applications of the techniques discussed:

    🚀 Class reweighting in production at OpenAI

    OpenAI was trying to solve the problem of bias in training data of the image generation model DALL-E 2 [1]. DALL-E 2 is trained on a massive dataset of images from the internet, which can contain biases. For example, the dataset may contain more images of men than women or more images of people from certain racial or ethnic groups than others.

    These snippets offer insights into how well-known companies grappled with data imbalance and what strategies they adopted to effectively navigate these challenges. For instance, the tip on OpenAI’s approach with DALL-E 2 sheds light on the intricate balance between filtering training data and inadvertently amplifying biases. Such examples underscore the importance of being both strategic and cautious when dealing with imbalanced data. To delve deeper into the specifics and understand the nitty-gritty of these implementations, you are encouraged to follow the company blog or paper links provided. These insights can provide a clearer understanding of how to adapt and apply techniques in varied real-world scenarios effectively.

    To get the most out of this book

    This book assumes some foundational knowledge of ML, deep learning, and Python programming. Some basic working knowledge of scikit-learn and PyTorch can be helpful, although they can be learned on the go.

    For the software requirements, you have two options to execute the code provided in this book. You can choose to either run the code within Google Colab online at or download the code to your local computer and execute it there. Google Colab provides a hassle-free option as it comes with all the necessary libraries pre-installed, so you don’t need to install anything on your local machine. All you need is a web browser to access Google Colab and a Google account. If you prefer to work locally, ensure that you have Python (3.6 or higher) installed, as well as the specified libraries such as PyTorch, torchvision, NumPy, and scikit-learn. A list of required libraries can be found in the GitHub repository of the book. These libraries are compatible with Windows, macOS, and Linux operating systems. A modern GPU can speed up the code execution for the deep learning chapters that appear later in the book; however, it’s not mandatory.
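For a local setup, a quick check along the following lines can confirm that the interpreter and libraries mentioned above are present. This is only a minimal sketch: the helper names `python_is_supported` and `missing_modules` are hypothetical, and the module list uses the usual import names for the libraries named in the text (PyTorch is imported as `torch`, scikit-learn as `sklearn`). The book's GitHub repository remains the authoritative list of requirements.

```python
import importlib.util
import sys

# Import names for the libraries named in the text
# (assumption: PyTorch -> torch, scikit-learn -> sklearn).
REQUIRED_MODULES = ["torch", "torchvision", "numpy", "sklearn"]


def python_is_supported(minimum=(3, 6)):
    """Return True if the running interpreter is at least `minimum`."""
    return sys.version_info >= minimum


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python version supported:", python_is_supported())
    print("Missing modules:", missing_modules(REQUIRED_MODULES))
```

An empty list of missing modules suggests the environment is ready; anything listed can be installed with your usual package manager.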

    If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

    We use numbered references such as [6]. To look one up, go to the References section at the end of that chapter and download the corresponding reference (paper, blog, or article), either by following the link (if provided) or by searching for it on Google Scholar.

    At the conclusion of each chapter, you will find a set of questions designed to test your comprehension of the material covered. We strongly encourage you to engage with these questions to reinforce your learning. Solutions or answers to selected questions can be found in Assessments towards the end of this book.

    Download the example code files

    You can download the example code files for this book from GitHub at If there’s an update to the code, it will be updated in the GitHub repository.

    We also have other code bundles from our rich catalog of books and videos available at Check them out!

    Conventions used

    There are a number of text conventions used throughout this book.

    Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: Since it’s possible to provide a base estimator to BaggingClassifier, let’s use DecisionTreeClassifier with the maximum depth of the trees being 6.

    A block of code is set as follows:

    from collections import Counter
    X, y = make_data(sep=2)
    print(y.value_counts())
    sns.scatterplot(data=X, x=feature_1, y=feature_2)
    plt.title('Separation: {}'.format(separation))

    Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: True Negative Rate (TNR): TNR measures the proportion of actual negatives that are correctly identified as such.

    Tips or important notes

    Appear like this.

    Get in touch

    Feedback from our readers is always welcome.

    General feedback: If you have questions about any aspect of this book, email us at and mention the book title in the subject of your message.

    Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit and fill in the form.

    Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at with a link to the material.

    If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit



    Introduction to Data Imbalance in Machine Learning

    Machine learning algorithms have helped solve real-world problems as diverse as disease prediction and online shopping. However, many problems we would like to address with machine learning involve imbalanced datasets. In this chapter, we will define imbalanced datasets and explain how they differ from other types of datasets, using examples of common problems and scenarios to show how widespread imbalanced data is. We will then go through the basics of machine learning, covering essentials such as loss functions, regularization, and feature engineering, and look at common evaluation metrics, particularly those that can be very helpful for imbalanced datasets. Finally, we will introduce the imbalanced-learn library.

    In particular, we will learn about the following topics:

    Introduction to imbalanced datasets

    Machine learning 101

    Types of datasets and splits

    Common evaluation metrics

    Challenges and considerations when dealing with imbalanced data

    When can we have an imbalance in datasets?

    Why can imbalanced data be a challenge?

    When to not worry about data imbalance

    Introduction to the imbalanced-learn library

    General rules to follow

    Technical requirements

    In this chapter, we will utilize common libraries such as numpy and scikit-learn and introduce the imbalanced-learn library. The code and notebooks for this chapter are available on GitHub at You can fire up the GitHub notebook using Google Colab by clicking on the Open in Colab icon at the top of this chapter’s notebook or by launching it from using the GitHub URL of the notebook.

    Introduction to imbalanced datasets

    Machine learning algorithms learn from collections of examples that we call datasets. These datasets contain multiple data samples or points, which we may refer to as examples, samples, or instances interchangeably throughout this book.

    A dataset can be said to have a balanced distribution when all the target classes have a similar number of examples, as shown in Figure 1.1:

    Figure 1.1 – Balanced distribution with an almost equal number of examples for each class
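To make the idea of a balanced distribution concrete, the following minimal sketch counts the examples per class and compares the largest class with the smallest. The function names and the toy label lists are illustrative assumptions rather than code from the book, and "imbalance ratio" is used here simply as the size of the largest class divided by the size of the smallest.

```python
from collections import Counter


def class_distribution(labels):
    """Return the fraction of examples that belongs to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def imbalance_ratio(labels):
    """Return the largest class size divided by the smallest class size."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Toy labels (hypothetical): one balanced dataset and one skewed dataset.
balanced = [0] * 50 + [1] * 50
skewed = [0] * 95 + [1] * 5

print(class_distribution(balanced), imbalance_ratio(balanced))
print(class_distribution(skewed), imbalance_ratio(skewed))
```

A ratio close to 1 corresponds to the balanced case in Figure 1.1, while larger values indicate that one class dominates the dataset.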

