Machine Learning for Imbalanced Data: Tackle imbalanced datasets using machine learning and deep learning techniques
About this ebook
As machine learning practitioners, we often encounter imbalanced datasets in which one class has considerably fewer instances than the other. Many machine learning algorithms assume an equilibrium between majority and minority classes, leading to suboptimal performance on imbalanced data. This comprehensive guide helps you address this class imbalance to significantly improve model performance.
Machine Learning for Imbalanced Data begins by introducing you to the challenges posed by imbalanced datasets and the importance of addressing these issues. It then guides you through techniques that enhance the performance of classical machine learning models when using imbalanced data, including various sampling and cost-sensitive learning methods.
As you progress, you’ll delve into similar and more advanced techniques for deep learning models, employing PyTorch as the primary framework. Throughout the book, hands-on examples will provide working and reproducible code that’ll demonstrate the practical implementation of each technique.
By the end of this book, you’ll be adept at identifying and addressing class imbalances and confidently applying various techniques, including sampling, cost-sensitive techniques, and threshold adjustment, while using traditional machine learning or deep learning models.
Abhishek Kumar
Dr. Abhishek Kumar is a post-doctorate fellow in computer science at Ingenium Research Group, based at Universidad De Castilla-La Mancha in Spain. He has been teaching in academia for more than 8 years, and published more than 50 articles in reputed, peer reviewed national and international journals, books, and conferences. His research area includes artificial intelligence, image processing, computer vision, data mining, and machine learning.
Machine Learning for Imbalanced Data
Copyright © 2023 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Niranjan Naikwadi
Publishing Product Manager: Sanjana Gupta
Book Project Manager: Kirti Pisat
Senior Editor: Rohit Singh
Technical Editor: Rahul Limbachiya
Copy Editor: Safis Editing
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Production Designer: Nilesh Mohite
DevRel Marketing Coordinator: Vinishka Kalra
First published: November 2023
Production reference: 2221123
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK.
ISBN 978-1-80107-083-6
www.packtpub.com
Contributors
About the authors
Kumar Abhishek is a seasoned senior machine learning engineer at Expedia Group, US, specializing in risk analysis and fraud detection. With over a decade of machine learning and software engineering experience, Kumar has worked for companies such as Microsoft, Amazon, and a Bay Area start-up. Kumar holds a master’s degree in computer science from the University of Florida, Gainesville.
To my incredible wife who has been my rock and constant source of inspiration, our adorable son who fills our lives with joy, my wonderful parents for their unwavering support, and my close friends. Immense thanks to Christian, who has been a pivotal mentor and guide, for his meticulous reviews. My deepest gratitude to my co-author, Mounir, and contributor, Anshul; their dedication and solid contributions were essential in shaping this book. Lastly, I extend my sincere appreciation to Abhiram and the Packt team for their unwavering support.
Dr. Mounir Abdelaziz is a deep learning researcher specializing in computer vision applications. He holds a Ph.D. in computer science and technology from Central South University, China. During his Ph.D. journey, he developed innovative algorithms to address practical computer vision challenges. He has also authored numerous research articles in the field of few-shot learning for image classification.
I would like to thank my family, especially my parents, for their support and encouragement. I also want to thank all the fantastic people I collaborated with, including my co-author, Packt editors, and reviewers. Without their help, writing this book wouldn’t have been possible.
Other contributor
Anshul Yadav is a software developer and trainer with a keen interest in machine learning, web development, and theoretical computer science. He likes to solve technical problems: the slinkier, the better. He has a B.Tech. degree in computer science and engineering from IIT Kanpur. Anshul loves to share the joy of learning with his audience.
About the reviewers
Christian Monson has nine years of industry experience working as a machine learning scientist specializing in Natural Language Processing (NLP) and speech recognition. For five of those years, he worked at Amazon improving the Alexa personal assistant. During the 2000s, he was a graduate student at Carnegie Mellon University and a postdoc at Oregon Health and Science University working on NLP. Christian completed his bachelor’s degree in computer science, with minors in math and physics, at Brigham Young University in 2000. In his free time, Christian creates video games and plays with his kids. Currently, he is a full-time tutor and mentor in machine learning. You can find Christian at or watch his videos at .
Abhiram Jagarlapudi is a principal software engineer with 10 years of experience in cloud computing and Artificial Intelligence (AI). At Amazon Web Services and Oracle Cloud, Abhiram was part of launching several public cloud services, later specializing in cloud AI services. He was part of a small team that built the software delivery infrastructure of Oracle Cloud, which started in 2016 and has since grown into a multi-billion-dollar business. He also designed and developed AI services for the Oracle Cloud and is passionate about applying that experience to improve and accelerate the delivery of machine learning.
Table of Contents
Preface
1
Introduction to Data Imbalance in Machine Learning
Technical requirements
Introduction to imbalanced datasets
Machine learning 101
What happens during model training?
Types of dataset and splits
Cross-validation
Common evaluation metrics
Confusion matrix
ROC
Precision-Recall curve
Relation between the ROC curve and PR curve
Challenges and considerations when dealing with imbalanced data
When can we have an imbalance in datasets?
Why can imbalanced data be a challenge?
When to not worry about data imbalance
Introduction to the imbalanced-learn library
General rules to follow
Summary
Questions
References
2
Oversampling Methods
Technical requirements
What is oversampling?
Random oversampling
Problems with random oversampling
SMOTE
How SMOTE works
Problems with SMOTE
SMOTE variants
Borderline-SMOTE
ADASYN
Working of ADASYN
Categorical features and SMOTE variants (SMOTE-NC and SMOTEN)
Model performance comparison of various oversampling methods
Guidance for using various oversampling techniques
When to avoid oversampling
Oversampling in multi-class classification
Summary
Exercises
References
3
Undersampling Methods
Technical requirements
Introducing undersampling
When to avoid undersampling the majority class
Fixed versus cleaning undersampling
Undersampling approaches
Removing examples uniformly
Random UnderSampling
ClusterCentroids
Strategies for removing noisy observations
ENN, RENN, and AllKNN
Tomek links
Neighborhood Cleaning Rule
Instance hardness threshold
Strategies for removing easy observations
Condensed Nearest Neighbors
One-sided selection
Combining undersampling and oversampling
Model performance comparison
Summary
Exercises
References
4
Ensemble Methods
Technical requirements
Bagging techniques for imbalanced data
UnderBagging
OverBagging
SMOTEBagging
Comparative performance of bagging methods
Boosting techniques for imbalanced data
AdaBoost
RUSBoost, SMOTEBoost, and RAMOBoost
Ensemble of ensembles
EasyEnsemble
Comparative performance of boosting methods
Model performance comparison
Summary
Questions
References
5
Cost-Sensitive Learning
Technical requirements
The concept of Cost-Sensitive Learning
Costs and cost functions
Types of cost-sensitive learning
Difference between CSL and resampling
Problems with rebalancing techniques
Understanding costs in practice
Cost-Sensitive Learning for logistic regression
Cost-Sensitive Learning for decision trees
Cost-Sensitive Learning using scikit-learn and XGBoost models
MetaCost – making any classification model cost-sensitive
Threshold adjustment
Methods for threshold tuning
Summary
Questions
References
6
Data Imbalance in Deep Learning
Technical requirements
A brief introduction to deep learning
Neural networks
Perceptron
Activation functions
Layers
Feedforward neural networks
Training neural networks
The effect of the learning rate on data imbalance
Image processing using Convolutional Neural Networks
Text analysis using Natural Language Processing
Data imbalance in deep learning
The impact of data imbalance on deep learning models
Overview of deep learning techniques to handle data imbalance
Multi-label classification
Summary
Questions
References
7
Data-Level Deep Learning Methods
Technical requirements
Preparing the data
Creating the training loop
Sampling techniques for deep learning models
Random oversampling
Dynamic sampling
Data augmentation techniques for vision
Data-level techniques for text classification
Dataset and baseline model
Document-level augmentation
Character and word-level augmentation
Discussion of other data-level deep learning methods and their key ideas
Two-phase learning
Expansive Over-Sampling
Using generative models for oversampling
DeepSMOTE
Neural style transfer
Summary
Questions
References
8
Algorithm-Level Deep Learning Techniques
Technical requirements
Motivation for algorithm-level techniques
Weighting techniques
Using PyTorch’s weight parameter
Handling textual data
Deferred re-weighting – a minor variant of the class weighting technique
Explicit loss function modification
Focal loss
Class-balanced loss
Class-dependent temperature Loss
Class-wise difficulty-balanced loss
Discussing other algorithm-based techniques
Regularization techniques
Siamese networks
Deeper neural networks
Threshold adjustment
Summary
Questions
References
9
Hybrid Deep Learning Methods
Technical requirements
Using graph machine learning for imbalanced data
Understanding graphs
Graph machine learning
Dealing with imbalanced data
Case study – the performance of XGBoost, MLP, and a GCN on an imbalanced dataset
Hard example mining
Online Hard Example Mining
Minority class incremental rectification
Utilizing the hard sample mining technique in minority class incremental rectification
Summary
Questions
References
10
Model Calibration
Technical requirements
Introduction to model calibration
Why bother with model calibration
Models with and without well-calibrated probabilities
Calibration curves or reliability plot
Brier score
Expected Calibration Error
The influence of data balancing techniques on model calibration
Plotting calibration curves for a model trained on a real-world dataset
Model calibration techniques
The calibration of model scores to account for sampling
Platt’s scaling
Isotonic regression
Choosing between Platt’s scaling and Isotonic regression
Temperature scaling
Label smoothing
The impact of calibration on a model’s performance
Summary
Questions
References
Appendix
Machine Learning Pipeline in Production
Machine learning training pipeline
Inferencing (online or batch)
Assessments
Chapter 1 – Introduction to Data Imbalance in Machine Learning
Chapter 2 – Oversampling Methods
Chapter 3 – Undersampling Methods
Chapter 4 – Ensemble Methods
Chapter 5 – Cost-Sensitive Learning
Chapter 6 – Data Imbalance in Deep Learning
Chapter 7 – Data-Level Deep Learning Methods
Chapter 8 – Algorithm-Level Deep Learning Techniques
Chapter 9 – Hybrid Deep Learning Methods
Chapter 10 – Model Calibration
Index
Other Books You May Enjoy
Preface
Hello and welcome! Machine Learning (ML) enables computers to learn from data using algorithms to make informed decisions, automate tasks, and extract valuable insights. One particular aspect that often garners attention is imbalanced data, where certain classes may have considerably fewer samples than others.
This book provides an in-depth guide to understanding and navigating the intricacies of skewed data. You will gain insights into best practices for managing imbalanced datasets in ML contexts.
While imbalanced data can present challenges, it’s important to understand that the techniques to address this imbalance are not universally applicable. Their relevance and necessity depend on various factors such as the domain, the data distribution, the performance metrics you’re optimizing, and the business objectives. Before adopting any techniques, it’s essential to establish a baseline. Even if you don’t currently face issues with imbalanced data, it can be beneficial to be aware of the challenges and solutions discussed in this book. Familiarizing yourself with these techniques will provide you with a comprehensive toolkit, preparing you for scenarios that you may not yet know you’ll encounter. If you do find that model performance is lacking, especially for underrepresented (minority) classes, the insights and strategies covered in the book can be instrumental in guiding effective improvements.
As the domains of ML and artificial intelligence continue to grow, there will be an increasing demand for professionals who can adeptly handle various data challenges, including imbalance. This book aims to equip you with the knowledge and tools to be one of those sought-after experts.
Who this book is for
This comprehensive book is thoughtfully tailored to meet the needs of a variety of professionals, including the following:
ML researchers, ML scientists, ML engineers, and students: Professionals and learners in the fields of ML and deep learning who seek to gain valuable insights and practical knowledge for tackling the challenges posed by data imbalance
Data scientists and analysts: Experienced data experts eager to expand their knowledge of handling skewed data with practical, real-world solutions
Software engineers: Software engineers who want to effectively integrate ML and deep learning solutions into their applications when dealing with imbalanced data
Practical insight seekers: Professionals and enthusiasts from various backgrounds who want to use hands-on, industry-relevant approaches for efficiently dealing with data imbalance in ML and deep learning, enabling them to excel in their respective roles
What this book covers
Chapter 1, Introduction to Data Imbalance in Machine Learning, serves as an exploration of data imbalance within the context of ML. This chapter elucidates the nature of imbalanced data, distinguishing it from other dataset types. It also provides a comprehensive introduction to the essential components of ML and the model performance metrics most relevant when the data is imbalanced. The chapter looks into the issues and concerns involved in dealing with imbalanced data, explaining when it can occur and why it can sometimes be a challenge. More importantly, we will go over when data imbalance is not worth worrying about at all. Furthermore, the chapter introduces the imbalanced-learn library and offers general guidelines for dealing with imbalanced datasets effectively.
Chapter 2, Oversampling Methods, introduces the concept of oversampling, outlining when to employ it and when not to, and various techniques to augment imbalanced datasets. It guides you through the practical application of these techniques using the imbalanced-learn library and compares their performance across classical ML models. Practical advice on the effectiveness of these techniques in real-world scenarios concludes the chapter.
Chapter 3, Undersampling Methods, presents the concept of undersampling as an effective approach for data balancing when standard oversampling isn’t an option. This chapter covers strategies to effectively remove examples from imbalanced data, different ways of addressing noisy observations, and procedures for handling easily categorized instances. We will also discuss when to avoid undersampling of the majority class.
Chapter 4, Ensemble Methods, explores the application of ensemble techniques, including bagging and boosting, to enhance the performance of ML models. Moreover, it tackles the challenge of imbalanced datasets, where traditional ensemble methods may be ineffective, by combining the ensemble methods with the techniques introduced in previous chapters.
Chapter 5, Cost-Sensitive Learning, explores some alternatives to sampling techniques, including oversampling and undersampling. This chapter highlights the significance of cost-sensitive learning as an effective strategy to overcome the problem of imbalanced datasets. We also discuss threshold-tuning techniques, which can be very relevant in the context of data imbalance.
Chapter 6, Data Imbalance in Deep Learning, presents the core concepts of deep learning and walks through the issues posed by imbalanced datasets. You will investigate typical types of imbalanced data challenges in various deep learning applications and develop an understanding of their impact.
Chapter 7, Data-Level Deep Learning Methods, marks a transition from classical ML to deep learning, exploring the adaptation of familiar data-level sampling techniques and unveiling opportunities for enhancing these methods in the context of deep learning models. It dives into combining deep learning with oversampling and undersampling techniques, covering dynamic sampling and data augmentation for images and text. It emphasizes the fundamental differences between deep learning and classical ML, particularly the nature of the data they handle: classical ML typically works with structured (tabular) data, whereas deep learning deals with unstructured data such as images, text, audio, and video. The chapter also explores techniques to address class imbalance in computer vision and their applicability to Natural Language Processing (NLP) problems.
Chapter 8, Algorithm-Level Deep Learning Techniques, expands on the concepts from Chapter 5, Cost-Sensitive Learning, and applies them to deep learning models. We adapt deep learning models through loss function modifications using the PyTorch deep learning framework, ultimately enhancing model performance and enabling more effective predictions.
Chapter 9, Hybrid Deep Learning Methods, explores innovative techniques that bridge the gap between data-level and algorithm-level methods from the previous two chapters. This chapter introduces the concept of graph ML and employs a real-world Facebook social network dataset to provide valuable insights and practical applications for addressing data imbalance in deep learning. We will also introduce the concept of hard mining loss and build upon it to explore a specialized technique called minority class incremental rectification, which combines hard mining with cross-entropy loss.
Chapter 10, Model Calibration, takes a different angle on addressing data imbalance. Rather than focusing on data preprocessing or model building, this chapter highlights the post-processing of prediction scores obtained from trained models. Such post-processing can be valuable for both real-time predictions and offline model evaluation. The chapter offers insights into measuring the calibration of a model and explains why this aspect can be indispensable when dealing with imbalanced data. This is particularly important since data balancing techniques can often lead to model miscalibration.
Appendix, Machine Learning Pipeline in Production, offers a foundational guide to constructing ML pipelines in production environments that encounter imbalanced data. This appendix provides a brief roadmap, going over the sequence and stage at which techniques for addressing data imbalance should be integrated.
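To make the idea of measuring calibration from the Chapter 10 summary a little more concrete, here is a minimal sketch of the Brier score, one of the calibration measures listed in the table of contents. It is the mean squared difference between the predicted probability of the positive class and the actual 0/1 outcome, so lower is better. The function name brier_score and the toy labels and probabilities are made up for illustration; they are not taken from the book's code.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.8, 0.3]

score = brier_score(y_true, y_prob)
print(f"Brier score: {score:.3f}")
```

A model that always predicts 0.5 scores 0.25 on any binary dataset, which gives a rough reference point when reading the number.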
📌 Usage of techniques – In production tips
Throughout this book, you will come across In production tip boxes like the following one, highlighting real-world applications of the techniques discussed:
🚀 Class reweighting in production at OpenAI
OpenAI was trying to solve the problem of bias in training data of the image generation model DALL-E 2 [1]. DALL-E 2 is trained on a massive dataset of images from the internet, which can contain biases. For example, the dataset may contain more images of men than women or more images of people from certain racial or ethnic groups than others.
These snippets offer insights into how well-known companies grappled with data imbalance and what strategies they adopted to effectively navigate these challenges. For instance, the tip on OpenAI’s approach with DALL-E 2 sheds light on the intricate balance between filtering training data and inadvertently amplifying biases. Such examples underscore the importance of being both strategic and cautious when dealing with imbalanced data. To delve deeper into the specifics and understand the nitty-gritty of these implementations, you are encouraged to follow the company blog or paper links provided. These insights can provide a clearer understanding of how to adapt and apply techniques in varied real-world scenarios effectively.
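As a rough illustration of what class reweighting can look like in its simplest form, the sketch below computes inverse-frequency class weights, so that rarer classes receive larger weights. This is a generic heuristic (the same formula scikit-learn uses for its "balanced" class weights) and is not a description of OpenAI's actual procedure. The function name balanced_class_weights and the toy labels are assumptions made for this example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return weight = n_samples / (n_classes * class_count) for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels: 90 examples of class 0 and 10 examples of class 1.
labels = [0] * 90 + [1] * 10

weights = balanced_class_weights(labels)
for cls in sorted(weights):
    print(f"class {cls}: weight {weights[cls]:.3f}")
```

With these weights, every class contributes the same total weight to a weighted loss, regardless of how many examples it has.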
To get the most out of this book
This book assumes some foundational knowledge of ML, deep learning, and Python programming. Some basic working knowledge of scikit-learn and PyTorch can be helpful, although they can be learned on the go.
For the software requirements, you have two options to execute the code provided in this book. You can choose to either run the code within Google Colab online at https://colab.research.google.com/ or download the code to your local computer and execute it there. Google Colab provides a hassle-free option as it comes with all the necessary libraries pre-installed, so you don’t need to install anything on your local machine. All you need is a web browser to access Google Colab and a Google account. If you prefer to work locally, ensure that you have Python (3.6 or higher) installed, as well as the specified libraries such as PyTorch, torchvision, NumPy, and scikit-learn. A list of required libraries can be found in the GitHub repository of the book. These libraries are compatible with Windows, macOS, and Linux operating systems. A modern GPU can speed up the code execution for the deep learning chapters that appear later in the book; however, it’s not mandatory.
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
Regarding references, we use numbered references such as [6], where you can go to the References section at the end of that chapter and download the corresponding reference (paper/blog/article) either using the link (if mentioned) or searching for that reference on Google Scholar (https://scholar.google.com/).
At the conclusion of each chapter, you will find a set of questions designed to test your comprehension of the material covered. We strongly encourage you to engage with these questions to reinforce your learning. Solutions or answers to selected questions can be found in Assessments towards the end of this book.
Download the example code files
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Machine-Learning-for-Imbalanced-Data. If there’s an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: Since it’s possible to provide a base estimator to BaggingClassifier, let’s use DecisionTreeClassifier with the maximum depth of the trees being 6.
A block of code is set as follows:
from collections import Counter
X, y = make_data(sep=2)
print(y.value_counts())
sns.scatterplot(data=X, x="feature_1", y="feature_2")
plt.title('Separation: {}'.format(separation))
plt.show()
Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: True Negative Rate (TNR): TNR measures the proportion of actual negatives that are correctly identified as such.
Tips or important notes
Appear like this.
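The True Negative Rate used as the bold-text example above can be written as TNR = TN / (TN + FP), where TN is the number of true negatives and FP the number of false positives. A minimal sketch in plain Python follows; the function name true_negative_rate, the negative_label parameter, and the toy labels are hypothetical and chosen only for illustration.

```python
def true_negative_rate(y_true, y_pred, negative_label=0):
    """TNR = TN / (TN + FP): the share of actual negatives predicted as negative."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(t == negative_label and p == negative_label for t, p in pairs)
    fp = sum(t == negative_label and p != negative_label for t, p in pairs)
    if tn + fp == 0:
        raise ValueError("y_true contains no actual negatives")
    return tn / (tn + fp)


# Hypothetical labels: four actual negatives, one of which is misclassified.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

tnr = true_negative_rate(y_true, y_pred)
print(f"TNR: {tnr:.2f}")
```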
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at customercare@packtpub.com and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Share Your Thoughts
Once you’ve read Machine Learning for Imbalanced Data, we’d love to hear your thoughts! Please visit the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Download a free PDF copy of this book
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there. You can also get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
Scan the QR code or visit the link below
https://packt.link/free-ebook/9781801070836
Submit your proof of purchase
That’s it! We’ll send your free PDF and other benefits directly to your email.
1
Introduction to Data Imbalance in Machine Learning
Machine learning algorithms have helped solve real-world problems as diverse as disease prediction and online shopping. However, many problems we would like to address with machine learning involve imbalanced datasets. In this chapter, we will define imbalanced datasets and explain how they differ from other types of datasets, using examples of common problems and scenarios to show how ubiquitous imbalanced data is. We will then go through the basics of machine learning, covering essentials such as loss functions, regularization, and feature engineering, and learn about common evaluation metrics, particularly those that are helpful for imbalanced datasets. Finally, we will introduce the imbalanced-learn library.
In particular, we will learn about the following topics:
Introduction to imbalanced datasets
Machine learning 101
Types of datasets and splits
Common evaluation metrics
Challenges and considerations when dealing with imbalanced data
When can we have an imbalance in datasets?
Why can imbalanced data be a challenge?
When to not worry about data imbalance
Introduction to the imbalanced-learn library
General rules to follow
Technical requirements
In this chapter, we will utilize common libraries such as numpy and scikit-learn and introduce the imbalanced-learn library. The code and notebooks for this chapter are available on GitHub at https://github.com/PacktPublishing/Machine-Learning-for-Imbalanced-Data/tree/main/chapter01. You can fire up the GitHub notebook using Google Colab by clicking on the Open in Colab icon at the top of this chapter’s notebook or by launching it from https://colab.research.google.com using the GitHub URL of the notebook.
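Before opening the notebook locally, it can help to confirm that the libraries mentioned above are present. The following is a minimal sketch, not part of the book's code: the helper name installed_versions is hypothetical, and the package names are assumed to be the PyPI distribution names (numpy, scikit-learn, imbalanced-learn).

```python
from importlib import metadata

# Assumed PyPI distribution names for the libraries used in this chapter.
REQUIRED = ["numpy", "scikit-learn", "imbalanced-learn"]


def installed_versions(packages):
    """Return a dict mapping each package to its version, or None if missing.

    Hypothetical helper for illustration only.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


versions = installed_versions(REQUIRED)
for name, version in versions.items():
    status = version if version is not None else f"not installed (try: pip install {name})"
    print(f"{name}: {status}")
```

On Google Colab, numpy and scikit-learn are typically preinstalled, so only missing packages should need a pip install.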
Introduction to imbalanced datasets
Machine learning algorithms learn from collections of examples that we call datasets. These datasets contain multiple data points, which we refer to interchangeably as examples, samples, or instances throughout this book.
A dataset can be said to have a balanced distribution when all the target classes have a similar number of examples, as shown in Figure 1.1:
Figure 1.1 – Balanced distribution with an almost equal number of examples for each class
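To make the idea of a balanced distribution concrete, here is a small sketch that counts the examples per class and compares the largest class to the smallest. The label lists are made-up toy data, and the helpers class_distribution and imbalance_ratio are hypothetical names introduced only for this illustration.

```python
from collections import Counter


def class_distribution(labels):
    """Return the fraction of examples belonging to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def imbalance_ratio(labels):
    """Return (size of largest class) / (size of smallest class).

    A value close to 1 means the classes are roughly balanced.
    """
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Toy labels (assumed for illustration): two classes, 0 and 1.
balanced_labels = [0] * 500 + [1] * 500
skewed_labels = [0] * 950 + [1] * 50

print(class_distribution(balanced_labels), imbalance_ratio(balanced_labels))
print(class_distribution(skewed_labels), imbalance_ratio(skewed_labels))
```

The first list has the same number of examples in each class, so its ratio is 1. The second list has 19 times as many examples of class 0 as of class 1, which is the kind of skew the rest of the chapter is concerned with.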