Machine Learning for Imbalanced Data

This is the code repository for Machine Learning for Imbalanced Data, published by Packt.

Tackle imbalanced datasets using machine learning and deep learning techniques

What is this book about?

As machine learning practitioners, we often encounter imbalanced datasets in which one class has considerably fewer instances than the other. Many machine learning algorithms assume a roughly balanced class distribution, which leads to suboptimal performance on imbalanced data. This guide helps you address class imbalance to improve model performance.
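To make the problem concrete, here is a minimal sketch (not taken from the book) of why plain accuracy can mislead on imbalanced data. The 990/10 label split and the always-predict-majority "model" are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy labels: 990 negatives (majority) and 10 positives (minority).
y_true = [0] * 990 + [1] * 10

# A trivial "model" that always predicts the majority class.
majority_class = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_class] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true positives that the model actually found."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(f"accuracy={acc:.2f}, minority recall={rec:.2f}")
```

The classifier scores 99% accuracy while finding none of the minority instances, which is why metrics such as recall are usually reported alongside accuracy when classes are imbalanced.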

This book covers the following exciting features:

  • Use imbalanced data in your machine learning models effectively
  • Explore the metrics used when classes are imbalanced
  • Understand how and when to apply various sampling methods such as over-sampling and under-sampling
  • Apply data-based, algorithm-based, and hybrid approaches to deal with class imbalance
  • Combine and choose from various options for data balancing while avoiding common pitfalls
  • Understand the concepts of model calibration and threshold adjustment in the context of dealing with imbalanced datasets
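As a rough illustration of the over-sampling idea mentioned in the list above, the following sketch duplicates randomly chosen minority rows until every class matches the largest one. The function name `random_oversample` and the toy data are assumptions made for this example; this is not the book's implementation, and the imbalanced-learn library used in the notebooks offers its own samplers.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes
    have as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lbl in zip(X, y) if lbl == label]
        # Sample with replacement to fill the gap up to the target size.
        for _ in range(target - count):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


# Toy data: 5 minority rows (label 1) and 95 majority rows (label 0).
X = [[i, i % 7] for i in range(100)]
y = [1 if i < 5 else 0 for i in range(100)]

X_res, y_res = random_oversample(X, y)
print(Counter(y), "->", Counter(y_res))
```

Resampling of this kind is normally applied to the training split only, so that duplicated rows do not leak into validation or test data.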

If you feel this book is for you, get your copy today!

Download a free PDF

If you have already purchased a print or Kindle version of this book, you can get a DRM-free PDF version at no cost.
Please go to this link to claim your free PDF.

Instructions and Navigations

All of the code is organized into folders.

The code will look like the following:

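import matplotlib.pyplot as plt
import seaborn as sns

# make_data is assumed to be a helper defined in the accompanying notebook;
# it returns a feature DataFrame X and a label Series y.
separation = 2
X, y = make_data(sep=separation)
print(y.value_counts())  # class distribution
sns.scatterplot(data=X, x="feature_1", y="feature_2")
plt.title('Separation: {}'.format(separation))
plt.show()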

Following is what you need for this book: This book is for machine learning practitioners who want to effectively address the challenges of imbalanced datasets in their projects. Machine learning engineers/scientists, research scientists/engineers, and data scientists/engineers will find this book helpful. Though complete beginners are welcome to read this book, some familiarity with core machine learning concepts will help readers get the most out of it.

With the following software and hardware list, you can run all code files present in the book (Chapters 1-10).

Software and Hardware List

| Chapter | Software required | OS required |
| ------- | ----------------- | ----------- |
| 1-10 | Google Colab | Any OS |

Questions or Feedback

If you have any questions or feedback, please feel free to use the Discussions tab of this repository. You can start a new discussion under an appropriate category.

Get to Know the Authors

Kumar Abhishek is a seasoned Senior Machine Learning Engineer, specializing in risk analysis and fraud detection. With over a decade of experience at companies such as Expedia, Microsoft, Amazon, and a Bay Area startup, Kumar holds an MS in Computer Science from the University of Florida.

Dr. Mounir Abdelaziz is a deep learning researcher specializing in computer vision applications. He holds a Ph.D. in computer science and technology from Central South University, China. During his Ph.D. journey, he developed innovative algorithms to address practical computer vision challenges. He has also authored numerous research articles in the field of few-shot learning for image classification.

Table of Contents and Code Notebooks

  1. Introduction to Data Imbalance in Machine Learning [open dir]
  2. Oversampling Methods [open dir]
  3. Undersampling Methods [open dir]
  4. Ensemble Methods [open dir]
  5. Cost-Sensitive Learning [open dir]
  6. Data Imbalance in Deep Learning [open dir]
  7. Data-Level Deep Learning Methods [open dir]
  8. Algorithm-Level Deep Learning Techniques [open dir]
  9. Hybrid Deep Learning Methods [open dir]
  10. Model Calibration [open dir]

Detailed Table of Contents

List of notebooks

| Notebook ID | Description | Link |
| ----------- | ----------- | ---- |
| Notebook 1.1 | Imbalanced-learn demo | ipynb/colab |
| Notebook 2.1 | Oversampling techniques | ipynb/colab |
| Notebook 2.2 | Oversampling performance | ipynb/colab |
| Notebook 2.3 | SMOTE problems | ipynb/colab |
| Notebook 3.1 | Various undersampling techniques | ipynb/colab |
| Notebook 3.2 | Undersampling performance | ipynb/colab |
| Notebook 4.1 | Ensemble techniques overview | ipynb/colab |
| Notebook 4.2 | Ensembling methods performance | ipynb/colab |
| Notebook 5.1 | Class weight with Sklearn/XGBoost | ipynb/colab |
| Notebook 5.2 | Threshold tuning techniques | ipynb/colab |
| Notebook 6.1 | Simple neural network | ipynb/colab |
| Notebook 6.2 | Multi-class classification | ipynb/colab |
| Notebook 7.1 | Augmix on FashionMNIST | ipynb/colab |
| Notebook 7.2 | Cutmix, Mixup, Remix on FashionMNIST | ipynb/colab |
| Notebook 7.3 | NLP data-level techniques | ipynb/colab |
| Notebook 7.4 | Dynamic sampling | ipynb/colab |
| Notebook 7.5 | VAE with MNIST | ipynb/colab |
| Notebook 7.6 | Cutmix technique | ipynb/colab |
| Notebook 7.7 | Cutout technique | ipynb/colab |
| Notebook 7.8 | Mixup technique | ipynb/colab |
| Notebook 7.9 | Data transformation plotting | ipynb/colab |
| Notebook 8.1 | CIFAR10 focal loss | ipynb/colab |
| Notebook 8.2 | CDT loss implementation | ipynb/colab |
| Notebook 8.3 | Class balanced loss | ipynb/colab |
| Notebook 8.4 | Class-wise difficulty balanced loss | ipynb/colab |
| Notebook 8.5 | DRW technique | ipynb/colab |
| Notebook 8.6 | Tweet emotion detection | ipynb/colab |
| Notebook 8.7 | PyTorch class weighting | ipynb/colab |
| Notebook 9.1 | GNN demo | ipynb/colab |
| Notebook 9.2 | OHEM technique | ipynb/colab |
| Notebook 9.3 | Class rectification loss | ipynb/colab |
| Notebook 10.1 | Calibration techniques | ipynb/colab |
| Notebook 10.2 | Sampling/weighting impact on calibration | ipynb/colab |
| Notebook 10.3 | Imbalance handling impact on calibration | ipynb/colab |
| Notebook 10.4 | Kaggle HR data calibration | ipynb/colab |
| Notebook 10.5 | Platt's scaling and isotonic regression | ipynb/colab |



Kumar Abhishek, Dr. Mounir Abdelaziz, Machine Learning for Imbalanced Data. Packt Publishing, 2023.

    @book{mlimbdata2023,
      title = {Machine Learning for Imbalanced Data},
      author = {Kumar Abhishek and Mounir Abdelaziz},
      year = {2023},
      publisher = {Packt},
      isbn = {9781801070836}
    }