Class imbalance (also known as the long-tail problem) refers to classification problems in which the classes are not represented equally, which is quite common in practice: examples include fraud detection, prediction of rare adverse drug reactions, and prediction of gene families. Failing to account for class imbalance often degrades the predictive performance of many classification algorithms. Imbalanced learning aims to tackle this problem and learn an unbiased model from imbalanced data.
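To see why ignoring imbalance is harmful, consider a small toy sketch (synthetic labels, a hypothetical majority-vote "classifier"): plain accuracy looks excellent even though the minority class is never detected.

```python
from collections import Counter

# Synthetic toy labels: 95 negatives (majority) vs. 5 positives (minority).
y_true = [0] * 95 + [1] * 5

# A degenerate "classifier" that always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Recall on the minority class: fraction of positives actually found.
minority_recall = sum(
    t == p == 1 for t, p in zip(y_true, y_pred)
) / Counter(y_true)[1]

print(accuracy)         # 0.95 -- looks great
print(minority_recall)  # 0.0  -- but no minority sample is ever detected
```

This is exactly the bias that the resampling, cost-sensitive, and ensemble methods listed below try to correct.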
Inspired by awesome-machine-learning. In this repository:
Check out Zhining's other open-source projects!
Machine Learning [Awesome]
Self-paced Ensemble [ICDE]
NOTE: written in Python, easy to use.
imbalanced-ensemble is a Python toolbox for quickly implementing and deploying ensemble learning algorithms on class-imbalanced data. It is featured for:
NOTE: written in Python, easy to use.
imbalanced-learn is a Python package offering a number of re-sampling techniques commonly used in datasets showing strong between-class imbalance. It is compatible with scikit-learn and is part of the scikit-learn-contrib projects.
smote_variants [Documentation][Github] - A collection of 85 minority over-sampling techniques for imbalanced learning, with multi-class oversampling and model-selection features (all written in Python, with support for R and Julia as well).
KEEL [Github][Paper] - KEEL provides a simple GUI based on data flow to design experiments with different datasets and computational intelligence algorithms (paying special attention to evolutionary algorithms) in order to assess algorithm behavior. The tool includes many widely used imbalanced learning techniques, such as (evolutionary) over-/under-sampling, cost-sensitive learning, algorithm modification, and ensemble learning methods.
NOTE: wide variety of classical classification, regression, and preprocessing algorithms included.
Learning from imbalanced data (IEEE TKDE, 2009, 6000+ citations) [Paper]
Learning from imbalanced data: open challenges and future directions (2016, 900+ citations) [Paper]
Learning from class-imbalanced data: Review of methods and applications (2017, 900+ citations) [Paper]
NOTE: versatile solution with outstanding performance and computational efficiency.
NOTE: learning an optimal sampling policy directly from data.
Exploratory Undersampling for Class-Imbalance Learning (IEEE Trans. on SMC, 2008, 1300+ citations) [Paper]
NOTE: simple but effective solution.
DataBoost (2004, 570+ citations) [Paper] - Boosting with Data Generation for Imbalanced Data
MSMOTEBoost (2011, 1300+ citations) [Paper] - Modified Synthetic Minority Over-sampling TEchnique Boosting
AdaBoostNC (2012, 350+ citations) [Paper] - Adaptive Boosting with Negative Correlation Learning
EUSBoost (2013, 210+ citations) [Paper] - Evolutionary Under-sampling in Boosting
Diversity Analysis on Imbalanced Data Sets by Using Ensemble Models (2009, 400+ citations) [Paper]
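The exploratory undersampling idea above (EasyEnsemble) trains each base learner on all minority samples plus an equally sized random draw from the majority class, then averages the ensemble's votes. A minimal sketch with numpy and scikit-learn decision trees; the function names and parameters are illustrative, not the paper's reference implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def easy_ensemble_fit(X, y, n_estimators=10, seed=0):
    """Train one classifier per balanced subset (all minority samples
    plus an equally sized random draw from the majority class)."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    models = []
    for _ in range(n_estimators):
        picked = rng.choice(majority, size=minority.size, replace=False)
        idx = np.concatenate([minority, picked])
        models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return models

def easy_ensemble_predict(models, X):
    # Average the per-model minority-class probabilities, threshold at 0.5.
    probas = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
    return (probas >= 0.5).astype(int)
```

Unlike a single round of random undersampling, the ensemble sees (almost) every majority sample across its subsets, which is why the paper calls the strategy "exploratory".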
ROS [Code] - Random Over-sampling
NOTE: See more over-sampling methods at smote-variants.
RUS [Code] - Random Under-sampling
EUS (2009, 290+ citations) [Paper] - Evolutionary Under-sampling
A Study of the Behavior of Several Methods for Balancing Training Data (2004, 2000+ citations) [Paper]
NOTE: extensive experimental evaluation involving 10 different over/under-sampling methods.
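Random over-sampling (ROS) and random under-sampling (RUS) listed above amount to simple index resampling. A minimal numpy sketch, assuming binary labels with 1 as the minority class (function names are illustrative):

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """ROS: duplicate random minority samples until classes are balanced."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    extra = rng.choice(minority, size=majority.size - minority.size, replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]

def random_undersample(X, y, seed=0):
    """RUS: discard random majority samples until classes are balanced."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    kept = rng.choice(majority, size=minority.size, replace=False)
    idx = np.concatenate([kept, minority])
    return X[idx], y[idx]
```

ROS risks overfitting to duplicated minority points, while RUS discards potentially useful majority information; the more advanced samplers above (e.g. SMOTE variants, EUS) were designed to mitigate exactly these trade-offs.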
A systematic study of the class imbalance problem in convolutional neural networks (2018, 330+ citations) [Paper]
Survey on deep learning with class imbalance (2019, 50+ citations) [Paper]
NOTE: a recent comprehensive survey of the class imbalance problem in deep learning.
Focal loss for dense object detection (ICCV 2017, 2600+ citations) [Paper][Code (detectron2)][Code (unofficial)] - A uniform loss function that focuses training on a sparse set of hard examples to prevent the vast number of easy negatives from overwhelming the detector during training.
NOTE: elegant solution, high influence.
Training deep neural networks on imbalanced data sets (IJCNN 2016, 110+ citations) [Paper] - Mean (square) false error that can equally capture classification errors from both the majority class and the minority class.
Imbalanced deep learning by minority class incremental rectification (TPAMI 2018, 60+ citations) [Paper] - Class Rectification Loss for minimizing the dominant effect of majority classes by discovering sparsely sampled boundaries of minority classes in an iterative batch-wise learning process.
Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss (NIPS 2019, 10+ citations) [Paper][Code] - A theoretically-principled label-distribution-aware margin (LDAM) loss motivated by minimizing a margin-based generalization bound.
Gradient harmonized single-stage detector (AAAI 2019, 40+ citations) [Paper][Code] - Compared to Focal Loss, which only down-weights "easy" negative examples, GHM also down-weights "very hard" examples as they are likely to be outliers.
AutoBalance: Optimized Loss Functions for Imbalanced Data (NeurIPS 2021) [Paper]
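Among the losses above, focal loss is the most widely reused: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), which reduces to (alpha-weighted) cross-entropy when gamma = 0. A minimal numpy sketch for binary labels, with defaults following the paper's alpha = 0.25, gamma = 2:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss: p is the predicted P(y=1), y is in {0, 1}.
    The (1 - p_t)**gamma factor down-weights easy, well-classified
    examples so the rare hard ones dominate the gradient."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)            # prob. of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)
```

For a confidently correct prediction (p_t = 0.9) the modulating factor is (1 - 0.9)^2 = 0.01, so its loss is scaled down by 100x relative to plain cross-entropy, while a hard example (p_t = 0.1) keeps most of its loss.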
Learning to model the tail (NIPS 2017, 70+ citations) [Paper] - Transfer meta-knowledge from the data-rich classes in the head of the distribution to the data-poor classes in the tail.
NOTE: representative work to solve the class imbalance problem through meta-learning.
Meta-weight-net: Learning an explicit mapping for sample weighting (NIPS 2019) [Paper][Code] - Explicitly learn a weight function (with an MLP as the function approximator) to reweight the samples in gradient updates of DNN.
NOTE: meta-learning-powered ensemble learning.
Learning deep representation for imbalanced classification (CVPR 2016, 220+ citations) [Paper]
Supervised Class Distribution Learning for GANs-Based Imbalanced Classification (ICDM 2019) [Paper]
NOTE: interesting findings on representation learning and classifier learning.
Supercharging Imbalanced Data Learning With Energy-based Contrastive Representation Transfer (NeurIPS 2021) [Paper]
NOTE: semi-supervised training / self-supervised pre-training helps imbalanced learning.
Improving Contrastive Learning on Imbalanced Data via Open-World Sampling (NeurIPS 2021) [Paper]
Pre-train on a balanced dataset, then fine-tune the last output layer before the softmax on the original, imbalanced data.
Class-Imbalanced Deep Learning via a Class-Balanced Ensemble (TNNLS 2021) [Paper]
One-class SVMs for document classification (JMLR, 2001, 1300+ citations) [Paper]
One-class Collaborative Filtering (ICDM 2008, 1000+ citations) [Paper]
Isolation Forest (ICDM 2008, 1000+ citations) [Paper]
Anomaly Detection using One-Class Neural Networks (2018, 200+ citations) [Paper]
Anomaly Detection with Robust Deep Autoencoders (KDD 2017, 170+ citations) [Paper]
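The anomaly-detection view above recasts extreme imbalance as detecting the rare class as anomalies. A hedged sketch using scikit-learn's IsolationForest on synthetic data (the cluster locations and parameters are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 200 "normal" points plus 5 extreme outliers standing in for a rare class.
X_normal = rng.normal(0.0, 1.0, size=(200, 2))
X_outlier = rng.normal(8.0, 0.5, size=(5, 2))
X = np.vstack([X_normal, X_outlier])

# Isolation Forest isolates points via random splits; anomalies need
# fewer splits to isolate and therefore get lower scores.
clf = IsolationForest(n_estimators=100, contamination=0.05, random_state=0).fit(X)
pred = clf.predict(X)  # +1 for inliers, -1 for flagged anomalies
```

Lower values of `clf.score_samples(X)` indicate more anomalous points; note that, unlike the classifiers above, this formulation needs no labels for the rare class at training time.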
This collection of datasets is from
| ID | Name | Repository & Target | Ratio | #S | #F |
|----|------|---------------------|-------|-----|-----|
| 1 | ecoli | UCI, target: imU | 8.6:1 | 336 | 7 |
| 2 | optical_digits | UCI, target: 8 | 9.1:1 | 5,620 | 64 |
| 3 | satimage | UCI, target: 4 | 9.3:1 | 6,435 | 36 |
| 4 | pen_digits | UCI, target: 5 | 9.4:1 | 10,992 | 16 |
| 5 | abalone | UCI, target: 7 | 9.7:1 | 4,177 | 10 |
| 6 | sick_euthyroid | UCI, target: sick euthyroid | 9.8:1 | 3,163 | 42 |
| 7 | spectrometer | UCI, target: >=44 | 11:1 | 531 | 93 |
| 8 | car_eval_34 | UCI, target: good, v good | 12:1 | 1,728 | 21 |
| 9 | isolet | UCI, target: A, B | 12:1 | 7,797 | 617 |
| 10 | us_crime | UCI, target: >0.65 | 12:1 | 1,994 | 100 |
| 11 | yeast_ml8 | LIBSVM, target: 8 | 13:1 | 2,417 | 103 |
| 12 | scene | LIBSVM, target: >one label | 13:1 | 2,407 | 294 |
| 13 | libras_move | UCI, target: 1 | 14:1 | 360 | 90 |
| 14 | thyroid_sick | UCI, target: sick | 15:1 | 3,772 | 52 |
| 15 | coil_2000 | KDD, CoIL, target: minority | 16:1 | 9,822 | 85 |
| 16 | arrhythmia | UCI, target: 06 | 17:1 | 452 | 278 |
| 17 | solar_flare_m0 | UCI, target: M->0 | 19:1 | 1,389 | 32 |
| 18 | oil | UCI, target: minority | 22:1 | 937 | 49 |
| 19 | car_eval_4 | UCI, target: vgood | 26:1 | 1,728 | 21 |
| 20 | wine_quality | UCI, wine, target: <=4 | 26:1 | 4,898 | 11 |
| 21 | letter_img | UCI, target: Z | 26:1 | 20,000 | 16 |
| 22 | yeast_me2 | UCI, target: ME2 | 28:1 | 1,484 | 8 |
| 23 | webpage | LIBSVM, w7a, target: minority | 33:1 | 34,780 | 300 |
| 24 | ozone_level | UCI, ozone, data | 34:1 | 2,536 | 72 |
| 25 | mammography | UCI, target: minority | 42:1 | 11,183 | 6 |
| 26 | protein_homo | KDD CUP 2004, minority | 111:1 | 145,751 | 74 |
| 27 | abalone_19 | UCI, target: 19 | 130:1 | 4,177 | 10 |
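The Ratio column in the table above is the majority-to-minority class-size ratio. For any label vector it can be computed with a plain-Python sketch:

```python
from collections import Counter

def imbalance_ratio(y):
    """Majority:minority class-size ratio, as in the table's Ratio column."""
    counts = Counter(y)
    return max(counts.values()) / min(counts.values())

# e.g. 86 negatives vs. 10 positives gives 8.6:1, like the ecoli row above
print(imbalance_ratio([0] * 86 + [1] * 10))  # 8.6
```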
imbalanced-algorithms - Python-based implementations of algorithms for learning on imbalanced data.
imbalanced-dataset-sampler - A (PyTorch) imbalanced dataset sampler for oversampling low frequent classes and undersampling high frequent ones.
class_imbalance - Jupyter Notebook presentation for class imbalance in binary classification.
Multi-class-with-imbalanced-dataset-classification - Perform multi-class classification on imbalanced 20-news-group dataset.
Advanced Machine Learning with scikit-learn: Imbalanced classification and text data - Different approaches to feature selection, and resampling methods for imbalanced data.
Paper-list-on-Imbalanced-Time-series-Classification-with-Deep-Learning - Imbalanced Time-series Classification
Thanks goes to these wonderful people (emoji key):
This project follows the all-contributors specification. Contributions of any kind welcome!