This post collects and summarizes the methods from the following posts:
- Basic: 8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset by Jason Brownlee
- Basic: In classification, how do you handle an unbalanced training set? from Quora
- Intermediate: Learning Imbalanced Classes from SVDS blog
- Advanced: Classification when 80% of my training set is of one class from Reddit
- Advanced: He2009. It looks like an excellent read.
- Library: scikit-learn-contrib/imbalanced-learn
The Big Picture
First of all, use a proper metric. If the classes are heavily skewed, the precision-recall (PR) curve is more informative than the ROC curve. However, experience has also taught me that if the test set contains too few minority samples, the PR curve can take a zigzag shape, which in turn makes the estimate of the area under the curve numerically unstable.
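The contrast can be illustrated with scikit-learn on a synthetic, heavily skewed test set (the class ratio and score distributions below are illustrative assumptions, not from the source posts):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# A highly skewed test set: 10 positives out of 1000 samples (1% prevalence)
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# Classifier scores: positives score higher on average, with overlap
scores = rng.normal(0, 1, 1000)
scores[:10] += 2

roc = roc_auc_score(y_true, scores)
ap = average_precision_score(y_true, scores)  # area under the PR curve
```

On skewed data like this, the ROC AUC typically looks comfortable while the average precision is far lower, since precision is dragged down by the many negatives; with only 10 positives, the raw PR curve is also visibly jagged.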
The common strategy for imbalanced learning is to resample the dataset, if getting more data is not an option. Resampling can be over-sampling, under-sampling, or over-sampling followed by under-sampling, possibly mixed with ensemble strategies. Over-sampling can be done naively (replicating minority samples) or with a SMOTE-like method that generates synthetic samples. Similarly, under-sampling can be done by simply taking fewer samples from the majority class, or in a more informed way, by looking at the potential boundary between classes and removing majority-class samples near it. Ensemble methods range from simple bagging, a majority vote over several weak classifiers, to much more involved, learning-algorithm-specific strategies. These over-sampling, under-sampling, and ensemble methods have been mixed into various combinations. The above is the mainstream of handling imbalanced datasets.
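The two naive resampling schemes can be sketched with plain numpy (the dataset and 9:1 ratio are made up for illustration; imbalanced-learn's `RandomOverSampler` and `RandomUnderSampler` wrap the same idea):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)  # 9:1 imbalance

min_idx = np.flatnonzero(y == 1)
maj_idx = np.flatnonzero(y == 0)

# Naive over-sampling: replicate minority rows with replacement
over_idx = np.concatenate(
    [maj_idx, rng.choice(min_idx, size=len(maj_idx), replace=True)]
)
X_over, y_over = X[over_idx], y[over_idx]

# Naive under-sampling: keep a random subset of majority rows
under_idx = np.concatenate(
    [rng.choice(maj_idx, size=len(min_idx), replace=False), min_idx]
)
X_under, y_under = X[under_idx], y[under_idx]
```

Both produce a balanced dataset; over-sampling keeps all the information but duplicates minority points, while under-sampling throws majority information away.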
Cost-sensitive learning is another stream for handling imbalanced datasets.
Yet another way to handle imbalanced datasets is to go with more unsupervised learning, for instance, anomaly detection and autoencoders.
- Precision, Recall, F1 score, G-mean
- Kappa (Cohen’s Kappa)
- ROC (the development for multi-class problems is relatively less mature)
Although ROC curves provide powerful methods to visualize performance evaluation, they also have their own limitations. In the case of highly skewed data sets, it is observed that the ROC curve may provide an overly optimistic view of an algorithm’s performance. Under such situations, the PR curves can provide a more informative representation of performance assessment.
Lowers the variance of the minority class.
Cluster-based sampling method (cluster-based oversampling - CBO)
Increases the variance of the majority class.
Cluster majority class
Cluster the large class into N clusters and use the N medoids to represent it, where N is the sample size of the minority class.
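A minimal sketch of this medoid idea, assuming k-means clusters and a synthetic dataset (here the medoid is approximated as the real majority sample closest to each centroid):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_major = rng.normal(size=(200, 2))             # majority class
X_minor = rng.normal(loc=3.0, size=(15, 2))     # minority class

n = len(X_minor)  # N = minority sample size
km = KMeans(n_clusters=n, n_init=10, random_state=0).fit(X_major)

# Medoid: the actual majority sample closest to each cluster centroid
medoid_idx = np.array(
    [np.argmin(np.linalg.norm(X_major - c, axis=1)) for c in km.cluster_centers_]
)
X_major_reduced = X_major[medoid_idx]

# Balanced dataset: N medoids vs N minority samples
X_bal = np.vstack([X_major_reduced, X_minor])
y_bal = np.array([0] * n + [1] * n)
```

Unlike random under-sampling, the medoids are spread to cover the majority distribution rather than drawn blindly.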
Informed undersampling, or ensemble sampling (EasyEnsemble and BalanceCascade)
Over-sampling followed by under-sampling
Under-sampling is performed by Tomek links (T-links) or ENN (cleaning rules)
- SMOTE + Tomek links
- SMOTE + ENN
Other sampling techniques
Resampling and bagging
From Wallace2011. Subdivide the majority class into L subsets and train L predictors against the minority class, then average the prediction scores. The steps are as below:
- Balance the classes by down-sampling the majority class into L different datasets
- Train a decision tree on each dataset; the resulting decision boundaries have high variance
- Bag all the trees and take a majority vote
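The steps above can be sketched directly with scikit-learn decision trees (the dataset, the number of subsets L, and the class ratio are all illustrative assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(180, 2)), rng.normal(loc=2.5, size=(20, 2))])
y = np.array([0] * 180 + [1] * 20)

min_idx = np.flatnonzero(y == 1)
maj_idx = np.flatnonzero(y == 0)

L = 9  # number of balanced down-sampled datasets (a hypothetical choice)
trees = []
for i in range(L):
    # each dataset: a fresh majority subsample + all minority samples
    sub = rng.choice(maj_idx, size=len(min_idx), replace=False)
    idx = np.concatenate([sub, min_idx])
    trees.append(DecisionTreeClassifier(random_state=i).fit(X[idx], y[idx]))

# Bagging: majority vote across the L high-variance trees
votes = np.stack([t.predict(X) for t in trees])
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
```

imbalanced-learn packages the same pattern as `BalancedBaggingClassifier`; the manual loop just makes the three steps explicit.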
Sampling with data cleaning techniques.
Some representative work in this area includes the OSS method; the condensed nearest neighbor rule and Tomek links (CNN + Tomek links) integration method; the neighborhood cleaning rule (NCL) based on the edited nearest neighbor (ENN) rule, which removes examples that differ from two of their three nearest neighbors; and the integrations of SMOTE with ENN (SMOTE + ENN) and SMOTE with Tomek links (SMOTE + Tomek).
Integration of sampling and boosting (SMOTEBoost, DataBoost-IM)
Over / undersampling with jittering (JOUS-Boost)
- For K times replication, find K-nearest neighbors for each minority sample.
- Interpolate at the halfway point, or at a random fraction within (0, 1) of the distance, between each minority sample and each of its K nearest neighbors to generate K synthetic samples
- Note the limitation: “Formally, SMOTE can only fill in the convex hull of existing minority examples, but not create new exterior regions of minority examples.”
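The steps above can be sketched in a few lines of numpy (a minimal SMOTE-like sketch on made-up minority data; the function name, neighbor count, and replication factor are illustrative assumptions, not the reference implementation):

```python
import numpy as np

def smote_sketch(X_min, k=3, n_new_per_sample=2, seed=None):
    """Interpolate between each minority sample and a randomly chosen
    one of its k nearest minority neighbors, at a random fraction in (0, 1)."""
    rng = np.random.default_rng(seed)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    # k nearest neighbors per sample, excluding the sample itself (column 0)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]
    synthetic = []
    for i, neighbors in enumerate(nn):
        for _ in range(n_new_per_sample):
            j = rng.choice(neighbors)
            gap = rng.uniform(0, 1)  # fraction of the distance to interpolate
            synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = np.random.default_rng(0).normal(size=(10, 2))
X_new = smote_sketch(X_min, k=3, n_new_per_sample=2, seed=1)
```

The convex-hull limitation quoted above is visible here: every synthetic point is a convex combination of two existing minority points, so it can never fall outside the per-axis range of the originals.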
Adaptive Synthetic Sampling (Borderline-SMOTE and ADA-SYN)
Penalization, cost-sensitive learning
Penalize making mistakes for the minority classes.
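The simplest form of this penalization is the `class_weight` option that most scikit-learn estimators expose; a sketch on a made-up 19:1 dataset (all numbers are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(190, 2)), rng.normal(loc=1.5, size=(10, 2))])
y = np.array([0] * 190 + [1] * 10)

plain = LogisticRegression().fit(X, y)
# "balanced": misclassification cost inversely proportional to class frequency
weighted = LogisticRegression(class_weight="balanced").fit(X, y)

recall_plain = (plain.predict(X)[y == 1] == 1).mean()
recall_weighted = (weighted.predict(X)[y == 1] == 1).mean()
```

The unweighted model tends to predict the majority class almost everywhere, while the weighted one trades some majority accuracy for much higher minority recall.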
- Cost-Sensitive Dataspace Weighting with Adaptive Boosting (AdaC1, AdaC2, and AdaC3)
Cost-Sensitive Decision Trees
In regard to decision trees, cost-sensitive fitting can take three forms: first, cost-sensitive adjustments can be applied to the decision threshold; second, cost-sensitive considerations can be given to the split criteria at each node; and lastly, cost-sensitive pruning schemes can be applied to the tree.
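The first form, a cost-sensitive decision threshold, is the easiest to sketch: with false-negative cost c_fn and false-positive cost c_fp, the cost-minimizing rule predicts positive when p > c_fp / (c_fp + c_fn). The tree, costs, and data below are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(180, 2)), rng.normal(loc=1.5, size=(20, 2))])
y = np.array([0] * 180 + [1] * 20)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
proba = tree.predict_proba(X)[:, 1]

c_fn, c_fp = 5.0, 1.0              # assumed costs: missing a minority case is 5x worse
threshold = c_fp / (c_fp + c_fn)   # 1/6, well below the default 0.5

y_default = (proba >= 0.5).astype(int)
y_costed = (proba >= threshold).astype(int)
```

Lowering the threshold can only add positive predictions, so minority recall under the cost-derived threshold is never worse than under the default one.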
- Cost-Sensitive Neural Networks
- Cost-Sensitive Bayesian Classifiers
- Cost-Sensitive SVMs
Consider decision tree models
Anomaly or Change detection.
- One-class classification: anomaly detection, change detection, novelty detection, outlier detection. This may be the option for the worst-case scenario of imbalance.
- Autoassociator or Autoencoder
- The Mahalanobis-Taguchi System (MTS)
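A minimal one-class sketch of the first bullet, assuming scikit-learn's `IsolationForest` and synthetic data (the detector choice and all numbers are illustrative; the same pattern applies to other one-class methods):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_normal = rng.normal(size=(500, 2))            # abundant majority class
X_rare = rng.normal(loc=5.0, size=(10, 2))      # rare class, far from the bulk

# Train on the majority class only; rare samples are flagged as anomalies
iso = IsolationForest(random_state=0).fit(X_normal)
pred = iso.predict(X_rare)  # -1 = anomaly, +1 = normal
```

The appeal for extreme imbalance is that the minority class is never needed at training time; it only has to look unlike the majority at prediction time.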