2016 Volume E99.D Issue 12 Pages 3101-3109
Synthetic over-sampling is a well-known approach to class imbalance that modifies the class distribution by generating synthetic samples. A large number of synthetic over-sampling techniques have been proposed; however, most of them suffer from the over-generalization problem, whereby synthetic minority class samples are generated inside the majority class region. Learning from such an over-generalized dataset, a classifier can misclassify majority class members as belonging to the minority class. In this paper, a method called TRIM is proposed to overcome the over-generalization problem. The idea is to identify minority class regions that strike a compromise between generalization and overfitting. TRIM first identifies all the minority class regions in the form of clusters, then merges the large number of small minority class clusters into more generalized ones. To further enhance generalization, a cluster connection step is proposed that increases generalization of the minority class while avoiding over-generalization toward the majority class. As a result, the classifier correctly classifies more minority class samples while maintaining its precision. Experimental results show that TRIM achieves significant performance improvement in terms of F-measure and AUC compared with SMOTE and its extended versions such as Borderline-SMOTE. TRIM can also be used as a pre-processing step for synthetic over-sampling methods such as SMOTE and its extensions.
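To illustrate the kind of synthetic over-sampling the abstract refers to, the nearest-neighbor interpolation at the heart of SMOTE can be sketched as follows. This is a minimal illustration, not the authors' TRIM method; the function name and parameters are hypothetical. Note how interpolating between minority samples that straddle a majority region can place synthetic points inside that region, which is exactly the over-generalization problem described above.

```python
import numpy as np

def smote_sketch(X_min, n_synthetic, k=3, seed=None):
    """Minimal SMOTE-style over-sampling sketch (hypothetical helper).

    For each synthetic sample: pick a random minority sample, pick one
    of its k nearest minority-class neighbors, and interpolate a new
    point at a random position on the line segment between them.
    """
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a sample is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]  # k nearest minority neighbors
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(n)                   # random minority sample
        j = neighbors[i, rng.integers(k)]     # one of its k neighbors
        gap = rng.random()                    # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because every synthetic point is a convex combination of two minority samples, it always lies inside the minority class's convex hull, but not necessarily inside a minority class region, since majority samples can occupy space between minority clusters.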