Abstract
Missing labels in multi-label datasets are a common problem, especially for minority classes, which are more likely to occur. This limitation hinders the performance of classifiers in identifying and extracting information from minority classes. Oversampling is an effective method for addressing imbalanced multi-label problems by generating synthetic instances to create a class-balanced dataset. However, the existing oversampling algorithms mainly focus on the location of the generated data, and there is a lack of design on how to complete the labels of the synthetic data. To address this issue, we propose MLAWSMOTE, a synthetic data generation algorithm based on matrix factorization weights. We introduce a weak supervised learning method in the oversampling method, optimize the weights of features and labels by using label correlation, and iteratively learn the ideal label weights. The mapping relationship between features and labels is learned from the dataset and the label correlation matrix. The oversampling ratio is defined based on the discrepancy between observed labels and the ideal label of synthetic instances. It mitigates the impact of missing minority labels on the model’s predictions. The labeling of synthetic instances is performed based on label prediction, and the potential labeling distribution is complemented. Experimental results on multiple multi-label datasets under different label missing ratios demonstrate the effectiveness of the proposed method in terms of ACC, Hamming loss, MacroF1 and MicroF1. In the validation of the four classifiers, MacroF1 decreased by 24.78%, 17.81%, 3.8% and 19.56%, respectively, with the increase of label loss rate. After applying MLAWSMOTE only decreased by 15.79%, 13.63%, 3.78% and 15.21%.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
1 Introduction
Multi-label classification is an important component of machine learning [53], The task is challenging due to the large output space of the label set [34]. Missing labels are highly likely to occur in multi-label datasets due to the high labeling cost or the carelessness of human labelers, and each sample may only receive a certain subset of labels. Multi-label learning with missing labels problem (MLML) has gained widespread attention and is crucial for addressing missing labels [44]. The reason for missing labels is widely discussed [34, 41] considers that incorrect labels are less likely to occur, and the abundance of high-frequency labels can hinder the observation or accurate labeling of low-frequency labels. Improving the performance of multi-label classifiers for minority labels in MLML has become a practical issue that needs to be addressed.
Recently, there have been several approaches to multi-label classification. For example, problem transformation [15, 38, 54] transforms multi-label classification into a binary classification form. Algorithm adaptation, such as MLknn [52] and Rank-SVM [24], optimally adapts machine learning to the algorithmic process. In MLML, the main assumption is that the labels are correlated with each other, which provides opportunities for matrix decomposition based on lasso regression [17]. For the imbalance MLML (imMLML), current research focuses on optimization at the algorithmic level. For example, [47] provides information about missing labels through the concept of label smoothing, which is similar to downsampling to control the number of positive instances for labels. This method has also been applied to facial action unit recognition [29]. Cheng et al. [11] utilizes inter-label information entropy to complement labels and prevent the effects of data imbalance. However, these optimizations are primarily developed at the algorithmic level, [29, 47] are mainly focus on specific LLEs, while [11] applied them to extreme learning machines. These approaches have limitations for imMLML in various classifiers, hindering their broader application.
The data level is an important research topic in imbalanced problems. Synthetic instances of the minority labels can be generated using methods such as SMOTE [9], GAN [16] or VAE [23], which can be applied in different classifiers. REMEDIAL [8] and ML-ROS [6] achieve an increase in the number of minority labels by replicating minority samples for different label assignments. MLSMOTE [7] and MLSOL [31] generate synthetic instances using SMOTE and assign synthetic labels using neighbor information. However, imMLML is a common phenomenon in the multi-label field, and the main over-sampling focuses on the definition of labels [6, 8] and the selection of the generation position [7, 31]. Although there are methods involving using fuzzy rough sets to define labels [33], there is still a significant challenge in effectively filling in the missing labels for synthetic samples.
To recover missing labels in imMLML and to increase the number of minorities, we propose the oversampling method \({\textbf{M}}\)ulti-\({\textbf{l}}\)abel with \({\textbf{A}}\)uxiliary \({\textbf{W}}\)eight \(\textbf{SMOTE}\) (MLAWSMOTE) based on two matrix decomposition weights, which learn from instance-label and label correlation, respectively. The former is derived from the variance between observed and retrieved labels, while the latter is acquired from the label distribution. The ratio for synthetic samples is generated based on the recovered and observed labels and label distributions to prioritize samples with missing labels and low frequency. The labels for synthetic samples are determined by the instance-label weight. We validate the effectiveness of our proposed algorithm by combining various datasets, missing label ratios, and classifiers collectively.
This paper makes the following contributions:
-
We propose an oversampling approach using label correlation weights to address label missing and imbalance;
-
We combine label distribution and label recovery differences to enhance samples with low frequency and high label missing probability;
-
We propose using weights to complement the labels of generated instances, aiming to optimize the performance of multi-label classifiers.
The remainder of this paper is divided into five sections. Section 2 is the related work including MLML and imMLML. Section 3 introduces our proposed method. Section 4 is the experimental setting and analysis. Section 5 is the conclusion.
2 Related Work
2.1 Multi-label Learning with Missing Labels
In real-world scenarios [1, 3, 4, 25, 26], accurate labeling of samples in multi-label problems is often challenging due to the complexity of the label output space. Consequently, only a subset of the correct labels can be obtained, leading to the phenomenon known as MLML. [34] suggests that MLML can be categorized into implicit and explicit types based on the causes of the problem. Implicit MLML considers that labeling part of the positive class as a negative class is due to label loss. Explicit MLML suggests that the unobservable labels are the main reason. Among these, the former is more likely to occur in the problem of imbalance, as minority labels are typically harder to detect. [41] also shows that the proportion of missing tail labels is larger than the proportion of missing head labels. Matrix decomposition and graph-based methods are the primary techniques employed to extract label correlations from the dataset and to impute missing labels.
In multi-label learning, the label matrix is usually considered low-rank due to the presence of label correlations, and the low-rank matrix is also a general way to complete the missing labels. The optimization of algorithm convergence is also an important part of prediction application [1, 2, 36, 40]. For example, [21] uses the label correlation matrix to capture the influence of different labels and selects shared sparse feature structures between labels by applying the l-1 and l-2 penalties. LRMML [27] conducts a study on matrix decomposition in explicit machine learning. They utilized the graph Laplacian of the label association matrix to represent the idea that the more relevant the labels are, the more similar the label-specific weights. In terms of weight optimization, C2ML and C2MLE [44] apply a clustering approach. They use fuzzy c-means to obtain cluster centroid and granular discrimination-based regularization to optimize weights for inter-class and intra-class distance sensitivity. [12] considers that the previous label correlation matrix ignores the information of the instance and assists in establishing the relationship with the missing label by constructing the attention distribution matrix and the fusion matrix. Zhang et al. [50] extends the matrix decomposition-based method to semi-supervised MLML using both unlabeled and labeled data to construct a feature manifold regularizer for learning local label correlations. In addition to assuming a low rank for the label correlation matrix, ColEmbed [18] considers that both the recovered feature matrix and the classifier are low-rank. It also applies cost-sensitive matrix factorization to address the imbalance problem.
Graph-based methods utilize label correlation to construct a weighted graph for predicting missing labels. Wu et al. [47] establishes graph networks based on the similarity of samples and labels while [10] accomplishes it by correlation distances in features and labels. Hashemi et al. [20] combines labeled directed graph dependencies with image classification, while also incorporating word embedding. Beside these, methods such as evolutionary algorithm [30] and information entropy [19] are also solutions for MLML.
2.2 Imbalanced Multi-label Learning
Class imbalance distribution is a widespread phenomenon in multi-label datasets. Improvements are mainly made at the algorithm level and data level. At the algorithm level, methods like [46, 48] improve the classification performance in specific tasks.
Although imbalanced label distribution is a common problem in MLML, data distribution can be improved by downsampling and oversampling. For example, MLeNN [5] and MLTL [37] draw on the Tomek Link method to eliminate majority classes that overlap each other. Roseberry et al [39] incorporate the idea of downsampling into the identification of data streams, quickly adapting to changes in data streams by removing data.
The oversampling algorithm expands the minority samples by generating synthetic samples. Generating synthetic instances from the feature level is also the primary method in dichotomous and multi-classification. MLSMOTE and MLSOL involve interpolating weights from various minority groups. EMLSOL [32] oversamples the minorities and downsamples the majorities. LCOS [51] considers label correlations and assigns weights to synthetic instances based on their distance information. Due to the multi-label property, methods like LP-ROS and ML-ROS perform repetitive oversampling by defining the minority labels differently, while REMEDIAL clones the samples and decouples the labels. The detail of literature review is in Table 1, it is common to lose minority labels in the multi-labeling dataset. However, there is a lack of oversampling algorithms to address the class imbalance resulting from this issue.
3 Proposed Method
In our work, we assume that the data set D contains n instances \(\{(x_{1},y_{1}),...,(x_{n},y_{n})\}\), where \(x \in R^{d*n}\) and d is the data dimension, y in \(\{-1,1\}^{q*n}\) and q is the label dimension. As a general rule, it is more challenging to accurately identify the labels of minority groups. y in \(-1\) contains negative or missing labels. Our goal is to improve the prediction of minority class labels in imMLML. The architecture of our proposed algorithm is shown in Fig. 1, and we elaborate on it in this section.
3.1 Matrix Factorization Weights Construction
In previous works [27, 35, 42], the correspondence between samples and labels is assumed to be a linear function, ie., \(Y = WX\) and \(W=[w_{1},w_{2},...,w_{q}]^T\) where W in \(R^{q*d}\). Each column in W is responsible for mapping the data dimensions to a specific label, where \(w_{ij}\) indicates the contribution of the j-th dimensional feature to the i-th label. To determine the feature-label correspondence in the MLML, W typically utilizes l1 or l2-norm penalty for constraint. Such a sparsification operation restricts the W corresponding to certain labels to a smaller value to achieve feature selection. Assuming Y represents the observed labels and \(\Omega Y\) represents the recovered labels. By doing so, WX can map the features to the ideal labels \(\Omega Y\):
where the former term implements the fit \(\Omega Y\) by W, while the latter term controls the complexity of W. \(\lambda _{1} \ge 0 \) is its tradeoff parameter, \(\Vert \cdot \Vert _{F}\) represents the Frobenius norm.
Kumar and Rastogi [27], Sun et al. [42] point out that a linear self-recovery matrix \(\Omega \) can be learned from the observed label Y, and \(\Omega \) is based on three assumptions. The first observation is that the observed variable Y should be identifiable in the recovered variable \(\Omega Y\) because Y is true when it equals one. Also, \(\Omega \) should be a low-rank matrix because labels are correlated with each other in multi-label learning. \(\Omega \) being low-rank is advantageous for filling in missing labels. The last one is \(\Omega \) should be greater than zero. We assume that labels are positively influenced by each other and do not intend for labels that are already positive to be corrupted by \(\Omega \).
Since the issue of missing labels is prevalent in imbalanced distributions, it is essential to enhance W with guidance from the label correlation matrix \(\Omega \). If two labels \(l_{i}\) and \(l_{j}\) have a high correlation, then \(w_{i}\) and \(w_{j}\) are also more correlated, indicating a similar Euclidean distance. For the label correlation matrix \(\Omega \), the feature-specific weights can be constrained by the Laplacian matrix \(L \in R^{q*q}\), it will help W learn from second-order label correlation. Accordingly, the way to build W and \(\Omega \) expressed as follows:
where \(\lambda _{1}\) controls label-specific feature and the larger the more sparse feature is. \(\lambda _{2}\) controls loss for the observed and recovered labels. \(\Vert \cdot \Vert _{1}\) represents \(l_{1}\) norm regularizer to learn sparse label dependencies. \(\lambda _{3}\) controls the sparsity of correlation between different labels. \(\lambda _{4}\) controls the distance between model coefficients for any two labels. The optimization process is in the Sect. 3.2.
3.2 Matrix Factorization Weights Optimization
Equation 2 is convex and non-smooth due to it has \(l_{1}\)-norm terms. It can be solved by accelerated proximal gradient method [28] in an alternating way.
\(\mathbf {Step\;1.}\) Fixing \(\Omega \), the derivative for the first and last term involved W in Eq. (2) is:
The term of \(l_{1}\)-norm for W can be solved by the element-wise soft-threshold operator as:
where \(\epsilon _{1}\) is a small value to activate w. In t-th epoch, W can be updated by:
where \(\theta \) is the step size. The update of W is realized by Eqs. (3)–(6) when \(\Omega \) is fixed.
\(\mathbf {Step\;2.}\) With W fixed, the derivative in Eq. (2) for \(\Omega \) in the first and third term is:
For the \(l_{1}\)-norm and nonnegative constraint for \(\Omega \), it can be solved by the element-wise soft-threshold operator as:
From literature [22], \(\epsilon _{1}\) and \(\epsilon _{2}\) are related to \(\lambda _{1}\) and \(\lambda _{3}\). To reduce the impact of parameter settings, we can set \(\epsilon \) equal to the corresponding \(\lambda \) respectively. Combine with Eq. (7) and (8), the update method for \(\Omega \) in the accelerated proximal gradient is:
By alternately training steps 1 and 2 until the change in weights reaches a threshold. When we acquire W and \(\Omega \), they can help determine the missing label degree for samples, which is valuable for creating synthetic data.
3.3 Synthetic Data Generation Ratio
When the auxiliary weights W and \(\Omega \) are determined, data distribution and label inconsistency impact the confidence scores of the samples. The data distribution focuses on the generation proportion for the minorities, while prediction error underscores the significance of generation for samples with high probability of missing labels. Based on these considerations, we weighted the data generation for each minority sample based on their label distribution and prediction loss.
where \(n_{j}\) is the effective number of labels and \(\beta \) is smaller than 1. The minority samples are weighted by a factor of \((1-\beta )/(1-\beta ^{n_{j}})\) based on the inverse label frequency. We set the hyperparameter \(\beta \) to 0.99. Due to W being learned from the recovered label \(\Omega Y\), when the value of \(\Vert Y - WX \Vert ^ {2}\) is larger, it indicates that the probability of mislabeling this sample is higher. Based on the two parts, Eq. (11) can address labels with minority and missing values. Similar to MLSOL, MLAWSMOTE defines minority labels as those that constitute less than half of n. When the sample contains minority labels, the operation of Eq. (11) is performed, and the proportion g is generated after normalization.
3.4 SMOTE Operation with Auxiliary Weights
In multi-label learning, there are varies methods to assign values to synthetic data labels, including copy labels [8], knn statistical assignment [7], SMOTE dependent sample assignment [31]. However, by assigning values that depend on the original observed label, no additional information can be provided while the dataset is imMLML. Leveraging neighbor information for label assignment has a restricted capability to supplement absent labels and influences the capacity to strengthen boundaries.
We consider WX capable of complementing labels based on label correlation. When dealing with synthetic data created by SMOTE, we calculate predicted labels by multiplying WX when the distance is near the seed instance, and then combine them with the corresponding seed sample. It is ensured that the synthetic instance is complemented with any possible missing labels based on the obtained seed sample and its observed labels. The synthetic instances are featured and labeled as shown in Eqs. (12) and (13).
where x is the seed sample of SMOTE, xn is its neighbour. xs and ys represent the feature and label for synthetic instances, while r denotes a random value. It is worth noting that the random value r of SMOTE should preferably be in the range of [0, 0.5] to avoid assigning different labels in the region of [0.5, 1], as the generated SMOTE samples may also belong to the minority class. Overall, the pseudocode is presented in Algorithm 1 and the flow chart is in Fig. 2. From Fig. 2, MLAWSMOTE is mainly divided into three parts, defining minoirty labels via IR; training W and \(\Omega \) in an iterative way; use the weight W to define generate ratio and SMOTE operation.
3.5 Computational Complexity and Convergence Analysis
MLAWSMOTE can be divided into the following stages: acquiring the auxiliary weights W and \(\Omega \), calculating the generation ratio g and generating the synthetic samples. The most time-consuming complexity during the iteration is computing W and \(\Omega \). The complexity of computing the gradient of W is \(O(nd^{2} + qd^{2} + dnq + dq^{2})\), and it requires \(O(nq^{3} + q^{3} + dq^{2} + dnq)\) to obtain the gradient of \(\Omega \). The complexity to calculate the generation ratio is O(n) and SMOTE requires \(O(n^{2}d + kn^{2} + pn(q+d))\), where the first two represent the time complexity of computing knn. The former refers to the complexity of the algorithm, while the latter pertains to the complexity of generating synthetic data based on the overrall generated ratio p.
In terms of algorithmic complexity, the oversampling algorithm can proceed by first defining the minority samples, generating samples using SMOTE, and implementing the proposed label correction method. LP-ROS, ML-ROS, and REMEDIAL focus on defining minority samples. LP-ROS and REMEDIAL utilize mean size and SCUMBLE to define the minority class and perform sample cloning, respectively. Following these two steps, the algorithmic complexity can be approximated as \(O(qn+n)\). ML-ROS requires the generation of a proportion p for control, with algorithmic complexity O(qnp). MLSMOTE and MLSOL require the additional use of kNN to find SMOTE samples. This, in combination with the step of defining the minority samples, has computational complexity \(O(qn+n^{2}d+kn^{2})\) for the former step. The latter additionally requires the computation of weights, with complexity \(O(qn+n^{2}d+kn^{2}+nkq)\). The computational complexity for MLCSMOTE-FRST is \(O(qn+n^{2}d+kn^{2})\) plus \(O(qn+n^{2}d+kn^{2}+nkq)\) due to it has two steps, which are focus define the generation location and revise the labels. Compared with other algorithms, MLAWSMOTE mainly lies in the time spent on the optimization of W and \(\Omega \). Fortunately, the optimization method such as [55], which uses the alternating direction method of multipliers to reduce the time of parameter selection, can also be used in the process of MLAWSMOTE parameter optimization.
According to the definition in [42], the parameters controlled by \(\lambda _{1}\) and \(\lambda _{3}\) are convex but nonsmmoth, while the rest can be regarded as convex. For Linear inverse problems, the convergence analysis of iterative shrinkage-thresholding algorithms has been proved in previous literature [13, 14]. To further illustrate the convergence of MLAWSMOTE, we plotted the difference curves of WX and \(\Omega Y\) for different datasets under a \(30\%\) minority label missing rate to show that W and \(\Omega \) are effective. In Fig. 3, it can be observed that, except for the medical dataset which started to oscillate, the other datasets maintained excellent convergence.
4 Experiments
To demonstrate the superiority and stability of our proposed algorithm, we proceed from the following three research questions.
\(\mathbf {Research\;question\;1}\): Compare MLAWSMOTE and other data generation algorithms in the absence of minority class labels in the training set, and imbalance related metric changes between classifiers.
\(\mathbf {Research\;question\;2}\): Based on the above classifiers and the compared generation algorithms, the nonparametric tests is validated for difference minority labels missing ratio.
\(\mathbf {Research\;question\;3}\): The parameter sensitivity of MLAWSMOTE is demonstrated by analyzing the classification results.
We analyze the above research questions in the Sects. 4.2–4.4 and before that we present the datasets and the compared generation algorithms and the different classifiers.
4.1 Experimental Framework
To verify the effectiveness of the algorithm, we conducted six datasets with varying ratios of missing minority labels. The domains in datasets include music, audio, text, biology, and images. The split for training and testing was supported by scikit-multilearn [43]. Similar to the matrix decomposition in related papers [49, 56], for a dataset where \(n > q\), we use PCA to retain the top \(95\%\) of the features; otherwise, we apply standard normalization. For the minority labels, the missing rate ranges from 0 to \(30\%\), with a sampling interval of \(5\%\). Overall, the number of datasets we evaluated is 42, as each benchmark has seven missing rates. The details of the datasets are shown in Table 2.
The comparison algorithm utilizes the oversampling technique to address the imbalance in the multi-label dataset. We also used the fuzzy rough set oversampling algorithm MLCSMOTE-FRST [33] for comparison, which mined the label correlation with minority sample neighbors and used OWA to complete the label imputation. The detail of the compared oversampling methods are shown in Table 1. In the experiments of classification metrics, the generation proportion is set to \(25\%\) of the total dataset, and the number of nearest neighbors in MLSMOTE, MLSOL, and MLAWSMOTE is set to 10.
The iteration rounds of MLAWSMOTE are set to 100 epochs. After each round of the training process, the difference between WX and \(\Omega Y\) is recorded. The epoch with the smallest difference is selected to obtain the relevant W and \(\Omega \). The parameter \(\lambda _{1}\) to \( \lambda _{4}\) is set consistently from \(1e^{-3}\)to \(1e^{-6}\) to find the maximum convertible parameter.
The classification algorithm used in the experiment consists of two types: problem transformation and matrix decomposition. Problem transformation techniques include binary relevance (BR), classifier chains (CC) [38], and random k-labelsets (RAkEL) [45]. These three methods transform the multi-label problem into binary classification, chain structure, and label subsets, respectively. Multi-label classification with group-based mapping (MC-GM) [35] is a type of matrix decomposition that assumes similar features and labels correspond to instances in the same group. Label extraction is carried out based on the relationship between groups and labels.
There are multiple ways to evaluate performance in imbalanced multi-label datasets. We would prefer to evaluate the overall classification effectiveness while also paying attention to minority groups. Based on these two considerations, we choose example-based accuracy (ACC) and Hamming loss (HL) to describe the overall metrics of different data generation methods corresponding to classification. We use Macro F1 and Micro F1 to reflect the metrics related to the minority. The different metrics are as follows:
where HL measures the number of incorrectly predicted labels as a percentage of the overall number of labeled labels. The smaller indicated the better. In ACC, MacroF1, MicroF1, the higher means the better performance, TP, FP and FN are the sum of the corresponding values of each class. ACC is the accuracy of all data in the test set, which can reflect the impact on the accuracy of all data after adding synthetic samples. MacroF1 takes the mean value of F1-Score of all labels and MicroF1 calculates the accuracy and recall of all labels to obtain the F1-Score value according to the calculated results. In the case of an extremely imbalanced case, MacroF1 can better reflect the classification of the minority because it considers all classes in a balanced way to get the F1-score.
4.2 Results of Sampling Methods
Due to space constraints, we present the model performance corresponding to the BR classifier (with \(10\%\) as the missing label interval) in Tables 3 and 4, where BR might have a better adaptability in missing label by removing invisible samples of labels [54]. Also, in Tables 5 and 6, we present the average ranking of for the four classifiers, the missing label interval is same with Tables 3 and 4. The complete average and standard deviation performance is presented in the supplementary material.Footnote 1 We note that the oversampling methods and MLAWSMOTE contain SMOTE random seed and other random parameters. To verify the stability of each method, we conducted each experiment five times to calculate the average values.
We perform the following analysis using various metrics, including missing label rates and applied classifiers.
4.2.1 Comparison of Oversampling Algorithms
Compared to other oversampling methods, our proposed method performs well with all four classifiers. In three problem-transfer classifiers, our proposed method achieves the optimal average values when minority labels are missing. It shows that our proposed method can help the classifiers maintain good performance for both global and minority-related outcomes. Other oversampling methods, on the other hand, have inconsistent effects on the classifiers. For example, LP-ROS achieves suboptimal MicroF1 for Rakel at a \(30\%\) missing rate, while the CC classifier is weaker than ML-ROS and MLSOL.
4.2.2 Impact from Minority Labels Missing Rate
Typically, the metrics exhibit a declining trend with an increase in the label missing rate. In the case of MacroF1, the metrics experienced reductions of \(24.78\%\), \(17.81\%\), \(3.8\%\), and \(19.56\%\) for the four classifiers, respectively, in the absence of synthetic instances. The reductions in performance for MLAWSMOTE are \(15.79\%\), \(13.63\%\), \(3.78\%\), and \(15.21\%\), respectively, demonstrating its superior performance compared to other oversampling techniques. In comparison to MacroF1, MicroF1 demonstrates a smaller decrease when using MLAWSMOTE, with reductions of \(9.79\%\), \(8.1\%\), \(2.8\%\), and \(10.31\%\). The decrease in MicroF1 was observed to be less pronounced compared to MacroF1 as the percentage of missing labels increased in the other oversampling methods. We posit that the sensitivity of MacroF1 to the identification of minority classes is the underlying reason.
4.2.3 Impact from Different Classifiers
In comparison to transfer-based methods, the MC-GM correlation metrics exhibit less variation as the missing rate increases. MC-GM also incorporates label-related information by utilizing global group and instance-group relations, thereby mitigating the impact of label missing on classification outcomes. Leveraging the synthetic instances generated by MLAWSMOTE, MC-GM attained the highest average metrics in terms of ACC, HL, and MicroF1. In three assessed transfer-based methodologies for problem-solving, the performance evaluation revealed a pattern of declining metrics with an increase in the missing label rate. In the case of the Emotions dataset, for example, MacroF1 decreased by \(36.99\%\), \(25.78\%\), and \(33.42\%\) without the addition of synthetic data. In contrast, when using the MLAWSMOTE method, MacroF1 decreased by \(16.45\%\), \(14.2\%\), and \(19.69\%\) respectively, demonstrating the stability of the corresponding metrics of our method. In addition, various datasets and prediction targets necessitate the selection of suitable classifiers. For example, the MicroF1 of Rakel with MLAWSMOTE is \(44.28\%\), whereas it is \(11.82\%\) in MC-GM with MLAWSMOTE. However, this already represents the best performance for MC-GM with the addition of synthetic methods.
To further inspect the significance of our proposed algorithm, we perform parameter tests in Sect. 4.3.
4.3 Significance Test
To compare oversampling algorithm metric changes with varying label missing rates, we categorize missing rates as low (5–15\(\%\)) and high (20–30\(\%\)), then compare the Friedman test results for each classifier. A p-value less than 0.05 indicates a significant difference between the oversampling algorithms. A smaller p-value indicates a more significant difference (Table 7).
The MC-GM classifier shows no significant difference between the oversampling methods for MacroF1 at low missing levels, whereas the p-value at high levels is less than 0.05. Significant differences were observed in other problem transfer-related classifiers. The p-value of BR in MacroF1 and MicroF1 is significantly smaller than that of Rakel and CC, indicating a substantial difference.
We also present the Nemenyi test in the BR classifier in Figs. 4 and 5, CD equals 1.9389, corresponding to 18 datasets (six datasets with three missing rates for low and high missing levels). MLAWSMOTE stands out from other oversampling methods in terms of ACC, MacroF1, and MicroF1 at the high label missing rates. Among the validated metrics, MLAWSMOTE tends to group similarly to MLCSMOTE-FRST, LP-ROS and ML-ROS. The original dataset for the HL, showing less variability compared to the other three metrics. According to Table 5, LP-ROS improves average rankings for certain minority-sensitive metrics as the label missing rate increases, but it is not significantly different from not adding simulated data. Our proposed oversampling method stands out for its higher significance compared to other methods.
4.4 Parameter Sensitivity
The parameters of MLAWSMOTE can be divided into two parts: the parameters \(\lambda _{1},\lambda _{2},\lambda _{3},\lambda _{4}\) related to MLAWSMOTE convergence and the parameters r and p related to oversampling. For convergence-related parameters \(\lambda _{1}\) to \(\lambda _{4}\), we recommend grid search for parameter selection. The reason is that we find that the magnitude of \(\lambda \)-related parameters has different effects on the results. Taking Yeast and BR as an example, when we change one parameter and fix the other three parameters, the impact on classification results is shown in Fig. 6. By applying grid search, when \(\lambda _{1} = 0.001\), \(\lambda _{2} = 0.0001\), \(\lambda _{3} = 1\) and \(\lambda _{4} = 0.001\), Yeast achieves the best classification level.
Similar to most oversampling methods for multi-label imbalance learning, MLAWSMOTE is concerned with the number of neighbors k and the generation ratio p. k is concerned with the expansion of the space of possible generation of minority samples and is related to the ability of the algorithm to construct decision boundaries. p focuses on the number of samples generated and is related to the degree of imbalance reduction. In this section, we use Yeast as the evaluated dataset, which is also the primary binary and multi class dataset. The classifier uses BR which achieves good performance as a classifier in both MacroF1 and MicroF1. Missing rate equals to \(30\%\) was applied as a control ratio to verify the parameter variation in the extreme minority of missing labels.
In Fig. 7, changes in k have minimal impact on performance, particularly when there is an extreme missing rate of minority labels. In such cases, limited labeling is available for generating synthetic samples over an extended range. Interestingly, in Fig. 8, all the metrics of MLAWSMOTE show an increasing trend as the sampling rate increases. The metrics of the other oversampling algorithms show little change. The synthetic instances generated by MLAWSMOTE can enhance the classifier’s focus on minorities and complement labeling information, improving metrics like ACC and HL. In practical applications, it is essential to comprehensively design the oversampling ratio based on the computational performance of the classifier.
4.5 Discussion
MLML is often formulated as a matrix completion problem that uses low-rank assumptions about the matrix to fill in the missing entries based on the observed items, which is also the role of \(\Omega \) completion. In addition, Eq. (2) incorporates the second-order label correlation by calculating the Euclidean distance. This is based on the manifold hypothesis of label space. That is, if the label correlation is high, they will have similar label-specific features, and the corresponding W has a similar Euclidean distance. MLAWSMOTE, designed on the basis of these assumptions, has the following advantages and limitations.
MLAWSMOTE can be used directly in multi-label classifiers and performs well in multiple classifiers. We found that both the global indicators and minority label related indicators have improved. As can be seen from Fig. 4c, for MicroF1, label imputation method MLCSMOTE-FRST still has some advantages when the label missing rate is low, but with the increase of label missing rate, both MacroF1 and microF1 show excellent performance.
However, compared with other oversampling algorithms, MLAWSMOTE requires multiple iterations to learn W and \(\Omega \), and needs to choose on parameters of \(\lambda \), which requires more computation time for other methods. In terms of the performance of classifiers, although the advantage of problem-transfer classifiers is obvious, the MC-GM improvement effect on labels that can be supplemented is modest.
5 Conclusion
We introduce MLAWSMOTE, an oversampling method combined with matrix decomposition, to tackle minority label missing in multi-label issues. MLAWSMOTE differs from traditional oversampling by focusing on label correlations. MLAWSMOTE learn prediction and label association weights based on the assumption of label association. It adjusting data generation ratios based on recovered labels versus original ones. For synthetic instances in the feature space, auxiliary weights are used to complement their label space information. Our proposed algorithm shows strong performance across diverse datasets with varying missing rates, the classification performance of problem-transfer classifiers is improved. Concretely, MLAWSMOTE achieving high scores in global metrics like ACC and HL, as well as minority correlation metrics such as MacroF1 and MicroF1.
In future research, further study is needed to effectively define the generation ratio in imMLML. MLAWSMOTE stands out from other oversampling methods as the metrics improve with an increase in the generation ratio. The integration of training efficiency of the classifier and desired metrics to determine the number of synthetic instances generated needs further refinement.
Data Availability
This publication is supported by multiple datasets, which are openly available at locations cited in the reference section.
Notes
Full experimental results: Huang, Kai (2024), “MLAWSMOTE”, Mendeley Data, V1, https://doi.org/10.17632/smdth2ky5z.1.
References
Al-Janabi, S.: Overcoming the main challenges of knowledge discovery through tendency to the intelligent data analysis. In: 2021 International Conference on Data Analytics for Business and Industry (ICDABI), IEEE, pp 286–294, (2021). https://doi.org/10.1109/ICDABI53623.2021.9655916
Al-Janabi, S., Al-Janabi, Z.: Development of deep learning method for predicting dc power based on renewable solar energy and multi-parameters function. Neural Comput. Appl. 35(21), 15273–15294 (2023). https://doi.org/10.1007/s00521-023-08480-6
Al-Janabi, S., Alkaim, A.: A novel optimization algorithm (lion-ayad) to find optimal dna protein synthesis. Egypt. Inf. J. 23(2), 271–290 (2022). https://doi.org/10.1016/j.eij.2022.01.004
Al-Janabi, S., Mohammed, G.: An intelligent returned energy model of cell and grid using a gain sharing knowledge enhanced long short-term memory neural network. J. Supercomput. 80(5), 5756–5814 (2024). https://doi.org/10.1007/s11227-023-05609-1
Charte, F., Rivera, AJ., del Jesus, MJ., et al.: Mlenn: a first approach to heuristic multilabel undersampling. In: Intelligent Data Engineering and Automated Learning–IDEAL 2014: 15th International Conference, Salamanca, Spain, September 10–12, 2014. Proceedings 15, Springer, pp 1–9, (2014). https://doi.org/10.1007/978-3-319-10840-7_1
Charte, F., Rivera, A.J., del Jesus, M.J., et al.: Addressing imbalance in multilabel classification: Measures and random resampling algorithms. Neurocomputing 163, 3–16 (2015). https://doi.org/10.1016/j.neucom.2014.08.091
Charte, F., Rivera, A.J., del Jesus, M.J., et al.: Mlsmote: approaching imbalanced multilabel learning through synthetic instance generation. Knowl. Based Syst. 89, 385–397 (2015). https://doi.org/10.1016/j.knosys.2015.07.019
Charte, F., Rivera, A.J., del Jesus, M.J., et al.: Dealing with difficult minority labels in imbalanced mutilabel data sets. Neurocomputing 326, 39–53 (2019). https://doi.org/10.1016/j.neucom.2016.08.158
Chawla, N.V., Bowyer, K.W., Hall, L.O., et al.: Smote: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002). https://doi.org/10.1613/jair.953
Chen, Z., Wei, X.S., Wang, P., et al.: Learning graph convolutional networks for multi-label recognition and applications. IEEE Trans. Pattern Anal. Mach. Intell. (2021). https://doi.org/10.1109/tpami.2021.3063496
Cheng, Y., Qian, K., Wang, Y., et al.: Missing multi-label learning with non-equilibrium based on classification margin. Appl. Soft Comput. 86(105), 924 (2020). https://doi.org/10.1016/j.asoc.2019.105924
Cheng, Y., Qian, K., Min, F.: Global and local attention-based multi-label learning with missing labels. Inf. Sci. 594, 20–42 (2022). https://doi.org/10.1016/j.ins.2022.02.022
Daubechies, I., Defrise, M., De Mol, C.: An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun. Pure Appl. Math. J. Issued Courant Inst. Math. Sci. 57(11), 1413–1457 (2004). https://doi.org/10.1002/cpa.20042
Figueiredo, M.A., Nowak, R.D.: An em algorithm for wavelet-based image restoration. IEEE Trans. Image Process. 12(8), 906–916 (2003). https://doi.org/10.1109/tip.2003.814255
Fürnkranz, J., Hüllermeier, E., Loza Mencía, E., et al.: Multilabel classification via calibrated label ranking. Mach. Learn. 73, 133–153 (2008). https://doi.org/10.1007/s10994-008-5064-8
Goodfellow, I., Pouget-Abadie, J., Mirza, M., et al.: Generative adversarial networks. Commun. ACM 63(11), 139–144 (2020). https://doi.org/10.1145/3422622
Han, Y., Wu, F., Zhuang, Y., et al.: Multi-label transfer learning with sparse representation. IEEE Trans. Circ. Syst. Video Technol. 20(8), 1110–1121 (2010). https://doi.org/10.1109/tcsvt.2010.2057015
Han, Y., Sun, G., Shen, Y., et al.: Multi-label learning with highly incomplete data via collaborative embedding. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp 1494–1503, (2018). https://doi.org/10.1145/3219819.3220038
Hashemi, A., Dowlatshahi, M.B., Nezamabadi-Pour, H.: Mfs-mcdm: Multi-label feature selection using multi-criteria decision making. Knowl. Based Syst. 206(106), 365 (2020). https://doi.org/10.1016/j.knosys.2020.106365
Hashemi, A., Dowlatshahi, M.B., Nezamabadi-Pour, H.: Mgfs: A multi-label graph-based feature selection algorithm via pagerank centrality. Expert Syst. Appl. 142(113), 024 (2020). https://doi.org/10.1016/j.eswa.2019.113024
He, Z.F., Yang, M., Gao, Y., et al.: Joint multi-label classification and label correlations with missing labels and feature selection. Knowl. Based Syst. 163, 145–158 (2019). https://doi.org/10.1016/j.knosys.2018.08.018
Huang, J., Qin, F., Zheng, X., et al.: Improving multi-label classification with missing labels by learning label-specific features. Inf. Sci. 492, 124–146 (2019). https://doi.org/10.1109/bigmm.2018.8499080
Huang, K., Wang, X.: Ada-incvae: Improved data generation using variational autoencoder for imbalanced classification. Appl. Intell. 52(3), 2838–2853 (2022). https://doi.org/10.1007/s10489-021-02566-1
Jiang, A., Wang, C., Zhu, Y.: Calibrated rank-svm for multi-label image categorization. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, pp 1450–1455, (2008). https://doi.org/10.1109/ijcnn.2008.4633988
Kadhuim, Z.A., Al-Janabi, S.: Codon-mrna prediction using deep optimal neurocomputing technique (dlstm-dsn-woa) and multivariate analysis. Results Eng. 17(100), 847 (2023). https://doi.org/10.1016/j.rineng.2022.100847
Kadhuim, Z.A., Al-Janabi, S.: Intelligent deep analysis of dna sequences based on ffgm to enhancement the performance and reduce the computation. Egypt. Inf. J. 24(2), 173–190 (2023). https://doi.org/10.1016/j.eij.2023.02.004
Kumar, S., Rastogi, R.: Low rank label subspace transformation for multi-label learning with missing labels. Inf. Sci. 596, 53–72 (2022). https://doi.org/10.1016/j.ins.2022.03.015
Li, H., Lin, Z.: Accelerated proximal gradient methods for nonconvex programming. Adv. Neural Inf. Process. Syst. 28, (2015). https://doi.org/10.1007/s10107-015-0871-8
Li, Y., Wu, B., Zhao, Y., et al.: Handling missing labels and class imbalance challenges simultaneously for facial action unit recognition. Multimed. Tools Appl. 78, 20309–20332 (2019). https://doi.org/10.1007/s11042-018-6836-1
Lim, H., Kim, D.W.: Mfc: Initialization method for multi-label feature selection based on conditional mutual information. Neurocomputing 382, 40–51 (2020). https://doi.org/10.1016/j.neucom.2019.11.071
Liu, B., Tsoumakas, G.: Synthetic oversampling of multi-label data based on local label distribution. In: Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2019, Würzburg, Germany, September 16–20, 2019, Proceedings, Part II, Springer, pp 180–193, (2020). https://doi.org/10.1007/978-3-030-46147-8_11
Liu, B., Blekas, K., Tsoumakas, G.: Multi-label sampling based on local label imbalance. Pattern Recogn. 122(108), 294 (2022). https://doi.org/10.1016/j.patcog.2021.108294
Liu, J., Huang, K., Chen, C., et al.: An oversampling algorithm of multi-label data based on cluster-specific samples and fuzzy rough set theory. Complex & Intelligent Systems pp 1–16, (2024). https://doi.org/10.1007/s40747-024-01498-w
Liu, W., Wang, H., Shen, X., et al.: The emerging trends of multi-label learning. IEEE Trans. Pattern Anal. Mach. Intell. 44(11), 7955–7974 (2021). https://doi.org/10.1109/tpami.2021.3119334
Ma, J., Chiu, B.C.Y., Chow, T.W.: Multilabel classification with group-based mapping: a framework with local feature selection and local label correlation. IEEE Trans. Cybern. 52(6), 4596–4610 (2020). https://doi.org/10.1109/tcyb.2020.3031832
Mohammed, G.S., Al-Janabi, S.: An innovative synthesis of optmization techniques (fdire-gsk) for generation electrical renewable energy from natural resources. Results Eng. 16(100), 637 (2022). https://doi.org/10.1016/j.rineng.2022.100637
Pereira, R.M., Costa, Y.M., Silla, C.N., Jr.: Mltl: A multi-label approach for the tomek link undersampling algorithm. Neurocomputing 383, 95–105 (2020). https://doi.org/10.1016/j.neucom.2019.11.076
Read, J., Pfahringer, B., Holmes, G., et al.: Classifier chains for multi-label classification. Mach. Learn. 85, 333–359 (2011). https://doi.org/10.1007/s10994-011-5256-5
Roseberry, M., Krawczyk, B., Cano, A.: Multi-label punitive knn with self-adjusting memory for drifting data streams. ACM Trans. Knowl. Discov. Data (TKDD) 13(6), 1–31 (2019). https://doi.org/10.1145/3363573
Salman, M.A., Mahdi, M.A., Al-Janabi, S.: A gmee-wfed system: Optimizing wind turbine distribution for enhanced renewable energy generation in the future. Int. J. Comput. Intell. Syst. 17(1), 5 (2024). https://doi.org/10.1007/s44196-023-00391-7
Schultheis, E., Wydmuch, M., Babbar, R., et al.: On missing labels, long-tails and propensities in extreme multi-label classification. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 1547–1557, (2022). https://doi.org/10.1145/3534678.3539466
Sun, L., Ye, P., Lyu, G., et al.: Weakly-supervised multi-label learning with noisy features and incomplete labels. Neurocomputing 413, 61–71 (2020). https://doi.org/10.1016/j.neucom.2020.06.101
SzymaĹ, P., Kajdanowicz, T., et al.: scikit-multilearn: A python library for multi-label classification. J. Mach. Learn. Res. 20(6), 1–22 (2019)
Tan, A., Ji, X., Liang, J., et al.: Weak multi-label learning with missing labels via instance granular discrimination. Inf. Sci. 594, 200–216 (2022). https://doi.org/10.1016/j.ins.2022.02.011
Tsoumakas, G., Katakis, I., Vlahavas, I.: Random k-labelsets for multilabel classification. IEEE Trans. Knowl. Data Eng. 23(7), 1079–1089 (2010). https://doi.org/10.1007/978-3-540-74958-5_38
Wan, S., Duan, Y., Zou, Q.: Hpslpred: an ensemble multi-label classifier for human protein subcellular location prediction with imbalanced source. Proteomics 17(17–18), 1700262 (2017). https://doi.org/10.1002/pmic.201700262
Wu, B., Lyu, S., Ghanem, B.: Constrained submodular minimization for missing labels and class imbalance in multi-label learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, (2016). https://doi.org/10.1609/aaai.v30i1.10186
Wu, T., Huang, Q., Liu, Z., et al.: Distribution-balanced loss for multi-label classification in long-tailed datasets. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, Springer, pp 162–178, (2020). https://doi.org/10.1007/978-3-030-58548-8_10
Zhang, J., Luo, Z., Li, C., et al.: Manifold regularized discriminative feature selection for multi-label learning. Pattern Recogn. 95, 136–150 (2019). https://doi.org/10.1016/j.patcog.2019.06.003
Zhang, J., Li, S., Jiang, M., et al.: Learning from weakly labeled data based on manifold regularized sparse model. IEEE Trans. Cybern. 52(5), 3841–3854 (2020). https://doi.org/10.1109/tcyb.2020.3015269
Zhang, K., Mao, Z., Cao, P., et al.: Label correlation guided borderline oversampling for imbalanced multi-label data learning. Knowl. Based Syst. 279(110), 938 (2023). https://doi.org/10.1016/j.knosys.2023.110938
Zhang, M.L., Zhou, Z.H.: Ml-knn: a lazy learning approach to multi-label learning. Pattern Recogn. 40(7), 2038–2048 (2007). https://doi.org/10.1016/j.patcog.2006.12.019
Zhang, M.L., Zhou, Z.H.: A review on multi-label learning algorithms. IEEE Trans. Knowl. Data Eng. 26(8), 1819–1837 (2013). https://doi.org/10.1109/tkde.2013.39
Zhang, M.L., Li, Y.K., Liu, X.Y., et al.: Binary relevance for multi-label learning: an overview. Front. Comput. Sci. 12, 191–202 (2018). https://doi.org/10.1007/s11704-017-7031-7
Zhang, X., Ng, M.K.: A fast algorithm for solving linear inverse problems with uniform noise removal. J. Sci. Comput. 79(2), 1214–1240 (2019). https://doi.org/10.1007/s10915-018-0888-2
Zhu, P., Xu, Q., Hu, Q., et al.: Multi-label feature selection with missing labels. Pattern Recogn. 74, 488–502 (2018). https://doi.org/10.1016/j.patcog.2017.09.036
Acknowledgements
The authors would like to express our sincere gratitude and appreciation to the contributions throughout the course of this research paper.
Funding
Jian Mao reports financial support was provided by Xiamen Science and Technology Subsidy Project (2023CXY0318). Kai Huang reports financial support was provided by Natural Science Foundation (3502Z202372018) of Xiamen, China and Department of Education (JAT232012) of Fujian Province of China. Jingming Liu reports financial support was provided by Natural Science Foundation (2023J01800) of Fujian Province of China.
Author information
Authors and Affiliations
Contributions
The authors confirm their contribution to the paper as follows: study conception and design: Jian Mao, Kai Huang and Jinming Liu; data collection: Jinming Liu; analysis and interpretation of results: Jian Mao and Kai Huang; draft manuscript preparation: Jian Mao, Kai Huang and Jinming Liu. All authors reviewed the results and approved the final version of the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors have no conflict of interest to declare that are relevant to the content of this article.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Mao, J., Huang, K. & Liu, J. MLAWSMOTE: Oversampling in Imbalanced Multi-label Classification with Missing Labels by Learning Label Correlation Matrix. Int J Comput Intell Syst 17, 205 (2024). https://doi.org/10.1007/s44196-024-00607-4
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s44196-024-00607-4