1. Introduction
A support vector machine (SVM) is a supervised machine learning model that has been widely applied in pattern recognition [1], including text classification [2,3], face recognition [4,5], radar [6], and sonar [7]. The basic idea of the SVM is to maximize the minimum margin from the samples to the classification hyperplane. Building on the SVM, a number of variants, including discriminative classifiers based on large-margin theory, have been proposed to improve the SVM or overcome its limitations [8,9,10,11,12,13].
For classification tasks, the standard SVM seeks a hyperplane that separates the classes with a maximal margin. However, traditional SVMs mainly consider the separability of boundary points, while the underlying structural information of the data is commonly ignored. In real-world applications, different datasets may follow different distributions, and from a statistical perspective this structural information is a key factor. Breiman argued this point and showed that maximizing the minimum margin is not the key to model generalization [14,15]. Reyzin later found that margin theory remains helpful for generalization, but that the margin distribution appears more dominant [16]. A classifier is therefore expected to capture the structure or distribution information of the data, so that a more reasonable discriminant boundary can be obtained on complex structured datasets. Gao proved that the margin mean and margin variance have an essential influence on the generalization performance of a classifier [17]. Subsequently, the large margin distribution machine (LDM) and its modified version, the optimal margin distribution machine (ODM), were proposed to maximize the margin mean and minimize the margin variance [18,19]. Considering the sensitivity to the number of samples and the tendency to produce an imbalanced margin distribution, Cheng incorporated the statistical characteristics of the marginal distribution and constructed a double distribution support vector machine (DDSVM) [20]. Owing to their use of sample distribution information, these improved SVMs have shown superior performance [21,22,23,24,25,26,27].
One approach is to introduce structural information into the SVM. Belkin et al. [28] proposed the Laplacian support vector machine (LapSVM), which constructs a Laplacian matrix of the manifold structure of the dataset and embeds a manifold regularization term into the SVM, for semi-supervised learning tasks. Based on this, the structured large margin machine (SLMM) [29] was proposed to capture structural information through clustering techniques, and it has proved to be sensitive to the data distribution. However, the SLMM is optimized by second-order cone programming (SOCP), which incurs a large computational cost. Other research has improved the SVM from the perspective of the objective function, the most representative method being the structural regularized support vector machine (SRSVM) [30]. Like the SLMM, the SRSVM obtains structural information by clustering, but it integrates this information directly into the objective function of the traditional SVM rather than into the constraints, so it can still be solved by quadratic programming. Along this line, an SVM with minimum within-class scatter (WCS-SVM) was proposed to combine minimum within-class scatter with the SVM [31], and it was further extended to a fuzzy version, the FSVM with minimum within-class scatter (WCS-FSVM) [32]. To enhance discriminative ability, Zhang introduced Fisher regularization into the SVM to form the Fisher regularized support vector machine (FisherSVM) [33], which minimizes the within-class scatter.
Overall, structural SVMs have matured to the point where they can exploit structural information in the data and improve the generalization capacity of the model. A classification model is usually expected to mine structural information explicitly, so that the resulting model is sensitive to the data structure and achieves a general improvement. Motivated by the above analysis, a novel pattern recognition classifier, the support vector machine based on earth mover's distance (EMD-SVM), is proposed to learn the distribution information between classes automatically. Specifically, we use the earth mover's distance [34] to capture the structural information of the data explicitly; this information is then embedded into the SVM as a regularization term of the objective function, which is optimized by quadratic programming. We also extend the EMD-SVM formulation from linear classification to the nonlinear case. Considering the great success and state-of-the-art performance of deep neural networks in machine vision and signal processing [35,36,37,38,39,40,41], we replace the fully connected layers of a standard CNN with the SVM for classification tasks [42,43,44,45,46], which improves the recognition performance and generalization ability of the CNN.
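As a rough illustration of the quantity the proposed regularization term is built on, the sketch below computes the earth mover's (Wasserstein-1) distance between two equally weighted one-dimensional samples, where the optimal transport plan reduces to matching sorted values. This is only a 1-D illustration: the EMD-SVM itself measures the distance between multi-dimensional class distributions and optimizes it jointly with the SVM objective by QP.

```python
import numpy as np

def emd_1d(x, y):
    """Earth mover's (Wasserstein-1) distance between two equally
    weighted 1-D samples of the same size: with uniform weights, the
    optimal transport plan simply matches sorted values."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    assert x.shape == y.shape
    return float(np.mean(np.abs(x - y)))

# Shifting a sample by a constant c shifts the EMD by exactly c.
pos = np.array([0.0, 1.0, 2.0, 3.0])
neg = pos + 2.0
print(emd_1d(pos, neg))  # → 2.0
```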
In light of this, the proposed EMD-SVM can be regarded as an improved generalization of the standard SVM. The main contributions of this study can be summarized as follows:
- (1)
We propose a new strategy to capture the underlying data structural information and thus improve the SVM classifier.
- (2)
The principles of the EMD-SVM in both the linear and nonlinear cases are discussed in detail. The resulting problem is proved to be convex and can be solved by the QP technique.
- (3)
We conduct experimental verification on three kinds of classification datasets, including UCI, image recognition, and radar emitter recognition, which have shown that the performance of the proposed EMD-SVM is superior and robust.
The rest of this paper is organized as follows. Section 2 briefly describes the SVM and the earth mover's distance (EMD). The proposed EMD-SVM is introduced in Section 3, followed by numerical results in Section 4. Section 5 presents the conclusions.
4. Experimental Results and Discussion
In this section, the EMD-SVM is evaluated on synthetic and real-world datasets. We compared the performance of the proposed EMD-SVM with the standard SVM and several representative large margin methods, including SRSVM, LDM, ODM, and ELM [48]. We first evaluated the effectiveness of the proposed EMD-SVM on a synthetic dataset to illustrate the impact of data distribution information on classification. Then, we evaluated the performance of these methods on UCI datasets and the Caltech101 dataset. Next, we utilized a deep convolutional neural network to extract convolutional features and discuss the performance of EMD-SVM based on deep convolutional features. Finally, the proposed EMD-SVM was applied to the radar emitter recognition task. All the experiments were carried out on a PC with a 3.50 GHz CPU and 48 GB RAM.
4.1. Recognition Performance of EMD-SVM on Synthetic Dataset
The two-dimensional synthetic dataset consists of three groups of randomly generated Gaussian distributions. The blue plus signs represent the positive samples and the red stars represent the negative samples. Table 1 describes the attributes of the dataset. The hyperplanes of the linear SVM, LDM, and EMD-SVM are displayed in Figure 1.
As can be seen from Figure 1, the positive class has a vertical distribution, while the negative class is composed of two horizontal Gaussian distributions, and distribution N1 has more samples than N2. Because it neglects the structural information, the SVM cannot cope with a complex structured dataset: it ignores cluster N2, which has fewer samples than cluster N1, and its hyperplane focuses only on the separability between cluster P and cluster N1. The LDM adopts the margin mean and variance to characterize the margin distribution and optimizes them to achieve better generalization. By considering both the structured distance information and the separability between the two distributions, the proposed EMD-SVM obtains a more reasonable hyperplane.
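A qualitative version of this experiment can be reproduced with scikit-learn. The cluster means, covariances, and sample counts below are illustrative assumptions, since the exact parameters of the synthetic dataset are given in Table 1 rather than here.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Hypothetical cluster parameters: one elongated positive cluster P
# (vertical) and two horizontal negative clusters, N1 (large) and N2 (small).
P  = rng.multivariate_normal([0, 0],  [[0.1, 0], [0, 2.0]], 200)
N1 = rng.multivariate_normal([3, 1],  [[2.0, 0], [0, 0.1]], 150)
N2 = rng.multivariate_normal([3, -3], [[2.0, 0], [0, 0.1]], 30)

X = np.vstack([P, N1, N2])
y = np.array([1] * len(P) + [-1] * (len(N1) + len(N2)))

clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("hyperplane: %.2f*x1 + %.2f*x2 + %.2f = 0" % (w[0], w[1], b))
print("train accuracy:", clf.score(X, y))
```

Plotting the three clusters together with the learned hyperplane reproduces the qualitative behavior discussed above for the standard SVM.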
4.2. Recognition Performance of EMD-SVM with Hand-Crafted Classifier on UCI Datasets
In this section, we verify the performance of the proposed EMD-SVM on UCI datasets, whose attributes are presented in Table 2.
We randomly selected half of the samples as the training set and the rest as the testing set. In the linear case, for ODM, the parameter D and the regularization parameters C1 and C2 are selected from their candidate sets; for SVM, SRSVM, EMD-SVM, and LDM, the parameter C and the remaining trade-off parameters are selected from their candidate sets. For ELM, the number of hidden neurons is set to 1000 and the activation function is the sigmoid. In the nonlinear case, the RBF kernel is used for all algorithms, with its width selected from the candidate set. Experiments were repeated 10 times with different data partitions.
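The per-parameter grid search described above can be sketched with scikit-learn's cross-validated search. The dataset and candidate grids below are illustrative, since the exact candidate sets are not reproduced in this excerpt.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in binary dataset; half for training, half for testing, as above.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.5, random_state=0)

# Illustrative candidate grids (the paper's exact sets are not shown here).
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])
grid = {"svc__C": [2.0 ** k for k in range(-5, 6)],
        "svc__gamma": [2.0 ** k for k in range(-5, 6)]}
search = GridSearchCV(pipe, grid, cv=5).fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("test accuracy: %.3f" % search.score(X_te, y_te))
```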
We compared the average accuracy of all the algorithms. Table 3 shows the accuracy results with the linear kernel and Table 4 shows the accuracy results with the RBF kernel.
From the results, we can draw the following conclusions,
- (1)
The EMD-SVM combines the earth mover’s distance with standard SVM, which can introduce the data distribution information into the traditional SVM. The outstanding performance of EMD-SVM on most datasets further validates the necessity of distribution information for the classifier’s design.
- (2)
Although SRSVM can achieve comparable recognition results with EMD-SVM, its recognition performance is highly affected by the clustering method, as SRSVM is based on the clustering structure. In practical applications, different clustering methods must be used for different problems.
- (3)
LDM and ODM use the margin mean and variance to describe the margin distribution, but such first- and second-order statistics are best suited to characterizing Gaussian-distributed data, which is a limitation. In contrast, EMD-SVM adopts the EMD rather than the Euclidean distance to describe the data distribution. The distribution information is then incorporated into the SVM objective function as a regularization term, guiding the SVM to learn the optimal classification boundary under this distribution metric.
4.3. Experiments on Caltech101 Dataset
In this subsection, we conduct an experiment on the Caltech101 dataset. Caltech101 is a digital image dataset provided by the California Institute of Technology; it contains a total of 9146 images divided into 101 object categories (including faces, planes, animals, etc.) and a background category. We chose nine types of images for this experiment: airplanes, bonsai, cars, dolphins, electric guitars, easy faces, helicopters, leopards, and motorbikes. SIFT, LBP, and PHOG features are extracted from these images, and their attributes are presented in Table 5.
We randomly selected 80 images from each category, 64 of them as training samples and the remaining 16 as test samples. Ten independent experiments were conducted to evaluate the performance of the proposed EMD-SVM. We used the linear kernel, and the parameters were selected in a similar way as in the UCI dataset experiments. For multi-class problems, the one-vs-one strategy is adopted. We compared the average accuracy of all the algorithms; the results are shown in Table 6.
It is clear that the EMD-SVM achieves better accuracy than the SVM, SRSVM, LDM, ODM and ELM methods in the multi-class classification problem. This indicates that distribution information can help to determine a better discriminant boundary. Moreover, the performance of LDM and ODM on the Caltech101 dataset further shows that characterizing the data distribution with first- and second-order statistics still has some limitations.
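The one-vs-one strategy used in these multi-class experiments trains k(k-1)/2 binary classifiers for k classes and votes among them; scikit-learn's SVC applies it internally, as the following sketch on stand-in 10-class data shows (the Caltech101 features themselves are not reproduced here).

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in 10-class data. With decision_function_shape="ovo", SVC exposes
# the k*(k-1)/2 pairwise decision values it votes over: 45 for k = 10.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)

clf = SVC(kernel="linear", C=1.0, decision_function_shape="ovo").fit(X_tr, y_tr)
print("pairwise classifiers:", clf.decision_function(X_te).shape[1])  # → 45
print("test accuracy: %.3f" % clf.score(X_te, y_te))
```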
4.4. Recognition Performance of EMD-SVM Based on Deep Convolutional Features
In this section, we discuss the performance of EMD-SVM and the other algorithms on deep convolutional features. We adopted the classical AlexNet as the pretrained CNN model, which contains five convolutional layers and three fully connected layers; further details of the network can be found in [40]. The DSLR and Amazon datasets were used to verify the effectiveness of EMD-SVM on deep features. The CNN model was pretrained on the ImageNet dataset and fine-tuned on the DSLR and Amazon datasets. We then extracted the fine-tuned deep features Fc6 and Fc7 as the inputs of the above algorithms for classification. Table 7 shows the details of the four deep features.
In the experiment, we randomly chose 50% of the samples as the training set and the rest as the testing set. Ten independent experiments were conducted to obtain a more stable result. A linear kernel was used, and the parameters were selected in a similar way as in the UCI dataset experiments. Table 8 compares the accuracy results of EMD-SVM and the other methods.
As can be seen, the overall performance of EMD-SVM is better than that of the SVM and the other methods. In addition, as the large margin algorithms LDM and ODM apply the ideas of maximizing the margin mean and minimizing the margin variance to the SVM model, they are also highly competitive with the SVM. The results demonstrate that considering the data distribution can improve the classifier's performance on complex data.
Additionally, we compared EMD-SVM with an MLP with two hidden layers of 1024 and 512 neurons, respectively. The accuracies of EMD-SVM and the MLP are shown in Table 9. The MLP achieves recognition results comparable to those of the linear EMD-SVM, but still somewhat inferior to the nonlinear EMD-SVM. Compared with the MLP, EMD-SVM is based on structural risk minimization rather than empirical risk minimization, which helps it avoid overfitting. By obtaining a structured description of the data distribution, it reduces the requirements on the size and distribution of the data and offers excellent generalization capability.
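A baseline MLP with the two hidden layers described above (1024 and 512 neurons) can be sketched with scikit-learn; the data here is a stand-in, since the deep features are not available in this excerpt.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# MLP baseline with two hidden layers of 1024 and 512 neurons, as in the
# comparison above; stand-in 10-class data.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.5, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(1024, 512), max_iter=300,
                    random_state=0).fit(X_tr, y_tr)
print("test accuracy: %.3f" % mlp.score(X_te, y_te))
```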
4.5. Recognition Performance of EMD-SVM for Radar Emitter Recognition
In order to test the effectiveness of the EMD-SVM in realistic applications, we conducted experiments on radar emitter recognition. The collected data are radar emitter signals of the same type and parameters. We extracted the FFT, Welch power spectrum, ambiguity function slice, and cyclic spectrum slice features (denoted as Data1, Data2, Data3, and Data4, respectively). The attributes of these datasets are presented in Table 10. The corresponding waveforms of the class 1 to class 4 signals in Data4 are shown in Figure 2.
To reduce the computation time, PCA was used to retain 90% of the energy. We randomly chose 80% of the samples as the training set and the remaining 20% as the test set. The experiment was repeated 10 times to generate 10 independent results for each dataset, and we compared the average accuracy and standard deviation of all the algorithms. A linear kernel was used, and the parameters were selected in a similar way to the UCI dataset experiments. The results on the four radar datasets are shown in Table 11; the EMD-SVM again achieves superior results.
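The PCA preprocessing step that retains 90% of the energy can be sketched as follows; the stand-in feature matrix is an assumption, since the radar data itself is not available here.

```python
import numpy as np
from sklearn.decomposition import PCA

# Retain the smallest number of principal components explaining 90% of
# the variance ("energy"), as in the preprocessing step above.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64)) @ rng.normal(size=(64, 64))  # correlated features

pca = PCA(n_components=0.9)
X_red = pca.fit_transform(X)

print("reduced dimensionality:", X_red.shape[1])
print("explained variance ratio: %.3f" % pca.explained_variance_ratio_.sum())
```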