1. Introduction
A support vector machine (SVM) is a supervised machine learning model that has been widely applied in pattern recognition [1], including text classification [2,3], face recognition [4,5], radar [6], and sonar [7]. The basic idea of the SVM is to maximize the minimum margin from the samples to the classification hyperplane. Building on the SVM, a number of variants, including discriminative classifiers based on large-margin theory, have been proposed to improve the SVM or overcome its limitations [8,9,10,11,12,13].
For classification tasks, the standard SVM seeks a hyperplane that separates the classes with a maximal margin. However, traditional SVMs mainly consider the separability of boundary points, while the underlying structural information of the data is commonly ignored. In real-world applications, different datasets may follow different distributions, and from a statistical perspective this structural information is a key factor. Breiman argued this point and showed that maximizing the minimum margin is not the key to model generalization [14,15]. Reyzin later found that margin theory remains helpful for generalization, but that the margin distribution appears more dominant [16]. A classifier is therefore expected to capture the structure or distribution information of the data, so that a more reasonable discriminant boundary can be obtained on complex structured datasets. Gao proved that the margin mean and margin variance have an essential influence on the generalization performance of a classifier [17]. Subsequently, the large margin distribution machine (LDM) and its modified version, the optimal margin distribution machine (ODM), were proposed to maximize the margin mean and minimize the margin variance [18,19]. Considering the sensitivity to the number of samples and the tendency to produce an imbalanced margin distribution, Cheng incorporated the statistical characteristics of the marginal distribution and constructed a double distribution support vector machine (DDSVM) [20]. Owing to their use of sample distribution information, these improved SVMs have shown superior performance [21,22,23,24,25,26,27].
One approach is to introduce structural information into the SVM. Belkin et al. [28] proposed the Laplacian support vector machine (LapSVM), which constructs a Laplacian matrix of the manifold structure of the dataset and embeds a manifold regularization term into the SVM, for semi-supervised learning tasks. Based on this, the structured large margin machine (SLMM) [29] was proposed to capture structural information through clustering techniques, and it has proved to be sensitive to the data distribution. However, the SLMM is optimized by second-order cone programming (SOCP), which incurs a large computational cost. Other research has improved the SVM from the perspective of the objective function, the most representative method being the structural regularized support vector machine (SRSVM) [30]. Like the SLMM, the SRSVM obtains structural information by clustering, but it integrates this information directly into the objective function of the traditional SVM rather than into the constraints, so it can still be solved by quadratic programming. Along this line, an SVM with minimum within-class scatter (WCS-SVM) was proposed to combine minimum within-class scatter with the SVM [31], and it was further extended to a fuzzy version, the FSVM with minimum within-class scatter (WCS-FSVM) [32]. To enhance discriminative ability, Zhang introduced Fisher regularization into the SVM to form the Fisher regularized support vector machine (FisherSVM) [33], which minimizes the within-class scatter.
Overall, structural SVMs have matured to the point where they can exploit structural information in the data and improve the generalization capacity of the model. A classification model is usually expected to mine structural information explicitly, so that the resulting model is sensitive to the data structure and achieves a general improvement. Motivated by the above analysis, a novel pattern recognition classifier, the support vector machine based on earth mover's distance (EMD-SVM), is proposed to learn the distribution information between classes automatically. Specifically, we use the earth mover's distance [34] to capture the structural information of the data explicitly; this information is then embedded into the SVM as a regularization term of the objective function, which is optimized by quadratic programming. We also extend the EMD-SVM formulation from linear classification to the nonlinear case. Considering the great success and state-of-the-art performance of deep neural networks in machine vision and signal processing [35,36,37,38,39,40,41], we replace the fully connected layers of a standard CNN with the SVM for classification tasks [42,43,44,45,46], which improves the recognition performance and generalization ability of the CNN.
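As a rough illustration of the quantity the proposed regularization term is built on, the sketch below computes the earth mover's (Wasserstein-1) distance between two equally weighted one-dimensional samples, where the optimal transport plan reduces to matching sorted values. This is only a 1-D illustration: the EMD-SVM itself measures the distance between multi-dimensional class distributions and optimizes it jointly with the SVM objective by QP.

```python
import numpy as np

def emd_1d(x, y):
    """Earth mover's (Wasserstein-1) distance between two equally
    weighted 1-D samples of the same size: with uniform weights, the
    optimal transport plan simply matches sorted values."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    assert x.shape == y.shape
    return float(np.mean(np.abs(x - y)))

# Shifting a sample by a constant c shifts the EMD by exactly c.
pos = np.array([0.0, 1.0, 2.0, 3.0])
neg = pos + 2.0
print(emd_1d(pos, neg))  # → 2.0
```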
In light of this, the proposed EMD-SVM can be regarded as an improved generalization of the standard SVM. The main contributions of this study can be summarized as follows:
- (1)
We propose a new strategy to capture the underlying data structural information and thus improve the SVM classifier.
- (2)
The principles of the EMD-SVM in both the linear and nonlinear cases are discussed in detail. The resulting problem is proved to be convex and can be solved by the QP technique.
- (3)
We conduct experimental verification on three kinds of classification datasets, including UCI, image recognition, and radar emitter recognition, which have shown that the performance of the proposed EMD-SVM is superior and robust.
The rest of this paper is organized as follows. Section 2 briefly describes the SVM and the earth mover's distance (EMD). The proposed EMD-SVM is introduced in Section 3, followed by numerical results in Section 4. Section 5 presents the conclusions.
4. Experimental Results and Discussion
In this section, the EMD-SVM is evaluated on synthetic and real-world datasets. We compared the performance of the proposed EMD-SVM with the standard SVM and several representative large margin methods, including SRSVM, LDM, ODM, and ELM [48]. We first evaluated the effectiveness of the proposed EMD-SVM on a synthetic dataset to illustrate the impact of data distribution information on classification. Then, we evaluated the performance of these methods on UCI datasets and the Caltech101 dataset. Next, we utilized a deep convolutional neural network to extract convolutional features and discuss the performance of EMD-SVM based on deep convolutional features. Finally, the proposed EMD-SVM was applied to the radar emitter recognition task. All the experiments were carried out on a PC with a 3.50 GHz CPU and 48 GB RAM.
4.1. Recognition Performance of EMD-SVM on Synthetic Dataset
The two-dimensional synthetic dataset consists of three groups of randomly generated Gaussian distributions. The blue plus signs represent the positive samples and the red stars represent the negative samples. Table 1 describes the attributes of the dataset. The hyperplanes of the linear SVM, LDM, and EMD-SVM are displayed in Figure 1.
As can be seen from Figure 1, the positive class has a vertical distribution, while the negative class is composed of two horizontal Gaussian distributions, and distribution N1 has more samples than N2. Because it neglects the structural information, the SVM cannot cope with a complex structured dataset: it ignores cluster N2, which has fewer samples than cluster N1, and its hyperplane focuses only on the separability between cluster P and cluster N1. The LDM adopts the margin mean and variance to characterize the margin distribution and optimizes them to achieve better generalization. By considering both the structured distance information and the separability between the two distributions, the proposed EMD-SVM obtains a more reasonable hyperplane.
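A qualitative version of this experiment can be reproduced with scikit-learn. The cluster means, covariances, and sample counts below are illustrative assumptions, since the exact parameters of the synthetic dataset are given in Table 1 rather than here.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Hypothetical cluster parameters: one elongated positive cluster P
# (vertical) and two horizontal negative clusters, N1 (large) and N2 (small).
P  = rng.multivariate_normal([0, 0],  [[0.1, 0], [0, 2.0]], 200)
N1 = rng.multivariate_normal([3, 1],  [[2.0, 0], [0, 0.1]], 150)
N2 = rng.multivariate_normal([3, -3], [[2.0, 0], [0, 0.1]], 30)

X = np.vstack([P, N1, N2])
y = np.array([1] * len(P) + [-1] * (len(N1) + len(N2)))

clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("hyperplane: %.2f*x1 + %.2f*x2 + %.2f = 0" % (w[0], w[1], b))
print("train accuracy:", clf.score(X, y))
```

Plotting the three clusters together with the learned hyperplane reproduces the qualitative behavior discussed above for the standard SVM.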
4.2. Recognition Performance of EMD-SVM with Hand-Crafted Classifier on UCI Datasets
In this section, we verify the performance of the proposed EMD-SVM on UCI datasets, whose attributes are presented in Table 2.
We randomly selected half of the samples as the training set and the rest as the testing set. In the linear case, for ODM, the parameter D and the regularization parameters C1 and C2 are selected from their candidate sets; for SVM, SRSVM, EMD-SVM, and LDM, the parameter C and the remaining trade-off parameters are selected from their candidate sets. For ELM, the number of hidden neurons is set to 1000 and the activation function is the sigmoid. In the nonlinear case, the RBF kernel is used for all algorithms, with its width selected from the candidate set. Experiments were repeated 10 times with different data partitions.
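The per-parameter grid search described above can be sketched with scikit-learn's cross-validated search. The dataset and candidate grids below are illustrative, since the exact candidate sets are not reproduced in this excerpt.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in binary dataset; half for training, half for testing, as above.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.5, random_state=0)

# Illustrative candidate grids (the paper's exact sets are not shown here).
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])
grid = {"svc__C": [2.0 ** k for k in range(-5, 6)],
        "svc__gamma": [2.0 ** k for k in range(-5, 6)]}
search = GridSearchCV(pipe, grid, cv=5).fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("test accuracy: %.3f" % search.score(X_te, y_te))
```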
We compared the average accuracy of all the algorithms. Table 3 shows the accuracy results with the linear kernel and Table 4 shows the accuracy results with the RBF kernel.
From the results, we can draw the following conclusions,
- (1)
The EMD-SVM combines the earth mover’s distance with standard SVM, which can introduce the data distribution information into the traditional SVM. The outstanding performance of EMD-SVM on most datasets further validates the necessity of distribution information for the classifier’s design.
- (2)
Although SRSVM can achieve comparable recognition results with EMD-SVM, its recognition performance is highly affected by the clustering method, as SRSVM is based on the clustering structure. In practical applications, different clustering methods must be used for different problems.
- (3)
LDM and ODM use the margin mean and variance to describe the margin distribution, but such first- and second-order statistics are best suited to characterizing Gaussian-distributed data, which is a limitation. In contrast, EMD-SVM adopts the EMD rather than the Euclidean distance to describe the data distribution. The distribution information is then incorporated into the SVM objective function as a regularization term, guiding the SVM to learn the optimal classification boundary under this distribution metric.
4.3. Experiments on Caltech101 Dataset
In this subsection, we conduct an experiment on the Caltech101 dataset. Caltech101 is a digital image dataset provided by the California Institute of Technology; it contains a total of 9146 images divided into 101 object categories (including faces, planes, animals, etc.) and a background category. We chose nine types of images for this experiment: airplanes, bonsai, cars, dolphins, electric guitars, easy faces, helicopters, leopards, and motorbikes. SIFT, LBP, and PHOG features are extracted from these images, and their attributes are presented in Table 5.
We randomly selected 80 images from each category, 64 of them as training samples and the remaining 16 as test samples. Ten independent experiments were conducted to evaluate the performance of the proposed EMD-SVM. We used the linear kernel, and the parameters were selected in a similar way as in the UCI dataset experiments. For multi-class problems, the one-vs-one strategy is adopted. We compared the average accuracy of all the algorithms; the results are shown in Table 6.
It is clear that the EMD-SVM achieves better accuracy than the SVM, SRSVM, LDM, ODM and ELM methods in the multi-class classification problem. This indicates that distribution information can help to determine a better discriminant boundary. Moreover, the performance of LDM and ODM on the Caltech101 dataset further shows that characterizing the data distribution with first- and second-order statistics still has some limitations.
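The one-vs-one strategy used in these multi-class experiments trains k(k-1)/2 binary classifiers for k classes and votes among them; scikit-learn's SVC applies it internally, as the following sketch on stand-in 10-class data shows (the Caltech101 features themselves are not reproduced here).

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in 10-class data. With decision_function_shape="ovo", SVC exposes
# the k*(k-1)/2 pairwise decision values it votes over: 45 for k = 10.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)

clf = SVC(kernel="linear", C=1.0, decision_function_shape="ovo").fit(X_tr, y_tr)
print("pairwise classifiers:", clf.decision_function(X_te).shape[1])  # → 45
print("test accuracy: %.3f" % clf.score(X_te, y_te))
```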
4.4. Recognition Performance of EMD-SVM Based on Deep Convolutional Features
In this section, we discuss the performance of EMD-SVM and the other algorithms on deep convolutional features. We adopted the classical AlexNet as the pretrained CNN model, which contains five convolutional layers and three fully connected layers; further details of the network can be found in [40]. The DSLR and Amazon datasets were used to verify the effectiveness of EMD-SVM on deep features. The CNN model was pretrained on the ImageNet dataset and fine-tuned on the DSLR and Amazon datasets. We then extracted the fine-tuned deep features Fc6 and Fc7 as the inputs of the above algorithms for classification. Table 7 shows the details of the four deep features.
In the experiment, we randomly chose 50% of the samples as the training set and the rest as the testing set. Ten independent experiments were conducted to obtain a more stable result. A linear kernel was used, and the parameters were selected in a similar way as in the UCI dataset experiments. Table 8 compares the accuracy results of EMD-SVM and the other methods.
As can be seen, the overall performance of EMD-SVM is better than that of the SVM and the other methods. In addition, as the large margin algorithms LDM and ODM apply the ideas of maximizing the margin mean and minimizing the margin variance to the SVM model, they are also highly competitive with the SVM. The results demonstrate that considering the data distribution can improve the classifier's performance on complex data.
Additionally, we compared EMD-SVM with an MLP with two hidden layers of 1024 and 512 neurons, respectively. The accuracies of EMD-SVM and the MLP are shown in Table 9. The MLP achieves recognition results comparable to those of the linear EMD-SVM, but still somewhat inferior to the nonlinear EMD-SVM. Compared with the MLP, EMD-SVM is based on structural risk minimization rather than empirical risk minimization, which helps it avoid overfitting. By obtaining a structured description of the data distribution, it reduces the requirements on the size and distribution of the data and offers excellent generalization capability.
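A baseline MLP with the two hidden layers described above (1024 and 512 neurons) can be sketched with scikit-learn; the data here is a stand-in, since the deep features are not available in this excerpt.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# MLP baseline with two hidden layers of 1024 and 512 neurons, as in the
# comparison above; stand-in 10-class data.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.5, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(1024, 512), max_iter=300,
                    random_state=0).fit(X_tr, y_tr)
print("test accuracy: %.3f" % mlp.score(X_te, y_te))
```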
4.5. Recognition Performance of EMD-SVM for Radar Emitter Recognition
In order to test the effectiveness of the EMD-SVM in realistic applications, we conducted experiments on radar emitter recognition. The collected data are radar emitter signals of the same type and parameters. We extracted the FFT, Welch power spectrum, ambiguity function slice, and cyclic spectrum slice features (denoted as Data1, Data2, Data3, and Data4, respectively). The attributes of these datasets are presented in Table 10. The corresponding waveforms of the class 1 to class 4 signals in Data4 are shown in Figure 2.
To reduce the computation time, PCA was used to retain 90% of the energy. We randomly chose 80% of the samples as the training set and the remaining 20% as the test set. The experiment was repeated 10 times to generate 10 independent results for each dataset, and we compared the average accuracy and standard deviation of all the algorithms. A linear kernel was used, and the parameters were selected in a similar way to the UCI dataset experiments. The results on the four radar datasets are shown in Table 11; the EMD-SVM again achieves superior results.
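The PCA preprocessing step that retains 90% of the energy can be sketched as follows; the stand-in feature matrix is an assumption, since the radar data itself is not available here.

```python
import numpy as np
from sklearn.decomposition import PCA

# Retain the smallest number of principal components explaining 90% of
# the variance ("energy"), as in the preprocessing step above.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64)) @ rng.normal(size=(64, 64))  # correlated features

pca = PCA(n_components=0.9)
X_red = pca.fit_transform(X)

print("reduced dimensionality:", X_red.shape[1])
print("explained variance ratio: %.3f" % pca.explained_variance_ratio_.sum())
```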