Abstract
We consider the classification problem when the input features are represented as matrices rather than vectors. To preserve the intrinsic structure for classification, a successful method is the support matrix machine (SMM) of Luo et al. (in: Proceedings of the 32nd international conference on machine learning, Lille, France, no 1, pp 938–947, 2015), which minimizes an objective function composed of a hinge loss plus a so-called spectral elastic net penalty. However, the question of how to extend SMM to multicategory classification remains open. Moreover, in practice the training data are often contaminated by outlying observations, which can undermine the robustness of existing matrix classification methods. In this paper, we address these issues by introducing a robust angle-based classifier, which reduces binary and multicategory problems to a unified framework. Benefiting from the use of truncated hinge loss functions, the proposed classifier achieves a certain degree of robustness to outliers. The underlying optimization model becomes nonconvex, but admits a natural DC (difference of two convex functions) representation. We develop a new and efficient algorithm by combining the DC algorithm with primal–dual first-order methods. The proposed DC algorithm adaptively chooses the accuracy of the subproblem at each iteration while guaranteeing the overall convergence of the algorithm. The primal–dual methods remove the difficulty caused by the linear operator in the subproblems and allow us to rely only on the proximal operators of the objective functions and on matrix–vector operations. This advantage allows us to solve large-scale problems efficiently. Theoretical and numerical results indicate that, for problems with potential outliers, our method can be highly competitive with existing methods.
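To make the algorithmic structure described above concrete, the following is a minimal Python sketch of a generic DC algorithm with an inexact inner solver, under the assumption that the objective can be written as \(g - h\) with \(g, h\) convex. The callables `grad_h` and `solve_subproblem` are hypothetical placeholders (not the authors' implementation), and the adaptive inner-tolerance schedule is purely illustrative rather than the rule analyzed in the paper.

```python
import numpy as np

def dca(M0, grad_h, solve_subproblem, max_outer=50, tol=1e-6):
    """Sketch of a DC algorithm for min_M g(M) - h(M), with g and h convex.

    At iteration t, h is linearized at the current iterate M^t and the convex
    subproblem  min_M g(M) - <grad_h(M^t), M>  is solved approximately, e.g.,
    by a primal-dual first-order method.
    """
    M = M0
    for t in range(max_outer):
        G = grad_h(M)                                # (sub)gradient of the concave part
        inner_tol = max(tol, 1.0 / (t + 1) ** 2)     # illustrative adaptive accuracy
        M_next = solve_subproblem(G, M, inner_tol)   # inexact convex subproblem solve
        if np.linalg.norm(M_next - M) <= tol * max(1.0, np.linalg.norm(M)):
            return M_next
        M = M_next
    return M
```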
References
Altun, K., Barshan, B.: Human activity recognition using inertial/magnetic sensor units. In: International Workshop on Human Behavior Understanding, pp. 38–51. Springer (2010)
An, L.T.H., Tao, P.D.: Solving a class of linearly constrained indefinite quadratic problems by DC algorithms. J. Glob. Optim. 11(3), 253–285 (1997)
Bauschke, H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd edn. Springer, Berlin (2017)
Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144–152. ACM (1992)
Boyd, S.: Alternating direction method of multipliers. In: Talk at NIPS Workshop on Optimization and Machine Learning (2011)
Cai, D., He, X., Wen, J.-R., Han, J., Ma, W.-Y.: Support tensor machines for text categorization. Technical Report (2006)
Cai, J.-F., Candès, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20(4), 1956–1982 (2010)
Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011)
Chambolle, A., Pock, T.: On the ergodic convergence rates of a first-order primal-dual algorithm. Math. Program. 159(1–2), 253–287 (2016)
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
Davis, D., Yin, W.: Faster convergence rates of relaxed Peaceman–Rachford and ADMM under regularity assumptions. Math. Oper. Res. 42, 783–805 (2014)
Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning. Springer Series in Statistics, 2nd edn. Springer, New York (2009)
He, B.S., Yuan, X.M.: On the \({O}(1/n)\) convergence rate of the Douglas–Rachford alternating direction method. SIAM J. Numer. Anal. 50, 700–709 (2012)
Hou, C., Nie, F., Zhang, C., Yi, D., Wu, Y.: Multiple rank multi-linear SVM for matrix data classification. Pattern Recognit. 47(1), 454–469 (2014)
Huber, P.J., Ronchetti, E.: Robust Statistics. Wiley Series in Probability and Statistics, 2nd edn. Wiley, Hoboken (2009)
Latala, R.: Some estimates of norms of random matrices. Proc. Am. Math. Soc. 133(5), 1273–1282 (2005)
Lee, Y., Lin, Y., Wahba, G.: Multicategory support vector machines: theory and application to the classification of microarray data and satellite radiance data. J. Am. Stat. Assoc. 99(465), 67–81 (2004)
Liu, Y.: Fisher consistency of multicategory support vector machines. In: Artificial Intelligence and Statistics, pp. 291–298 (2007)
Luo, L., Xie, Y., Zhang, Z., Li, W.-J.: Support matrix machines. In: Proceedings of the 32nd International Conference on Machine Learning, Lille, France, no. 1, pp. 938–947 (2015)
Mohri, M., Rostamizadeh, A., Talwalkar, A.: Foundations of Machine Learning. Adaptive Computation and Machine Learning Series. MIT Press, Cambridge (2012)
Monteiro, R.D.C., Svaiter, B.F.: Iteration-complexity of block-decomposition algorithms and the alternating direction method of multipliers. SIAM J. Optim. 23(1), 475–507 (2013)
Negahban, S., Wainwright, M.J.: Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann. Stat. 39(2), 1069–1097 (2011)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization. Kluwer Academic Publishers, Boston (2004)
Pirsiavash, H., Ramanan, D., Fowlkes, C.: Bilinear classifiers for visual recognition. In: Advances in Neural Information Processing Systems, pp. 1482–1490 (2009)
Rockafellar, R.T.: Convex Analysis, volume 28 of Princeton Mathematics Series. Princeton University Press, Princeton (1970)
Sun, H., Craig, B., Zhang, L.: Angle-based multicategory distance-weighted SVM. J. Mach. Learn. Res. 18(85), 1–21 (2017)
Tao, D., Li, X., Wu, X., Hu, W., Maybank, S.J.: Supervised tensor learning. Knowl. Inf. Syst. 13(1), 1–42 (2007)
Tran-Dinh, Q.: Proximal alternating penalty algorithms for constrained convex optimization. Comput. Optim. Appl. 72(1), 1–43 (2019)
Wright, S.J.: Coordinate descent algorithms. Math. Program. 151(1), 3–34 (2015)
Wu, Y., Liu, Y.: Robust truncated hinge loss support vector machines. J. Am. Stat. Assoc. 102(479), 974–983 (2007)
Yang, J., Zhang, D., Frangi, A.F., Yang, J.-Y.: Two-dimensional PCA: a new approach to appearance-based face representation and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 26(1), 131–137 (2004)
Zhang, C., Liu, Y.: Multicategory angle-based large-margin classification. Biometrika 101(3), 625–640 (2014)
Zhang, C., Liu, Y., Wang, J., Zhu, H.: Reinforced angle-based multicategory support vector machines. J. Comput. Gr. Stat. 25(3), 806–825 (2016)
Zhang, C., Pham, M., Fu, S., Liu, Y.: Robust multicategory support vector machines using difference convex algorithm. Math. Program. 169(1), 277–305 (2018)
Zhao, J., Yu, G., Liu, Y., et al.: Assessing robustness of classification using an angular breakdown point. Ann. Stat. 46(6B), 3362–3389 (2018)
Zhou, H., Li, L.: Regularized matrix regression. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 76(2), 463–483 (2014)
Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 67(2), 301–320 (2005)
Acknowledgements
The authors are grateful to the editor and the reviewers for their insightful comments that have significantly improved the article. Qian and Zou were supported in part by NNSF of China Grants 11690015, 11622104 and 11431006, and NSF of Tianjin 18JCJQJC46000. Tran-Dinh was supported in part by US NSF Grant DMS-1619884. Liu was supported in part by US NSF Grants IIS-1632951 and DMS-1821231, and NIH Grant R01GM126550.
Appendix: Proofs of technical results
In this appendix, we provide all the remaining proofs of the results presented in the main text.
1.1 Proof of Lemma 2: Lipschitz continuity and boundedness
Since \([a]_{+} = \max \left\{ 0, a\right\} = (a + \left| a\right| )/2\), the function \(P_t\) defined in (15) can be rewritten as \(P_t({\mathcal {A}}({\mathbf {z}})) = \Vert {\hat{{\mathcal {A}}}}{\mathbf {z}}+ \mu \Vert _1 + {\mathbf {d}}_t^{\top }{\mathbf {z}}\) for some matrix \({\hat{{\mathcal {A}}}}\) and vectors \(\mu \) and \({\mathbf {d}}_t\), where \({\mathbf {d}}_t \triangleq {\bar{{\mathbf {d}}}} - \lambda ^{-1}\text{ vec }\left( \nabla {{\varPsi }}({\mathbf {M}}^t)\right) \). Moreover, \({\varPsi }\) is Lipschitz continuous by its definition, which implies that \(\nabla {{\varPsi }}({\mathbf {M}}^t)\) is uniformly bounded; that is, there exists a constant \(C_0\in (0, +\infty )\) such that \(\left\| \nabla {{\varPsi }}({\mathbf {M}}^t)\right\| _F \le C_0\) for all \({\mathbf {M}}^t\in {\mathbb {R}}^{(K-1)\times pq}\). As a consequence, \(P_t\) is Lipschitz continuous with a uniform constant \(L_0\) independent of t, i.e., \(\vert P_t({\mathbf {u}}) - P_t({\widehat{{\mathbf {u}}}})\vert \le L_0\Vert {\mathbf {u}}- {\widehat{{\mathbf {u}}}}\Vert _F\) for all \({\mathbf {u}}, {\widehat{{\mathbf {u}}}}\). The boundedness of the domain \(\mathrm {dom}(P_t^{*})\) of the conjugate \(P_t^{*}\) follows from [3, Corollary 17.19]. \(\square \)
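As a quick numerical sanity check of the two facts used in this proof, namely the identity \([a]_+ = (a+|a|)/2\) and the Lipschitz continuity of a function combining an \(\ell _1\) term with a linear term, here is a small self-contained Python sketch. The map \({\mathbf {u}}\mapsto \Vert {\mathbf {u}}+\mu \Vert _1 + {\mathbf {d}}^\top {\mathbf {u}}\) is only a simplified stand-in for \(P_t\): the point is that the \(\ell _1\) part contributes at most \(\sqrt{n}\) and the linear part at most \(\Vert {\mathbf {d}}\Vert \) to the Lipschitz constant, so the constant stays uniform as long as \(\Vert {\mathbf {d}}_t\Vert \) is bounded. Dimensions and random data are arbitrary, and \(L_0 = \sqrt{n} + \Vert {\mathbf {d}}\Vert _2\) is one valid (not necessarily tight) constant.

```python
import numpy as np

rng = np.random.default_rng(0)

# Identity [a]_+ = (a + |a|)/2 used to rewrite the hinge terms.
a = rng.normal(size=1000)
assert np.allclose(np.maximum(a, 0.0), (a + np.abs(a)) / 2)

# Lipschitz continuity of u -> ||u + mu||_1 + d^T u with a uniform constant.
n = 50
mu, d = rng.normal(size=n), rng.normal(size=n)
L0 = np.sqrt(n) + np.linalg.norm(d)        # valid Lipschitz constant (not tight)

def P(u):
    return np.linalg.norm(u + mu, 1) + d @ u

for _ in range(10_000):
    u, v = rng.normal(size=n), rng.normal(size=n)
    assert abs(P(u) - P(v)) <= L0 * np.linalg.norm(u - v) + 1e-9
```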
1.2 Proof of Lemma 3: convergence of the primal–dual methods
Let \({\mathcal {G}}({\mathbf {M}}, {\mathbf {Y}}) = Q_t({\mathbf {M}}) + \langle {\mathcal {A}}({\mathbf {M}}), {\mathbf {Y}}\rangle - P_t^{*}({\mathbf {Y}})\), where \(P_t^{*}\) is the Fenchel conjugate of \(P_t\). Applying [9, Theorem 4] with \(f = 0\), for any \({\mathbf {M}}\) and \({\mathbf {Y}}\), we have
where \(T_l = \sum _{i=1}^{l}\frac{\sigma ^t_{i-1}}{\sigma ^t_0}\), and \({\overline{{\mathbf {Y}}}}^t_l = \frac{1}{T_l}\sum _{j=1}^l\frac{\sigma ^t_{j-1}}{\sigma _0^t}{\mathbf {Y}}^t_j\).
By the update rule in (18), we have \(\omega ^t_{l+1}\sigma ^t_{l+1} = \omega ^t_l\sigma ^t_l\). Hence, by induction, we have \(\omega ^t_l\sigma ^t_l = \omega ^t_0\sigma ^t_0 = \left\| {\mathcal {A}}\right\| ^{-2}\). On the other hand, by [8, Lemma 2], with the choice of \(\lambda = \left\| {\mathcal {A}}\right\| ^{-1}(1+\rho _t)\), we have
Using this estimate and \(\sigma ^t_l = \left\| {\mathcal {A}}\right\| ^{-2}(\omega ^t_l)^{-1}\), we have
where \(c = \left\| {\mathcal {A}}\right\| (1+\rho _t)^{-1}\). Hence, we can estimate \(T_l\) as \(T_l \ge \tfrac{1}{2}(1+\rho _t+\left\| {\mathcal {A}}\right\| )^{-1}(1+\rho _t)l^2\). Using this estimate of \(T_l\), \(\sigma ^t_0 = \omega ^t_0 = \Vert {\mathcal {A}}\Vert ^{-1}\), and \({\widetilde{F}}_t({\mathbf {M}}^t_l) - {\widetilde{F}}_t({\overline{{\mathbf {M}}}}^{t+1}) \le {\mathcal {G}}({\mathbf {M}}^t_l, {\overline{{\mathbf {Y}}}}^{t+1}) - {\mathcal {G}}({\overline{{\mathbf {M}}}}^{t+1}, {\overline{{\mathbf {Y}}}}^t_l)\), we obtain from (28) that
This is exactly (20).
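For concreteness, the accelerated primal–dual scheme referenced above can be sketched in Python as follows. This is a generic sketch of the accelerated scheme of [8] applied to \(\min _{\mathbf {M}} Q_t({\mathbf {M}}) + P_t({\mathcal {A}}({\mathbf {M}}))\) with \(Q_t\) strongly convex, not the authors' code; the proximal operators, the operator \({\mathcal {A}}\), and its adjoint are passed in as placeholder callables. Note that the iteration keeps \(\tau _{l+1}\sigma _{l+1} = \tau _l\sigma _l\), which is exactly the step-size coupling used in the argument above.

```python
import numpy as np

def accelerated_primal_dual(prox_Q, prox_Pstar, A, At, normA, mu, M0, iters=200):
    """Sketch of an accelerated primal-dual method (cf. [8]) for
    min_M Q(M) + P(A(M)), with Q strongly convex with parameter mu.

    The step sizes satisfy tau_{l+1} * sigma_{l+1} = tau_l * sigma_l at every
    iteration, i.e., the coupling used in the proof of Lemma 3.
    """
    M = M_bar = M0
    Y = np.zeros_like(A(M0))
    tau = sigma = 1.0 / normA                        # tau_0 * sigma_0 * ||A||^2 = 1
    for _ in range(iters):
        Y = prox_Pstar(Y + sigma * A(M_bar), sigma)  # dual ascent step
        M_next = prox_Q(M - tau * At(Y), tau)        # primal descent step
        theta = 1.0 / np.sqrt(1.0 + 2.0 * mu * tau)  # acceleration factor
        tau, sigma = theta * tau, sigma / theta      # keeps tau * sigma constant
        M_bar = M_next + theta * (M_next - M)        # extrapolation
        M = M_next
    return M, Y
```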
Next, we prove (21). By introducing \({\mathbf {Y}}= {\mathcal {A}}({\mathbf {M}})\), we can reformulate the strongly convex subproblem (15) into the following constrained convex problem:
Note that \(Q_t\) is strongly convex with the strong convexity parameter \(1 + \rho _t\). We can apply [28, Algorithm 2] to solve (29). If we define
then, from the proof of [28, Theorem 2], we can show that
By Lemma 2, \(P_t\) is Lipschitz continuous with the Lipschitz constant \(L_0\). Then we have
Combining (30) and this estimate, we obtain
Similar to the proof of [28, Corollary 1], by using \(\sigma ^t_0 = \frac{1+\rho _t}{2\Vert {\mathcal {A}}\Vert ^2}\), the last inequality leads to
Combining the last two estimates, we obtain
which is exactly (21). \(\square \)
1.3 Proof of statistical properties
We provide the proofs of Theorems 3, 4, and 5 in this section.
1.3.1 Proof of Theorem 3: Fisher’s consistency
In our RMSMM (3), one can abstract the truncated hinge loss function as
Then, the conditional loss can be rewritten as
In [34, Theorem 1], it was shown that, for vector-valued data \({\mathbf {x}}\), the robust classifier based on the loss function \(\phi ({\mathbf {f}}({\mathbf {x}}),y)\) is Fisher consistent whenever \(\gamma \in \left[ 0,~\tfrac{1}{2}\right] \) and \(s\le 0\). After vectorizing the matrix data \({\mathbf {X}}\) into the vector \({\mathbf {x}}=\text{ vec }({\mathbf {X}})\), our setting coincides with that of Theorem 1 in [34]. Hence, the Fisher consistency result carries over naturally to matrix-type data. \(\square \)
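As a concrete illustration of the truncated hinge building block behind this argument, the Python sketch below implements the univariate truncation \(T_s(u) = [1-u]_+ - [s-u]_+\) used in the robust truncated hinge literature (e.g., [30]). This is only the scalar ingredient, not the full multicategory loss in (3), but it makes visible the two properties used throughout: the loss is bounded above by \(1-s\), and it is by construction a difference of two convex hinge functions.

```python
import numpy as np

def hinge(u, a=1.0):
    """H_a(u) = [a - u]_+, the standard hinge when a = 1."""
    return np.maximum(a - u, 0.0)

def truncated_hinge(u, s=-1.0):
    """T_s(u) = H_1(u) - H_s(u) with s <= 0: agrees with the hinge for u >= s,
    and is capped at the constant 1 - s for u < s, hence bounded.  The explicit
    difference of two convex hinges is the DC structure exploited by the DCA."""
    return hinge(u, 1.0) - hinge(u, s)

u = np.linspace(-5.0, 3.0, 9)
print(truncated_hinge(u, s=-1.0))   # values never exceed 1 - s = 2
```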
1.3.2 Proof of Theorem 4: misclassification rates
First, we introduce the Rademacher complexity. Let \({\mathscr {G}} = \{g:{\mathbf {X}}\times {\mathcal {Y}}\rightarrow {\mathbb {R}}\}\) be a class of loss functions. Given the sample \({\mathcal {T}}=\{({\mathbf {X}}_i,y_i)\}_{i=1}^N\), we define the empirical Rademacher complexity of \({\mathscr {G}}\) as
where \({\varvec{\sigma }} = \{\sigma _i\}_{i=1}^N\) are i.i.d. random variables with \(\Pr (\sigma _1=1)=\Pr (\sigma _1=-1)=1/2\). The Rademacher complexity of \({\mathscr {G}}\) is defined as
For our model, let
and
To prove Theorem 4, we first recall the following lemma, which bounds \(E\left[ {\mathbb {I}}_\kappa \left\{ h({\mathbf {X}},y)\right\} \right] \) in terms of the empirical error and the Rademacher complexity.
Lemma 4
For any \(h\in H\), with probability at least \(1-\delta \), we have
The proof of Lemma 4 can be found in [34].
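For intuition about the quantity appearing in Lemma 4, the empirical Rademacher complexity can be approximated by Monte Carlo over the Rademacher signs. The Python sketch below does this for a finite, toy function class; the class, data, and sample sizes are illustrative placeholders rather than the class \(H\) used in the theorem.

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_rademacher(scores, n_draws=2000):
    """Monte Carlo estimate of E_sigma[ sup_g (1/N) sum_i sigma_i g(X_i, y_i) ],
    where scores[j, i] = g_j(X_i, y_i) for a finite set of candidate functions g_j."""
    n_funcs, N = scores.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=N)   # i.i.d. Rademacher signs
        total += np.max(scores @ sigma) / N       # sup over the finite class
    return total / n_draws

# Toy example: 200 samples scored by 50 random functions bounded in [-1, 1].
scores = rng.uniform(-1.0, 1.0, size=(50, 200))
print(empirical_rademacher(scores))               # roughly O(sqrt(log(50) / 200))
```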
Now, we need to derive the upper bound of the Rademacher complexity used in Lemma 4. Since \({\mathbb {I}}_\kappa \) is \(\frac{1}{\kappa }\)-Lipschitz, we have
where \(\tilde{{\mathbf {X}}}_i\) denotes \({\mathbf {X}}_i-{\bar{{\mathbf {X}}}}\) and \({\bar{{\mathbf {X}}}}=N^{-1}\sum _{i=1}^N{\mathbf {X}}_i\). The first inequality is due to Lemma 4.2 in [20], and we also use the fact that the absolute values of the entries of \({\mathbf {w}}_y- {\mathbf {w}}_k\) are all bounded by 1.
First, by assumption, we can write \({\mathbf {X}}=E({\mathbf {X}})+{\mathbf {E}}\), where \(E({\mathbf {X}})=\sum _{k=1}^K \Pr ({\mathcal {Y}}=k){\mathbf {C}}_k\), and the variance and the fourth moment of the entries are \(\sigma ^2\) and \(\mu _4^4\), respectively. Accordingly, \(\tilde{{\mathbf {X}}}_i={\mathbf {E}}_i-{\bar{{\mathbf {E}}}}\), where \({\bar{{\mathbf {E}}}}=N^{-1}\sum _{i=1}^N{\mathbf {E}}_i\). Since \(\{({\mathbf {X}}_i,y_i)\}_{i=1}^N\) are i.i.d. copies of \(({\mathbf {X}},{\mathcal {Y}})\), we have
Because \(E[(\sum _{i=1}^N \sigma _i {\mathbf {E}}_i)^2]=N\sigma ^2\) and \(E[(\sum _{i=1}^N \sigma _i {\mathbf {E}}_i)^4]=N\mu _4^4 + 3N(N-1)\sigma ^4\), by Theorem 2 in [16] we have
where c is a constant which does not depend on \({\mathcal {T}}\). By similar arguments, it is easy to see that
Accordingly, we obtain the upper bound of the Rademacher complexity as
The proof is completed by using Lemma 4 with this bound and the fact that the continuous indicator function \({\mathbb {I}}_\kappa \) is an upper bound of the indicator function for any \(\kappa \). \(\square \)
1.3.3 Proof of Theorem 5: breakdown point analysis
Let \(F({\mathbf {M}},{\mathcal {T}})\) denote the loss function (3) with the sample \({\mathcal {T}}\), and
For the MSMM classifier, we can choose the contaminated observation as \(({\mathbf {X}}^o,k)\) with \(\text{ vec }({{\mathbf {X}}^o})^\top =-c{\mathbf {w}}_k^\top {\widehat{{\mathbf {M}}}}\). For any \({\mathbf {M}}\in {\varDelta }^+\), we have \({\mathbf {w}}_k^\top {\widehat{{\mathbf {M}}}} {\mathbf {M}}^\top {\mathbf {w}}_k > 0\), and hence \({\mathbf {w}}_k^\top {\mathbf {M}}\text{ vec }{({\mathbf {X}}^o)} = -c {\mathbf {w}}_k^\top {\widehat{{\mathbf {M}}}}{\mathbf {M}}^\top {\mathbf {w}}_k \rightarrow -\infty \) as \(c \rightarrow \infty \). Because the hinge loss of the MSMM is unbounded, the loss term corresponding to this contaminated observation then tends to infinity for every \({\mathbf {M}}\in {\varDelta }^+\). Hence, the minimizer satisfies \({\widetilde{{\mathbf {M}}}} \in {\varDelta }^-\), and the classifier breaks down.
For the RMSMM, since \({\widehat{{\mathbf {M}}}} \ne 0\), \({\widehat{{\mathbf {M}}}}\) is an interior point of \({\varDelta }^+\), and hence the claim
is true. Note that the loss function
is bounded above by \((K-1)(1-s)\). For any \(m \le n\epsilon _1/[2(1+\delta )(K-1)(1-s)]\), where \(\delta >0\) is an arbitrary positive constant, for any clean subset \({\mathcal {T}}_{n-m} \subset {\mathcal {T}}_n\) of size \(n-m\), and for any \({\mathbf {M}}\in {\mathbb {R}}^{p \times q}\), we have
Therefore,
and
The last inequality shows that \({\widetilde{{\mathbf {M}}}} \in {\varDelta }^+\), and thus the classifier does not break down when \(m \le n\epsilon _1/[2(1+\delta )(K-1)(1-s)]\) observations are contaminated. Finally, the proof is completed by letting \(\delta \rightarrow 0\). \(\square \)
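To make the role of the bound on \(m\) explicit, note that each loss term of the RMSMM is bounded above by \((K-1)(1-s)\). Hence, assuming the empirical risk in (3) averages the \(n\) loss terms (an assumption for this illustration), the total contribution of the \(m\) contaminated observations satisfies
\[
\frac{m\,(K-1)(1-s)}{n} \;\le \; \frac{\epsilon _1}{2(1+\delta )} \;<\; \frac{\epsilon _1}{2}
\qquad \text{whenever}\qquad
m \;\le \; \frac{n\,\epsilon _1}{2(1+\delta )(K-1)(1-s)},
\]
which is the slack that keeps the perturbed minimizer \({\widetilde{{\mathbf {M}}}}\) inside \({\varDelta }^+\) in the argument above.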
1.4 Derivation of Eq. (2): the dual problem
Lemma 5
For a \(p \times q\) real matrix \({\mathbf {A}}\), the subdifferential of the nuclear norm \(\Vert \cdot \Vert _{*}\) is given by
\[
\partial \Vert {\mathbf {A}}\Vert _{*} = \left\{ {\mathbf {U}}_{\mathbf {A}}{\mathbf {V}}_{\mathbf {A}}^\top + {\mathbf {Z}} : {\mathbf {Z}}\in {\mathbb {R}}^{p\times q},\ {\mathbf {U}}_{\mathbf {A}}^\top {\mathbf {Z}} = {\mathbf {0}},\ {\mathbf {Z}}{\mathbf {V}}_{\mathbf {A}} = {\mathbf {0}},\ \Vert {\mathbf {Z}}\Vert _2 \le 1 \right\} ,
\]
where \({\mathbf {U}}_{\mathbf {A}} {\varvec{{\varSigma }}}_{\mathbf {A}} {\mathbf {V}}_{\mathbf {A}}^\top \) is the SVD of \({\mathbf {A}}\), \(\Vert \cdot \Vert _2\) denotes the spectral norm, and \(\partial \) stands for the subdifferential operator.
Lemma 6
Suppose that \({\mathbf {X}} \in {\mathbb {R}}^{p \times q}\) and \(\partial G({\mathbf {X}})= \rho {\mathbf {X}} - {\mathbf {P}} + \tau \partial \Vert {\mathbf {X}} \Vert _* \), where \({\mathbf {P}} \in {\mathbb {R}}^{p \times q}\) is a constant matrix w.r.t. \({\mathbf {X}}\). Let the SVD of \({\mathbf {P}}\) be partitioned as
\[
{\mathbf {P}} = {\mathbf {U}}_0 \varvec{{\varSigma }}_0 {\mathbf {V}}_0^\top + {\mathbf {U}}_1 \varvec{{\varSigma }}_1 {\mathbf {V}}_1^\top ,
\]
where \(\varvec{{\varSigma }}_0\) contains the singular values of \({\mathbf {P}}\) which are greater than \(\tau \), and \(\varvec{{\varSigma }}_1\) contains the rest. Then, we have \({\mathbf {0}} \in \partial G({\mathbf {X}}^*)\), where \({\mathbf {X}}^*=\rho ^{-1}{\mathcal {D}}_\tau ({\mathbf {P}}) =\rho ^{-1}{\mathbf {U}}_0 (\varvec{{\varSigma }}_0-\tau {\mathbf {I}}) {\mathbf {V}}_0^\top \).
Lemma 6 can be verified by using Lemma 5 with \({\mathbf {Z}} = \tau ^{-1}{\mathbf {U}}_1 \varvec{{\varSigma }}_1 {\mathbf {V}}_1^\top \).
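The operator \({\mathcal {D}}_\tau \) appearing in Lemma 6 is the singular value thresholding operator of [7]. The following Python sketch implements it and numerically checks the optimality claim of Lemma 6, under the assumption (consistent with the stated subdifferential) that \(G({\mathbf {X}}) = \tfrac{\rho }{2}\Vert {\mathbf {X}}\Vert _F^2 - \langle {\mathbf {P}}, {\mathbf {X}}\rangle + \tau \Vert {\mathbf {X}}\Vert _*\) up to an additive constant; the dimensions and random data are arbitrary.

```python
import numpy as np

def svt(P, tau):
    """Singular value thresholding D_tau(P) = U_0 (Sigma_0 - tau I) V_0^T:
    keep only the singular values of P that exceed tau (cf. Lemma 6 and [7])."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def G(X, P, rho, tau):
    """G(X) = (rho/2)||X||_F^2 - <P, X> + tau ||X||_*, whose subdifferential is
    rho X - P + tau * d||X||_*, as in Lemma 6."""
    nuclear = np.linalg.svd(X, compute_uv=False).sum()
    return 0.5 * rho * np.linalg.norm(X, "fro") ** 2 - np.sum(P * X) + tau * nuclear

rng = np.random.default_rng(0)
P, rho, tau = rng.normal(size=(6, 4)), 2.0, 1.0
X_star = svt(P, tau) / rho            # X* = rho^{-1} D_tau(P) from Lemma 6

# X* is the global minimizer of the strongly convex G, so no small perturbation
# can decrease the objective value.
for _ in range(200):
    Y = X_star + 1e-3 * rng.normal(size=X_star.shape)
    assert G(Y, P, rho, tau) >= G(X_star, P, rho, tau) - 1e-12
```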
Now we derive the dual problem (2) of (1). As in the classical SVM, by setting \(C=({N\lambda })^{-1}\), we can rewrite (1) into the following form:
The corresponding Lagrange function of this problem can be written as
where \(\alpha _i \ge 0\) and \(\mu _i \ge 0\) are corresponding Lagrange multipliers. By setting the derivatives w.r.t. b and \(\xi _i\) of this Lagrange function to zero, we get
Based on Lemma 6 and setting the derivative w.r.t. \({\mathbf {M}}\) to zero, we have \({\mathbf {M}} = {\mathcal {D}}_\tau (\sum _{i=1}^N \alpha _i y_i {\mathbf {X}}_i).\) Substituting these conditions into (31), we obtain
Keywords
- Angle-based classifiers
- DCA (difference of convex functions) algorithm
- Fisher consistency
- Nonconvex optimization
- Robustness
- Spectral elastic net