Robust multicategory support matrix machines

  • Full Length Paper
  • Series B
  • Published:
Mathematical Programming

Abstract

We consider the classification problem in which the input features are represented as matrices rather than vectors. To preserve the intrinsic structure of such data for classification, a successful method is the support matrix machine (SMM) of Luo et al. (in: Proceedings of the 32nd international conference on machine learning, Lille, France, no 1, pp 938–947, 2015), which optimizes an objective function consisting of a hinge loss plus a so-called spectral elastic net penalty. However, how to extend SMM to multicategory classification has remained open. Moreover, in practice the training data are often contaminated by outlying observations, which can affect the robustness of existing matrix classification methods. In this paper, we address these issues by introducing a robust angle-based classifier that treats binary and multicategory problems in a unified framework. Thanks to the use of truncated hinge loss functions, the proposed classifier achieves a certain degree of robustness to outliers. The underlying optimization model becomes nonconvex, but it admits a natural DC (difference of two convex functions) representation. We develop a new and efficient algorithm by combining the DC algorithm with primal–dual first-order methods. The proposed DC algorithm adaptively chooses the accuracy of the subproblem at each iteration while guaranteeing the overall convergence of the algorithm. The primal–dual methods sidestep the complexity induced by the linear operator in the subproblems and rely only on the proximal operators of the objective functions and matrix–vector operations. This advantage allows us to solve large-scale problems efficiently. Theoretical and numerical results indicate that, for problems with potential outliers, our method can be highly competitive with existing methods.


References

  1. Altun, K., Barshan, B.: Human activity recognition using inertial/magnetic sensor units. In: International Workshop on Human Behavior Understanding, pp. 38–51. Springer (2010)

  2. An, L.T.H., Tao, P.D.: Solving a class of linearly constrained indefinite quadratic problems by DC algorithms. J. Glob. Optim. 11(3), 253–285 (1997)

  3. Bauschke, H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd edn. Springer, Berlin (2017)

  4. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144–152. ACM (1992)

  5. Boyd, S.: Alternating direction method of multipliers. In: Talk at NIPS Workshop on Optimization and Machine Learning (2011)

  6. Cai, D., He, X., Wen, J.-R., Han, J., Ma, W.-Y.: Support tensor machines for text categorization. Technical Report (2006)

  7. Cai, J.-F., Candès, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20(4), 1956–1982 (2010)

  8. Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011)

  9. Chambolle, A., Pock, T.: On the ergodic convergence rates of a first-order primal-dual algorithm. Math. Program. 159(1–2), 253–287 (2016)

  10. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)

  11. Davis, D., Yin, W.: Faster convergence rates of relaxed Peaceman–Rachford and ADMM under regularity assumptions. Math. Oper. Res. 42, 783–805 (2014)

  12. Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning. Springer Series in Statistics, 2nd edn. Springer, New York (2001)

  13. He, B.S., Yuan, X.M.: On the \({O}(1/n)\) convergence rate of the Douglas–Rachford alternating direction method. SIAM J. Numer. Anal. 50, 700–709 (2012)

  14. Hou, C., Nie, F., Zhang, C., Yi, D., Wu, Y.: Multiple rank multi-linear SVM for matrix data classification. Pattern Recognit. 47(1), 454–469 (2014)

  15. Huber, P.J., Ronchetti, E.: Robust Statistics. Wiley Series in Probability and Statistics, 2nd edn. Wiley, Hoboken (2009)

  16. Latala, R.: Some estimates of norms of random matrices. Proc. Am. Math. Soc. 133(5), 1273–1282 (2005)

  17. Lee, Y., Lin, Y., Wahba, G.: Multicategory support vector machines: theory and application to the classification of microarray data and satellite radiance data. J. Am. Stat. Assoc. 99(465), 67–81 (2004)

  18. Liu, Y.: Fisher consistency of multicategory support vector machines. In: Artificial Intelligence and Statistics, pp. 291–298 (2007)

  19. Luo, L., Xie, Y., Zhang, Z., Li, W.-J.: Support matrix machines. In: Proceedings of the 32nd International Conference on Machine Learning, Lille, France, no. 1, pp. 938–947 (2015)

  20. Mohri, M., Rostamizadeh, A., Talwalkar, A.: Foundations of Machine Learning. Adaptive Computation and Machine Learning Series. MIT Press, Cambridge (2012)

  21. Monteiro, R.D.C., Svaiter, B.F.: Iteration-complexity of block-decomposition algorithms and the alternating direction method of multipliers. SIAM J. Optim. 23(1), 475–507 (2013)

  22. Negahban, S., Wainwright, M.J.: Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann. Stat. 39(2), 1069–1097 (2011)

  23. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization. Kluwer Academic Publishers, Boston (2004)

  24. Pirsiavash, H., Ramanan, D., Fowlkes, C.: Bilinear classifiers for visual recognition. In: Advances in Neural Information Processing Systems, pp. 1482–1490 (2009)

  25. Rockafellar, R.T.: Convex Analysis, volume 28 of Princeton Mathematics Series. Princeton University Press, Princeton (1970)

  26. Sun, H., Craig, B., Zhang, L.: Angle-based multicategory distance-weighted SVM. J. Mach. Learn. Res. 18(85), 1–21 (2017)

  27. Tao, D., Li, X., Wu, X., Hu, W., Maybank, S.J.: Supervised tensor learning. Knowl. Inf. Syst. 13(1), 1–42 (2007)

  28. Tran-Dinh, Q.: Proximal alternating penalty algorithms for constrained convex optimization. Comput. Optim. Appl. 72(1), 1–43 (2019)

  29. Wright, S.J.: Coordinate descent algorithms. Math. Program. 151(1), 3–34 (2015)

  30. Wu, Y., Liu, Y.: Robust truncated hinge loss support vector machines. J. Am. Stat. Assoc. 102(479), 974–983 (2007)

  31. Yang, J., Zhang, D., Frangi, A.F., Yang, J.-Y.: Two-dimensional PCA: a new approach to appearance-based face representation and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 26(1), 131–137 (2004)

  32. Zhang, C., Liu, Y.: Multicategory angle-based large-margin classification. Biometrika 101(3), 625–640 (2014)

  33. Zhang, C., Liu, Y., Wang, J., Zhu, H.: Reinforced angle-based multicategory support vector machines. J. Comput. Gr. Stat. 25(3), 806–825 (2016)

  34. Zhang, C., Pham, M., Fu, S., Liu, Y.: Robust multicategory support vector machines using difference convex algorithm. Math. Program. 169(1), 277–305 (2018)

  35. Zhao, J., Yu, G., Liu, Y., et al.: Assessing robustness of classification using an angular breakdown point. Ann. Stat. 46(6B), 3362–3389 (2018)

  36. Zhou, H., Li, L.: Regularized matrix regression. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 76(2), 463–483 (2014)

  37. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 67(2), 301–320 (2005)


Acknowledgements

The authors are grateful to the editor and the reviewers for their insightful comments that have significantly improved the article. Qian and Zou were supported in part by NNSF of China Grants 11690015, 11622104 and 11431006, and NSF of Tianjin 18JCJQJC46000. Tran-Dinh was supported in part by US NSF Grant DMS-1619884. Liu was supported in part by US NSF Grants IIS1632951 and DMS-1821231, and NIH Grant R01GM126550.

Author information

Corresponding author

Correspondence to Yufeng Liu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Proofs of technical results

In this appendix, we provide all the remaining proofs of the results presented in the main text.

1.1 Proof of Lemma 2: Lipschitz continuity and boundedness

Since \([a]_{+} = \max \left\{ 0, a\right\} = (a + \left| a\right| )/2\), the function \(P_t\) defined in (15) can be rewritten as \(P_t({\mathcal {A}}({\mathbf {z}})) = \Vert {\hat{{\mathcal {A}}}}{\mathbf {z}}+ \mu \Vert _1 + {\mathbf {d}}_t^{\top }{\mathbf {z}}\) for some matrix \({\hat{{\mathcal {A}}}}\) and vectors \(\mu \) and \({\mathbf {d}}_t\). Here, \({\mathbf {d}}_t \triangleq {\bar{{\mathbf {d}}}} - \lambda ^{-1}\text{ vec }\left( \nabla {{\varPsi }}({\mathbf {M}}^t)\right) \). Moreover, \({\varPsi }\) is Lipschitz continuous by its definition. This implies that \(\nabla {{\varPsi }}({\mathbf {M}}^t)\) is uniformly bounded, i.e., there exists a constant \(C_0\in (0, +\infty )\) such that \(\left\| \nabla {{\varPsi }}({\mathbf {M}}^t)\right\| _F \le C_0\) for all \({\mathbf {M}}^t\in {\mathbb {R}}^{(K-1)\times pq}\). As a consequence, \(P_t\) is Lipschitz continuous with a uniform constant \(L_0\) that is independent of t, i.e., \(\vert P_t({\mathbf {u}}) - P_t({\widehat{{\mathbf {u}}}})\vert \le L_0\Vert {\mathbf {u}}- {\widehat{{\mathbf {u}}}}\Vert _F\) for all \({\mathbf {u}}, {\widehat{{\mathbf {u}}}}\). The boundedness of the domain \(\mathrm {dom}(P_t^{*})\) of the conjugate \(P_t^{*}\) then follows from [3, Corollary 17.19]. \(\square \)
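As a side remark, the last step is the standard fact that an \(L_0\)-Lipschitz convex function has a conjugate whose domain is contained in the ball of radius \(L_0\) (cf. [3, Corollary 17.19]). For completeness: if \(\Vert {\mathbf {y}}\Vert > L_0\), then taking \({\mathbf {u}}= t\,{\mathbf {y}}/\Vert {\mathbf {y}}\Vert \) and using \(P_t({\mathbf {u}}) \le P_t(0) + L_0\Vert {\mathbf {u}}\Vert \), we get

$$\begin{aligned} P_t^{*}({\mathbf {y}}) = \sup _{{\mathbf {u}}}\left\{ \langle {\mathbf {u}}, {\mathbf {y}}\rangle - P_t({\mathbf {u}})\right\} \ge t\left( \Vert {\mathbf {y}}\Vert - L_0\right) - P_t(0) \rightarrow +\infty \quad \text {as } t\rightarrow \infty , \end{aligned}$$

so \(P_t^{*}({\mathbf {y}}) = +\infty \), and hence \(\mathrm {dom}(P_t^{*}) \subseteq \{{\mathbf {y}}\mid \Vert {\mathbf {y}}\Vert \le L_0\}\).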

1.2 Proof of Lemma 3: convergence of the primal–dual methods

Let \({\mathcal {G}}({\mathbf {M}}, {\mathbf {Y}}) = Q_t({\mathbf {M}}) + \langle {\mathcal {A}}({\mathbf {M}}), {\mathbf {Y}}\rangle - P_t^{*}({\mathbf {Y}})\), where \(P_t^{*}\) is the Fenchel conjugate of \(P_t\). Applying [9, Theorem 4] with \(f = 0\), for any \({\mathbf {M}}\) and \({\mathbf {Y}}\), we have

$$\begin{aligned} {\mathcal {G}}({\mathbf {M}}^t_l, {\mathbf {Y}}) - {\mathcal {G}}({\mathbf {M}}, {\overline{{\mathbf {Y}}}}^t_l) \le \frac{1}{T_l}\left( \frac{\Vert {\mathbf {M}}^t_0 - {\mathbf {M}}\Vert _F^2}{2\omega ^t_0} + \frac{\Vert {\mathbf {Y}}^t_0 - {\mathbf {Y}}\Vert _F^2}{2\sigma ^t_0}\right) , \end{aligned}$$
(28)

where \(T_l = \sum _{i=1}^{l}\frac{\sigma ^t_{i-1}}{\sigma ^t_0}\), and \({\overline{{\mathbf {Y}}}}^t_l = \frac{1}{T_l}\sum _{j=1}^l\frac{\sigma ^t_{j-1}}{\sigma _0^t}{\mathbf {Y}}^t_j\).

By the update rule in (18), we have \(\omega ^t_{l+1}\sigma ^t_{l+1} = \omega ^t_l\sigma ^t_l\). Hence, by induction, we have \(\omega ^t_l\sigma ^t_l = \omega ^t_0\sigma ^t_0 = \left\| {\mathcal {A}}\right\| ^{-2}\). On the other hand, by [8, Lemma 2], with the choice of \(\lambda = \left\| {\mathcal {A}}\right\| ^{-1}(1+\rho _t)\), we have

$$\begin{aligned} \frac{\left\| {\mathcal {A}}\right\| }{1+\rho _t} + \frac{\left\| {\mathcal {A}}\right\| l}{\left\| {\mathcal {A}}\right\| + (1+\rho _t)} \le \frac{1}{(1+\rho _t)\omega ^t_l} \le \frac{\left\| {\mathcal {A}}\right\| }{1+\rho _t} + l. \end{aligned}$$

Using this estimate and \(\sigma ^t_l = \left\| {\mathcal {A}}\right\| ^{-2}(\omega ^t_l)^{-1}\), we have

$$\begin{aligned} T_l = \sum _{i=1}^l\frac{\sigma ^t_{i-1}}{\sigma ^t_0} = \frac{1}{\left\| {\mathcal {A}}\right\| }\sum _{i=1}^l\frac{1}{\omega ^t_{i-1}} \ge \sum _{i=1}^l \left( \frac{i-1}{1+c} + 1 \right) = \frac{l(l-1)}{2(1+c)} + l \ge \frac{l^2}{2(1+c)}, \end{aligned}$$

where \(c = \left\| {\mathcal {A}}\right\| (1+\rho _t)^{-1}\). Hence, we can estimate \(T_l\) as \(T_l \ge \tfrac{1}{2}(1+\rho _t+\left\| {\mathcal {A}}\right\| )^{-1}(1+\rho _t)l^2\). Using this estimate of \(T_l\), \(\sigma ^t_0 = \omega ^t_0 = \Vert {\mathcal {A}}\Vert \), and \({\widetilde{F}}_t({\mathbf {M}}^t_l) - {\widetilde{F}}_t({\overline{{\mathbf {M}}}}^{t+1}) \le {\mathcal {G}}({\mathbf {M}}^t_l, {\overline{{\mathbf {Y}}}}^{t+1}) - {\mathcal {G}}({\overline{{\mathbf {M}}}}^{t+1}, {\overline{{\mathbf {Y}}}}^t_l)\), we obtain from (28) that

$$\begin{aligned} {\widetilde{F}}_t({\mathbf {M}}^t_l) - {\widetilde{F}}_t({\overline{{\mathbf {M}}}}^{t+1}) \le \frac{(1+\rho _t + \Vert {\mathcal {A}}\Vert )\Vert {\mathcal {A}}\Vert }{(1+\rho _t)l^2}\left( \Vert {\mathbf {M}}^t_0 - {\overline{{\mathbf {M}}}}^{t+1}\Vert _F^2 + \Vert {\mathbf {Y}}^t_0 - {\overline{{\mathbf {Y}}}}^{t+1}\Vert _F^2 \right) . \end{aligned}$$

This is exactly (20).

Next, we prove (21). By introducing \({\mathbf {Y}}= {\mathcal {A}}({\mathbf {M}})\), we can reformulate the strongly convex subproblem (15) into the following constrained convex problem:

$$\begin{aligned} {\widetilde{F}}_t({\overline{{\mathbf {M}}}}^{t+1}) = \min _{{\mathbf {M}},{\mathbf {Y}}} \left\{ {\widetilde{F}}_t({\mathbf {M}},{\mathbf {Y}}) = P_t({\mathbf {Y}}) + Q_t({\mathbf {M}})~\mid ~{\mathcal {A}}({\mathbf {M}}) - {\mathbf {Y}}= 0\right\} . \end{aligned}$$
(29)

Note that \(Q_t\) is strongly convex with the strong convexity parameter \(1 + \rho _t\). We can apply [28, Algorithm 2] to solve (29). If we define

$$\begin{aligned} {\varDelta }_{\sigma ^t_l}({\mathbf {M}}^t_{l+1}) = P_t({\mathbf {Y}}^t_{l+1}) + Q_t({\mathbf {M}}^t_{l+1}) + \frac{\sigma ^t_l}{2}\Vert {\mathcal {A}}({\mathbf {M}}^t_{l+1}) - {\mathbf {Y}}^t_{l+1}\Vert _F^2 - {\widetilde{F}}_t({\overline{{\mathbf {M}}}}^{t+1}), \end{aligned}$$

then, from the proof of [28, Theorem 2], we can show that

$$\begin{aligned} {\varDelta }_{\sigma ^t_l}({\mathbf {M}}^t_{l+1}) \le \frac{2\big [\sigma ^t_0\Vert {\mathcal {A}}\Vert ^2 + 1 + \rho _t\big ]\Vert {\mathbf {M}}^t_0 - {\overline{{\mathbf {M}}}}^{t+1}\Vert _F^2}{(l+2)^2}. \end{aligned}$$
(30)

By Lemma 2, \(P_t\) is Lipschitz continuous with the Lipschitz constant \(L_0\). Then we have

$$\begin{aligned} \begin{aligned}&{\widetilde{F}}_t({\mathbf {M}}^t_{l+1}) - {\widetilde{F}}_t({\overline{{\mathbf {M}}}}^{t+1}) \\&\quad = P_t({\mathcal {A}}({\mathbf {M}}^t_{l+1})) + Q_t({\mathbf {M}}^t_{l+1}) - {\widetilde{F}}_t({\overline{{\mathbf {M}}}}^{t+1}) \\&\quad \le P_t({\mathbf {Y}}^t_{l+1}) + Q_t({\mathbf {M}}^t_{l+1}) - {\widetilde{F}}_t({\overline{{\mathbf {M}}}}^{t+1}) + L_0\Vert {\mathcal {A}}({\mathbf {M}}^t_{l+1}) - {\mathbf {Y}}^t_{l+1}\Vert _F. \end{aligned} \end{aligned}$$

Combining (30) and this estimate, we obtain

$$\begin{aligned} 0&\le {\widetilde{F}}_t({\mathbf {M}}^t_{l+1}) - {\widetilde{F}}_t({\overline{{\mathbf {M}}}}^{t+1}) \\&\le \frac{2\big [\sigma ^t_0\Vert {\mathcal {A}}\Vert ^2 + 1 + \rho _t\big ]\Vert {\mathbf {M}}^t_0 - {\overline{{\mathbf {M}}}}^{t+1}\Vert _F^2}{(l+2)^2} \\&\quad + L_0\Vert {\mathcal {A}}({\mathbf {M}}^t_{l+1}) - {\mathbf {Y}}^t_{l+1}\Vert _F - \frac{\sigma ^t_l}{2}\Vert {\mathcal {A}}({\mathbf {M}}^t_{l+1}) - {\mathbf {Y}}^t_{l+1}\Vert _F^2. \end{aligned}$$

Similar to the proof of [28, Corollary 1], by using \(\sigma ^t_0 = \frac{1+\rho _t}{2\Vert {\mathcal {A}}\Vert ^2}\), the last inequality leads to

$$\begin{aligned} \Vert {\mathcal {A}}({\mathbf {M}}^t_{l+1}) - {\mathbf {Y}}^t_{l+1}\Vert _F \le \frac{4\Vert {\mathcal {A}}\Vert }{(l+1)^2}\Big [\frac{2L_0\Vert {\mathcal {A}}\Vert }{1+\rho _t} + \sqrt{3}\Vert {\mathbf {M}}^t_0 - {\overline{{\mathbf {M}}}}^{t+1}\Vert _F\Big ]. \end{aligned}$$

Combining the two last estimates, we obtain

$$\begin{aligned} {\widetilde{F}}_t({\mathbf {M}}^t_{l}) - {\widetilde{F}}_t({\overline{{\mathbf {M}}}}^{t+1})\le & {} \frac{4\Vert {\mathcal {A}}\Vert L_0}{(l+1)^2}\Big [\frac{2L_0\Vert {\mathcal {A}}\Vert }{1+\rho _t} + \sqrt{3}\Vert {\mathbf {M}}^t_0 - {\overline{{\mathbf {M}}}}^{t+1}\Vert _F\Big ] \\&+ \,\frac{3(\rho _t+1)\Vert {\mathbf {M}}^t_0 - {\overline{{\mathbf {M}}}}^{t+1}\Vert _F^2}{(l+1)^2}, \end{aligned}$$

which is exactly (21). \(\square \)
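Remark (illustration only). The acceleration argument above hinges on the step-size invariant \(\omega ^t_{l+1}\sigma ^t_{l+1} = \omega ^t_l\sigma ^t_l\). The following Python sketch shows a generic accelerated primal–dual iteration of the Chambolle–Pock type [8, 9] for a toy instance of \(\min _{{\mathbf {M}}}\{P({\mathcal {A}}({\mathbf {M}})) + Q({\mathbf {M}})\}\) with a strongly convex \(Q\); the operator, the proximal maps, and the strong-convexity parameter mu are placeholders and do not reproduce the paper's subproblem (15).

```python
import numpy as np

def accelerated_primal_dual(A, x0, mu, iters=500):
    """Generic accelerated Chambolle-Pock sketch for
        min_x  P(A x) + Q(x),
    illustrated with P(u) = ||u||_1 and Q(x) = (mu/2)||x - x0||^2
    (so Q is mu-strongly convex).  Placeholder problem, not the paper's
    RMSMM subproblem; it only demonstrates the step-size rule
    omega_{l+1} * sigma_{l+1} = omega_l * sigma_l used in the proof of Lemma 3."""
    L = np.linalg.norm(A, 2)            # operator norm ||A||
    omega = sigma = 1.0 / L             # so that omega_0 * sigma_0 = ||A||^{-2}
    x = np.zeros(A.shape[1])
    xbar = x.copy()
    y = np.zeros(A.shape[0])
    for _ in range(iters):
        # dual step: prox of P^* is the projection onto the l_inf unit ball
        y = np.clip(y + sigma * (A @ xbar), -1.0, 1.0)
        # primal step: closed-form prox of omega * Q
        x_new = (x - omega * (A.T @ y) + omega * mu * x0) / (1.0 + omega * mu)
        # acceleration: theta*omega * (sigma/theta) = omega*sigma stays constant
        theta = 1.0 / np.sqrt(1.0 + 2.0 * mu * omega)
        omega, sigma = theta * omega, sigma / theta
        xbar = x_new + theta * (x_new - x)
        x = x_new
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((30, 10))
    x0 = rng.standard_normal(10)
    x_hat = accelerated_primal_dual(A, x0, mu=1.0)
    obj = np.sum(np.abs(A @ x_hat)) + 0.5 * np.sum((x_hat - x0) ** 2)
    print("objective at x_hat:", obj)
```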

1.3 Proof of statistical properties

We provide the proofs of Theorems 3, 4, and 5 in this section.

1.3.1 Proof of Theorem 3: Fisher’s consistency

In our RMSMM (3), the truncated hinge loss function can be written abstractly as

$$\begin{aligned} \phi ({\mathbf {f}}({\mathbf {X}}),y)=\gamma T_{(K-1)s}(\langle {\mathbf {f}}({\mathbf {X}}),~{\mathbf {w}}_{y}\rangle )+(1-\gamma )\sum _{k\ne y}R_{s}(\langle {\mathbf {f}}({\mathbf {X}}),~{\mathbf {w}}_k\rangle ). \end{aligned}$$

Then, the conditional loss can be rewritten as

$$\begin{aligned} L({\mathbf {X}})\triangleq \sum _{k=1}^K\left[ \gamma P_k T_{(K-1)s}(\langle {\mathbf {f}}({\mathbf {X}}),~{\mathbf {w}}_{k}\rangle )+ (1-\gamma )(1-P_k)R_{s}(\langle {\mathbf {f}}({\mathbf {X}}),~{\mathbf {w}}_k\rangle ) \right] . \end{aligned}$$

Theorem 1 in [34] shows that, for vector-valued data \({\mathbf {x}}\), the robust classifier based on the loss function \(\phi ({\mathbf {f}}({\mathbf {x}}),y)\) is Fisher consistent whenever \(\gamma \in \left[ 0,~\tfrac{1}{2}\right] \) and \(s\le 0\). After vectorizing the matrix data \({\mathbf {X}}\) into \({\mathbf {x}}=\text{ vec }({\mathbf {X}})\), the setting here is exactly that of Theorem 1 in [34]. Hence, the Fisher consistency result transfers naturally to matrix-type data. \(\square \)
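For the reader's convenience, the sketch below reproduces the simplex coding of the class vertices \({\mathbf {w}}_1,\ldots ,{\mathbf {w}}_K\subset {\mathbb {R}}^{K-1}\) commonly used in angle-based classification [32] and numerically checks that \(\Vert {\mathbf {w}}_k\Vert =1\) and \(\langle {\mathbf {w}}_j,{\mathbf {w}}_k\rangle =-1/(K-1)\) for \(j\ne k\). We assume this is the normalization used in the paper; the exact definition of the \({\mathbf {w}}_k\) appears in the main text, not in this appendix.

```python
import numpy as np

def simplex_vertices(K):
    """Standard angle-based simplex coding (as in Zhang and Liu [32]):
    K unit vectors in R^{K-1} with pairwise inner product -1/(K-1).
    Assumed normalization -- the paper's own definition is in its main text."""
    W = np.zeros((K, K - 1))
    W[0] = np.ones(K - 1) / np.sqrt(K - 1)
    for k in range(1, K):
        W[k] = -(1.0 + np.sqrt(K)) / (K - 1) ** 1.5 * np.ones(K - 1)
        W[k, k - 1] += np.sqrt(K / (K - 1))
    return W

if __name__ == "__main__":
    K = 4
    W = simplex_vertices(K)
    G = W @ W.T                      # Gram matrix of the vertices
    print(np.round(np.diag(G), 6))   # all entries equal to 1
    print(np.round(G[0, 1:], 6))     # all entries equal to -1/(K-1)
```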

1.3.2 Proof of Theorem 4: misclassification rates

First, we introduce the Rademacher complexity. Let \({\mathscr {G}} = \{g:{\mathbf {X}}\times {\mathcal {Y}}\rightarrow {\mathbb {R}}\}\) be a class of loss functions. Given the sample \({\mathcal {T}}=\{({\mathbf {X}}_i,y_i)\}_{i=1}^N\), we define the empirical Rademacher complexity of \({\mathscr {G}}\) as

$$\begin{aligned} {\hat{R}}_N({\mathscr {G}}) = E_{\varvec{\sigma }}\left\{ \sup _{g \in {\mathscr {G}}}\frac{1}{N}\sum _{i=1}^N \sigma _i g({\mathbf {X}}_i,y_i)\right\} , \end{aligned}$$

where \({\varvec{\sigma }} = \{\sigma _i\}_{i=1}^N\) are i.i.d. random variables with \(\Pr (\sigma _1=1)=\Pr (\sigma _1=-1)=1/2\). The Rademacher complexity of \({\mathscr {G}}\) is defined as

$$\begin{aligned} R_N({\mathscr {G}})=E_{{\varvec{\sigma }}, {\mathcal {T}}}\left\{ \sup _{g \in {\mathscr {G}}}\frac{1}{N}\sum _{i=1}^N \sigma _i g({\mathbf {X}}_i,y_i)\right\} . \end{aligned}$$

For our model, let

$$\begin{aligned} H = \left\{ h({\mathbf {X}},y) = \min _{k \ne y}(\langle {\mathbf {f}}({\mathbf {X}}),~{\mathbf {w}}_y - {\mathbf {w}}_k\rangle ) ~\mid ~{\mathbf {f}}\in {\mathcal {F}}, ~\sum _j||{\mathbf {M}}_j ||_* \le r \right\} , \end{aligned}$$

and

$$\begin{aligned} {\mathbb {I}}_\kappa (x) = {\left\{ \begin{array}{ll} 1 &{} x < 0, \\ 1 - \frac{1}{\kappa } x &{} 0 \le x \le \kappa , \\ 0 &{} \text{ otherwise }. \end{array}\right. } \end{aligned}$$
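As a purely illustrative aside, the following sketch implements the ramp surrogate \({\mathbb {I}}_\kappa \) and a Monte Carlo approximation of the empirical Rademacher complexity \({\hat{R}}_N\) for a toy, finite function class; the class, the sample size, and the scores are placeholders, and the class \(H\) in the proof is of course bounded analytically rather than by simulation.

```python
import numpy as np

def ramp(x, kappa):
    """The surrogate I_kappa: 1 for x < 0, linearly decreasing on [0, kappa], 0 beyond."""
    return np.clip(1.0 - x / kappa, 0.0, 1.0)

def empirical_rademacher(scores, n_mc=2000, rng=None):
    """Monte Carlo estimate of the empirical Rademacher complexity
        R_hat_N(G) = E_sigma sup_{g in G} (1/N) sum_i sigma_i g(X_i, y_i),
    for a finite toy class G whose values on the sample are the rows of
    `scores` (shape: |G| x N).  Purely illustrative."""
    rng = np.random.default_rng(rng)
    n_funcs, N = scores.shape
    total = 0.0
    for _ in range(n_mc):
        sigma = rng.choice([-1.0, 1.0], size=N)   # i.i.d. Rademacher signs
        total += np.max(scores @ sigma) / N       # sup over the finite class
    return total / n_mc

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, n_funcs = 200, 50
    scores = rng.standard_normal((n_funcs, N))    # placeholder values g(X_i, y_i)
    print("ramp(0.3, kappa=1):", ramp(0.3, 1.0))
    print("estimated R_hat_N :", empirical_rademacher(scores, rng=1))
```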

To prove Theorem 4, we first recall the following lemma, which bounds \(E\left[ {\mathbb {I}}_\kappa \left\{ h({\mathbf {X}},y)\right\} \right] \) in terms of the empirical error and the Rademacher complexity.

Lemma 4

For any \(h\in H\), with probability at least \(1-\delta \), we have

$$\begin{aligned} E\left[ {\mathbb {I}}_\kappa \left\{ h({\mathbf {X}},y) \right\} \right] \le \frac{1}{N}\sum _{i=1}^N {\mathbb {I}}_\kappa \left\{ h({\mathbf {X}}_i,y_i)\right\} + 2R_N({\mathbb {I}}_\kappa \circ H) + \left\{ \frac{\log (\delta ^{-1})}{N} \right\} ^{1/2}. \end{aligned}$$

The proof of Lemma 4 can be found in [34].

Now, we need to derive the upper bound of the Rademacher complexity used in Lemma 4. Since \({\mathbb {I}}_\kappa \) is \(\frac{1}{\kappa }\)-Lipschitz, we have

$$\begin{aligned} R_N({\mathbb {I}}_\kappa \circ H)&\le \frac{1}{\kappa } E_{{\varvec{\sigma }},{\mathcal {T}}}\left\{ \sup _{\sum ||{\mathbf {M}}_j ||_* \le r} \frac{1}{N} \sum _{i=1}^N \sigma _i \sum _{j=1}^{K-1} {{\,\mathrm{tr}\,}}({\mathbf {M}}_j^\top \tilde{{\mathbf {X}}}_i)\right\} \\&= \frac{r}{\kappa N} E_{{\varvec{\sigma }},{\mathcal {T}}} \left\{ \left\| \sum _{i=1}^N \sigma _i \tilde{{\mathbf {X}}}_i\right\| _2\right\} , \end{aligned}$$

where \(\tilde{{\mathbf {X}}}_i\) denotes \({\mathbf {X}}_i-{\bar{{\mathbf {X}}}}\) and \({\bar{{\mathbf {X}}}}=N^{-1}\sum _{i=1}^N{\mathbf {X}}_i\). The first inequality follows from Lemma 4.2 in [20] together with the fact that the entries of \({\mathbf {w}}_y- {\mathbf {w}}_k\) are bounded by 1 in absolute value.

By assumption, we can write \({\mathbf {X}}=E({\mathbf {X}})+{\mathbf {E}}\), where \(E({\mathbf {X}})=\sum _{k=1}^K \Pr ({\mathcal {Y}}=k){\mathbf {C}}_k\), and the entries of \({\mathbf {E}}\) have variance \(\sigma ^2\) and fourth moment \(\mu _4^4\). Accordingly, \(\tilde{{\mathbf {X}}}_i={\mathbf {E}}_i-{\bar{{\mathbf {E}}}}\), where \({\bar{{\mathbf {E}}}}=N^{-1}\sum _{i=1}^N{\mathbf {E}}_i\). Since \(\{({\mathbf {X}}_i,y_i)\}_{i=1}^N\) are i.i.d. copies of \(({\mathbf {X}},{\mathcal {Y}})\), we have

$$\begin{aligned}&\left\| \sum _{i=1}^N \sigma _i \tilde{{\mathbf {X}}}_i \right\| _2 \le \left| \sum _{i=1}^N \sigma _i\right| \left\| {\bar{{\mathbf {E}}}} \right\| _2 + \left\| \sum _{i=1}^N \sigma _i {\mathbf {E}}_i \right\| _2. \end{aligned}$$

Because each entry of \(\sum _{i=1}^N \sigma _i {\mathbf {E}}_i\) has second moment \(N\sigma ^2\) and fourth moment \(N\mu _4^4 + 3N(N-1)\sigma ^4\), by Theorem 2 in [16] we have

$$\begin{aligned}&E_{{\varvec{\sigma }}, {\mathcal {T}}}\left( \left\| \sum _{i=1}^N \sigma _i {\mathbf {E}}_i \right\| _2\right) \\&\quad \le c \sigma N^{1/2}\left\{ p^{1/2} + q^{1/2} + (pq)^{1/4}[N\mu _4^4 + 3N(N-1)\sigma ^4]^{1/4}/(\sigma N^{1/2})\right\} \\&\quad \le c \sigma (1+\frac{3^{1/4}}{2})N^{1/2}\left\{ p^{1/2} + q^{1/2}\right\} + O(N^{1/4}(p^{1/2} + q^{1/2})), \end{aligned}$$

where c is a constant which does not depend on \({\mathcal {T}}\). By similar arguments, it is easy to see that

$$\begin{aligned} E_{\varvec{\sigma },{\mathcal {T}}}\left( \left| \sum _{i=1}^N \sigma _i\right| \left\| {\bar{{\mathbf {E}}}} \right\| _2\right)&\le \sqrt{E_{\varvec{\sigma }}\left\{ \left( \sum _{i=1}^N \sigma _i\right) ^2\right\} }E_{{\mathcal {T}}}\left( ||{\bar{{\mathbf {E}}}} ||_2\right) \\&= N^{1/2}E_{{\mathcal {T}}}\left( ||{\bar{{\mathbf {E}}}} ||_2\right) =O(p^{1/2} + q^{1/2}). \end{aligned}$$

Accordingly, we obtain the upper bound of the Rademacher complexity as

$$\begin{aligned} R_N({\mathbb {I}}_\kappa \circ H) \le \frac{r}{\kappa \sqrt{N}}\left\{ c\sigma \left( 1+\frac{3^{1/4}}{2}\right) (p^{1/2} + q^{1/2})\right\} . \end{aligned}$$

The proof is completed by using Lemma 4 with this bound and the fact that the continuous indicator function \({\mathbb {I}}_\kappa \) is an upper bound of the indicator function for any \(\kappa \). \(\square \)
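The crucial quantitative ingredient above is the scaling \(E\Vert \sum _{i=1}^N\sigma _i{\mathbf {E}}_i\Vert _2 = O(\sigma \sqrt{N}(\sqrt{p}+\sqrt{q}))\) obtained from [16]. The short simulation below lets one observe this scaling numerically under the illustrative assumption of i.i.d. standard Gaussian entries; it is not part of the proof.

```python
import numpy as np

def spectral_norm_of_rademacher_sum(N, p, q, n_rep=20, rng=None):
    """Average spectral norm of sum_i sigma_i E_i for i.i.d. N(0,1) entries.
    Illustrates the sqrt(N) * (sqrt(p) + sqrt(q)) scaling behind Latala's
    bound [16]; Gaussian entries are an assumption made only here."""
    rng = np.random.default_rng(rng)
    vals = []
    for _ in range(n_rep):
        sigma = rng.choice([-1.0, 1.0], size=N)
        S = np.einsum("i,ipq->pq", sigma, rng.standard_normal((N, p, q)))
        vals.append(np.linalg.norm(S, 2))
    return np.mean(vals)

if __name__ == "__main__":
    for (N, p, q) in [(50, 20, 30), (200, 20, 30), (200, 80, 120)]:
        ratio = spectral_norm_of_rademacher_sum(N, p, q, rng=0) / (
            np.sqrt(N) * (np.sqrt(p) + np.sqrt(q)))
        print(f"N={N:4d}, p={p:3d}, q={q:3d}: "
              f"E||sum||_2 / [sqrt(N)(sqrt(p)+sqrt(q))] ~= {ratio:.3f}")
```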

1.3.3 Proof of Theorem 5: breakdown point analysis

Let \(F({\mathbf {M}},{\mathcal {T}})\) denote the loss function (3) with the sample \({\mathcal {T}}\), and

$$\begin{aligned} {\varDelta }^+ \triangleq \left\{ {\mathbf {M}}~\mid ~{\mathbf {w}}_k^\top {\widehat{{\mathbf {M}}}} {\mathbf {M}}^\top {\mathbf {w}}_k > 0~\text {for all}~k\right\} ~~\text {and}~~{\varDelta }^- \triangleq \left\{ {\mathbf {M}}~\mid ~{\mathbf {w}}_k^\top {\widehat{{\mathbf {M}}}} {\mathbf {M}}^\top {\mathbf {w}}_k \le 0~\text {for some}~k\right\} . \end{aligned}$$

For the MSMM classifier, choose the contaminated observation as \(({\mathbf {X}}^o,k)\) with \(\text{ vec }({{\mathbf {X}}^o})^\top =-c{\mathbf {w}}_k^\top {\widehat{{\mathbf {M}}}}\). For any \({\mathbf {M}}\in {\varDelta }^+\), we have \({\mathbf {w}}_k^\top {\widehat{{\mathbf {M}}}} {\mathbf {M}}^\top {\mathbf {w}}_k > 0\), and hence \({\mathbf {w}}_k^\top {\mathbf {M}}\text{ vec }{({\mathbf {X}}^o)} = -c {\mathbf {w}}_k^\top {\widehat{{\mathbf {M}}}}{\mathbf {M}}^\top {\mathbf {w}}_k \rightarrow -\infty \) as \(c \rightarrow \infty \). In this situation, the (unbounded) loss term corresponding to the contaminated observation tends to infinity for every \({\mathbf {M}}\in {\varDelta }^+\). Hence, we have \({\widetilde{{\mathbf {M}}}} \in {\varDelta }^-\) and the classifier breaks down.

For the RMSMM, since \({\widehat{{\mathbf {M}}}} \ne 0\), \({\widehat{{\mathbf {M}}}}\) is an interior point of \({\varDelta }^+\), and hence the claim

$$\begin{aligned} \epsilon _1 = \min _{{\mathbf {M}}\in {\varDelta }^-} F({\mathbf {M}},~{\mathcal {T}}_n) - \min _{{\mathbf {M}}\in {\varDelta }^+} F({\mathbf {M}},~{\mathcal {T}}_n) >0 \end{aligned}$$

is true. Note that the loss function

$$\begin{aligned} l({\mathbf {X}},y,{\mathbf {M}}) = \gamma T_{(K-1)s}({\mathbf {w}}_y^\top {\mathbf {M}}\text{ vec }({{\mathbf {X}}})) + (1-\gamma )\sum _{k \ne y} R_s({\mathbf {w}}_k^\top {\mathbf {M}}\text{ vec }({{\mathbf {X}}})) \end{aligned}$$

is bounded by \((K-1)(1-s)\). For any \(m \le n\epsilon _1/[2(1+\delta )(K-1)(1-s)]\) with an arbitrary constant \(\delta >0\), any clean subset \({\mathcal {T}}_{n-m} \subset {\mathcal {T}}_n\) of size \(n-m\), and any \({\mathbf {M}}\in {\mathbb {R}}^{(K-1) \times pq}\), we have

$$\begin{aligned} 0\le & {} F({\mathbf {M}},~{\mathcal {T}}_n) - \frac{n-m}{n}F({\mathbf {M}},~{\mathcal {T}}_{n-m}) = \frac{1}{n}\sum _{i \in {\mathcal {T}}_{n}\setminus {\mathcal {T}}_{n-m}} l({\mathbf {X}}_i,y_i,{\mathbf {M}}) \\\le & {} \frac{m(K-1)(1-s)}{n} < \frac{\epsilon _1}{2+2\delta }. \end{aligned}$$

Therefore,

$$\begin{aligned} \left| \min _{{\mathbf {M}}\in {\varDelta }^-} F({\mathbf {M}},~{\mathcal {T}}_n) - \min _{{\mathbf {M}}\in {\varDelta }^+} F({\mathbf {M}},~{\mathcal {T}}_n) - \min _{{\mathbf {M}}\in {\varDelta }^-} F({\mathbf {M}},~\widetilde{{\mathcal {T}}}_{n,m}) \right. \\ \left. + \min _{{\mathbf {M}}\in {\varDelta }^+} F({\mathbf {M}},~\widetilde{{\mathcal {T}}}_{n,m})\right| \le \frac{\epsilon _1}{1+\delta }, \end{aligned}$$

and

$$\begin{aligned} \min _{{\mathbf {M}}\in {\varDelta }^-} F({\mathbf {M}},~\widetilde{{\mathcal {T}}}_{n,m}) - \min _{{\mathbf {M}}\in {\varDelta }^+} F({\mathbf {M}},~\widetilde{{\mathcal {T}}}_{n,m})> \frac{\epsilon _1 \delta }{1+\delta } > 0. \end{aligned}$$

The last inequality reveals that \({\widetilde{{\mathbf {M}}}} \in {\varDelta }^+\), and thus the classifier does not break down when \(m \le n\epsilon _1/[2(1+\delta )(K-1)(1-s)]\) observations are contaminated. Finally, the proof is completed by letting \(\delta \rightarrow 0\). \(\square \)

1.4 Derivation of Eq. (2): the dual problem

Lemma 5

For a \(p \times q\) real matrix \({\mathbf {A}}\), the subdifferential of the nuclear norm \(\Vert \cdot \Vert _{*}\) is given as

$$\begin{aligned} \partial ||{\mathbf {A}} ||_{*} = \left\{ {\mathbf {U}}_{\mathbf {A}} {\mathbf {V}}_{\mathbf {A}}^\top +{\mathbf {Z}} ~\mid ~ {\mathbf {Z}} \in {\mathbb {R}}^{p \times q}, {\mathbf {U}}_{\mathbf {A}}^\top {\mathbf {Z}}={\mathbf {0}},\, {\mathbf {ZV}}_{\mathbf {A}} ={\mathbf {0}},\, ||{\mathbf {Z}} ||_2 \le 1 \right\} , \end{aligned}$$

where \({\mathbf {U}}_{\mathbf {A}} {\varvec{{\varSigma }}}_{\mathbf {A}} {\mathbf {V}}_{\mathbf {A}}^\top \) is the SVD of \({\mathbf {A}}\), and \(\partial \) denotes the subdifferential operator.

Lemma 6

Suppose that \({\mathbf {X}} \in {\mathbb {R}}^{p \times q}\) and \(\partial G({\mathbf {X}})= \rho {\mathbf {X}} - {\mathbf {P}} + \tau \partial ||{\mathbf {X}} ||_* \), where \({\mathbf {P}} \in {\mathbb {R}}^{p \times q}\) is a constant matrix that does not depend on \({\mathbf {X}}\). Let the SVD of \({\mathbf {P}}\) be

$$\begin{aligned} {\mathbf {P}} = {\mathbf {U}}_0 \varvec{{\varSigma }}_0 {\mathbf {V}}_0^\top + {\mathbf {U}}_1 \varvec{{\varSigma }}_1 {\mathbf {V}}_1^\top , \end{aligned}$$

where \(\varvec{{\varSigma }}_0\) contains the singular values of \({\mathbf {P}}\) which are greater than \(\tau \), and \(\varvec{{\varSigma }}_1\) contains the rest. Then, we have \({\mathbf {0}} \in \partial G({\mathbf {X}}^*)\), where \({\mathbf {X}}^*=\rho ^{-1}{\mathcal {D}}_\tau ({\mathbf {P}}) =\rho ^{-1}{\mathbf {U}}_0 (\varvec{{\varSigma }}_0-\tau {\mathbf {I}}) {\mathbf {V}}_0^\top \).

Lemma 6 can be verified by using Lemma 5 with \({\mathbf {Z}} = \tau ^{-1}{\mathbf {U}}_1 \varvec{{\varSigma }}_1 {\mathbf {V}}_1^\top \).
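In computational terms, Lemma 6 states that the singular value thresholding (SVT) operator \({\mathcal {D}}_\tau \) of [7] solves \(\min _{{\mathbf {X}}}\{\tfrac{\rho }{2}\Vert {\mathbf {X}}\Vert _F^2 - \langle {\mathbf {P}},{\mathbf {X}}\rangle + \tau \Vert {\mathbf {X}}\Vert _{*}\}\) (our reading of \(G\), with any constant term dropped, since the lemma only specifies \(\partial G\)). The sketch below implements \({\mathcal {D}}_\tau \) and checks the optimality of \({\mathbf {X}}^{*}=\rho ^{-1}{\mathcal {D}}_\tau ({\mathbf {P}})\) against random perturbations on synthetic data.

```python
import numpy as np

def svt(P, tau):
    """Singular value thresholding D_tau(P) = U (Sigma - tau I)_+ V^T, cf. [7]."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def G(X, P, rho, tau):
    """G(X) = (rho/2)||X||_F^2 - <P, X> + tau ||X||_*  (constant term dropped);
    our reading of the objective whose subdifferential is given in Lemma 6."""
    return 0.5 * rho * np.sum(X ** 2) - np.sum(P * X) + tau * np.linalg.norm(X, "nuc")

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p, q, rho, tau = 8, 6, 2.0, 1.5
    P = rng.standard_normal((p, q))
    X_star = svt(P, tau) / rho                   # candidate minimizer from Lemma 6
    g_star = G(X_star, P, rho, tau)
    # Random perturbations should never improve the objective.
    worst = min(G(X_star + 1e-3 * rng.standard_normal((p, q)), P, rho, tau)
                for _ in range(200))
    print("G(X*) =", g_star, " min over perturbations =", worst, " ok:", worst >= g_star)
```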

Now we derive the dual problem (2) of (1). As in the classical SVM, by setting \(C=({N\lambda })^{-1}\), we can rewrite (1) into the following form:

$$\begin{aligned} \left\{ \begin{aligned}&\min _{{\mathbf {M}},b,\xi } \left\{ \frac{1}{2}{{\,\mathrm{tr}\,}}({\mathbf {M}}^\top {\mathbf {M}})+ \tau ||{\mathbf {M}} ||_* + C\sum _{i=1}^N \xi _i \right\} \\&\text{ s.t. } \quad \xi _i \ge 0,~~y_i\left[ {{\,\mathrm{tr}\,}}({\mathbf {M}}^\top {\mathbf {X}}_i) + b\right] \ge 1-\xi _i,~~i=1,\ldots ,N. \end{aligned}\right. \end{aligned}$$

The corresponding Lagrange function of this problem can be written as

$$\begin{aligned} \begin{aligned} L_P({\mathbf {M}},b,\xi ,\alpha ,\mu )&= \frac{1}{2}{{\,\mathrm{tr}\,}}({\mathbf {M}}^\top {\mathbf {M}})+\tau ||{\mathbf {M}} ||_* + C\sum _{i=1}^N \xi _i\\&\quad - \sum _{i=1}^N \alpha _i [y_i\{{{\,\mathrm{tr}\,}}({\mathbf {M}}^\top {\mathbf {X}}_i)+b\}-1+\xi _i] - \sum _{i=1}^N \mu _i \xi _i, \end{aligned} \end{aligned}$$
(31)

where \(\alpha _i \ge 0\) and \(\mu _i \ge 0\) are corresponding Lagrange multipliers. By setting the derivatives w.r.t. b and \(\xi _i\) of this Lagrange function to zero, we get

$$\begin{aligned} \left\{ \begin{array}{ll} \displaystyle \sum _{i=1}^N \alpha _i y_i=0, &{}\\ C-\alpha _i-\mu _i =0,&{}\quad i=1,\ldots ,N. \end{array}\right. \end{aligned}$$

Based on Lemma 6, requiring that zero belongs to the subdifferential w.r.t. \({\mathbf {M}}\), we have \({\mathbf {M}} = {\mathcal {D}}_\tau (\sum _{i=1}^N \alpha _i y_i {\mathbf {X}}_i).\) Substituting these conditions into (31), we obtain

$$\begin{aligned} \left\{ \begin{aligned}&\min _{\alpha } \left\{ \frac{1}{2}||{\mathcal {D}}_\tau \left( \sum _{i=1}^N \alpha _i y_i {\mathbf {X}}_i\right) ||_F^2 - \sum _{i=1}^N \alpha _i \right\} \\&\text{ s.t. } \quad 0 \le \alpha _i \le C;\quad i=1,\ldots , N,~\sum _{i=1}^N \alpha _i y_i=0. \end{aligned}\right. \end{aligned}$$

This gives us the dual problem (2) of (1). \(\square \)
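To connect the derivation with computation, the sketch below evaluates the dual objective \(\tfrac{1}{2}\Vert {\mathcal {D}}_\tau (\sum _i\alpha _iy_i{\mathbf {X}}_i)\Vert _F^2-\sum _i\alpha _i\) and recovers \({\mathbf {M}} = {\mathcal {D}}_\tau (\sum _i\alpha _iy_i{\mathbf {X}}_i)\) from a given dual vector \(\alpha \). The data and the vector \(\alpha \) are random placeholders, and no dual solver is included: any routine handling the box constraint \(0\le \alpha _i\le C\) and the equality \(\sum _i\alpha _iy_i=0\) could be plugged in.

```python
import numpy as np

def svt(P, tau):
    # Singular value thresholding, as in the previous sketch.
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def dual_objective(alpha, X, y, tau):
    """Evaluate 0.5*||D_tau(sum_i alpha_i y_i X_i)||_F^2 - sum_i alpha_i
    and return it with the recovered primal matrix M = D_tau(sum_i alpha_i y_i X_i)."""
    P = np.einsum("i,ipq->pq", alpha * y, X)   # sum_i alpha_i y_i X_i
    M = svt(P, tau)
    return 0.5 * np.sum(M ** 2) - np.sum(alpha), M

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    N, p, q, tau, C = 40, 5, 7, 0.5, 1.0
    X = rng.standard_normal((N, p, q))      # placeholder matrix-valued inputs
    y = rng.choice([-1.0, 1.0], size=N)     # placeholder binary labels
    alpha = rng.uniform(0.0, C, size=N)     # a point in the box [0, C]^N; a real solver
                                            # would also enforce sum_i alpha_i y_i = 0
    val, M = dual_objective(alpha, X, y, tau)
    print("dual objective:", val, "  rank of M:", np.linalg.matrix_rank(M))
```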

Cite this article

Qian, C., Tran-Dinh, Q., Fu, S. et al. Robust multicategory support matrix machines. Math. Program. 176, 429–463 (2019). https://doi.org/10.1007/s10107-019-01386-z
