Abstract
The support vector machine (SVM) is a powerful binary classification tool, but the growing size of modern datasets poses challenges for it. First, the non-smoothness of the hinge loss complicates large-scale computation. Second, existing large-scale distributed algorithms rely heavily on uniformity and randomness conditions, which are frequently violated in practice. To address these issues, we first construct a convolution-smoothed SVM, which enjoys a smooth and convex objective function. We then develop a distributed SVM whose estimator can be computed conveniently by minimizing a pilot-sample-based distributed surrogate loss; in particular, it remains adaptive when the uniformity or randomness condition is violated. Theoretical results and numerical experiments on both synthetic and real data confirm the effectiveness of the proposed methods.
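To make the smoothing idea concrete, below is a minimal Python sketch of a convolution-smoothed hinge loss: the hinge loss max(0, 1 - u) convolved with a Gaussian kernel of bandwidth h, which yields a smooth and convex surrogate amenable to gradient-based optimization. This is an illustration under assumed choices (Gaussian kernel, ridge penalty, plain gradient descent); the paper's actual kernel, bandwidth selection, and pilot-sample-based distributed surrogate loss may differ, and all function names here are hypothetical.

```python
# Sketch of a convolution-smoothed linear SVM (Gaussian kernel assumed);
# not the paper's exact construction.
import numpy as np
from scipy.stats import norm

def smoothed_hinge(u, h):
    """Hinge loss max(0, 1 - u) convolved with a Gaussian kernel of bandwidth h.
    With t = 1 - u and Z ~ N(0, 1): E[(t + h*Z)_+] = t*Phi(t/h) + h*phi(t/h)."""
    t = 1.0 - u
    return t * norm.cdf(t / h) + h * norm.pdf(t / h)

def smoothed_hinge_grad(u, h):
    """Derivative of the smoothed hinge in u: -Phi((1 - u)/h)."""
    return -norm.cdf((1.0 - u) / h)

def fit_smoothed_svm(X, y, h=0.5, lam=1e-2, lr=0.1, iters=500):
    """Gradient descent on the smooth, convex objective
    (1/n) * sum_i ell_h(y_i * x_i' beta) + (lam/2) * ||beta||^2."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(iters):
        u = y * (X @ beta)  # classification margins
        grad = X.T @ (y * smoothed_hinge_grad(u, h)) / n + lam * beta
        beta -= lr * grad
    return beta

# Toy usage on synthetic data with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200))
beta_hat = fit_smoothed_svm(X, y)
```

As the bandwidth h shrinks to zero, the smoothed loss recovers the hinge loss pointwise, so the smoothed estimator can approximate the standard SVM while keeping a differentiable objective.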
Data availability
No datasets were generated or analysed during the current study.
Author information
Contributions
Kangning Wang and Xiaofei Sun developed the methods and wrote the main manuscript text. Jin Liu prepared the numerical studies. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This research was supported by the NNSF project of China (12401355) and the Humanity and Social Science Foundation of the Ministry of Education of China (24YJC910009).
Supplementary Information
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, K., Liu, J. & Sun, X. Support vector machine in big data: smoothing strategy and adaptive distributed inference. Stat Comput 34, 188 (2024). https://doi.org/10.1007/s11222-024-10506-5