Support vector machine in big data: smoothing strategy and adaptive distributed inference

  • Original Paper
  • Published in: Statistics and Computing

Abstract

The support vector machine (SVM) is a powerful binary classification tool, but the growing size of modern data poses challenges for it. First, the non-smoothness of the hinge loss causes difficulties in large-scale computation. Second, existing large-scale distributed algorithms rely heavily on uniformity and randomness conditions, which are frequently violated in practice. To address these issues, we first construct a convolution-smoothed SVM, which enjoys a smooth and convex objective function. We then develop a distributed SVM, whose estimator can be computed conveniently by minimizing a pilot-sample-based distributed surrogate loss. In particular, the method adapts when the uniformity or randomness condition is violated. The established theoretical results and numerical experiments on both synthetic and real data confirm the effectiveness of the proposed methods.
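To illustrate the smoothing idea in the abstract, the sketch below convolves the hinge loss with a Gaussian kernel of bandwidth h, which yields a smooth, convex surrogate with a closed form. This is a minimal single-machine sketch under assumed choices (Gaussian kernel, ridge penalty `lam`, L-BFGS solver, helper names `smoothed_hinge` and `fit_smoothed_svm` are hypothetical), not the authors' exact procedure or their distributed algorithm.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def smoothed_hinge(margins, h=0.5):
    """Gaussian-kernel convolution smoothing of the hinge loss max(0, 1 - margin).

    Writing m = 1 - margin and smoothing with a N(0, h^2) kernel gives the
    closed form E[max(0, m + h*Z)] = m * Phi(m/h) + h * phi(m/h), Z ~ N(0, 1),
    which is smooth and convex in m and recovers the hinge loss as h -> 0.
    """
    m = 1.0 - margins
    return m * norm.cdf(m / h) + h * norm.pdf(m / h)

def fit_smoothed_svm(X, y, h=0.5, lam=1e-2):
    """Fit a linear classifier by minimizing the smoothed hinge risk + ridge penalty."""
    n, p = X.shape

    def obj(beta):
        margins = y * (X @ beta)  # y in {-1, +1}
        return smoothed_hinge(margins, h).mean() + lam * beta @ beta

    # The smoothed objective is differentiable, so a quasi-Newton solver applies.
    res = minimize(obj, np.zeros(p), method="L-BFGS-B")
    return res.x
```

Because the smoothed objective is convex and differentiable, standard gradient-based solvers can replace the subgradient or QP machinery that the raw hinge loss would require; this is the computational benefit the abstract refers to.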


Data availability

No datasets were generated or analysed during the current study.


Author information

Contributions

Kangning Wang and Xiaofei Sun developed the methods and wrote the main manuscript text. Jin Liu prepared numerical studies. All authors reviewed the manuscript.

Corresponding author

Correspondence to Xiaofei Sun.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The research was supported by the NNSF project of China (12401355) and the Humanity and Social Science Foundation of the Ministry of Education of China (24YJC910009).

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 2412 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Wang, K., Liu, J. & Sun, X. Support vector machine in big data: smoothing strategy and adaptive distributed inference. Stat Comput 34, 188 (2024). https://doi.org/10.1007/s11222-024-10506-5

