Abstract
The support vector machine (SVM) is a popular supervised machine learning technique extensively applied to various real-life problems. A weakness of SVM, however, is that its efficiency decreases rapidly as the size of the training dataset increases. This issue is particularly challenging for IoT devices and applications, where computing resources are severely constrained. As a result, numerous schemes have been proposed to reduce the training time of SVM, especially for large-scale datasets. In this paper, a survey of the state-of-the-art methods for reducing the number of training data points and thereby increasing the speed of SVM is presented. The existing methods are categorized into five types according to the approach employed for the data reduction process, namely clustering-based, geometrical, distance-based, random sampling, and evolutionary methods. The key characteristics of the schemes are reviewed and compared against six practical benchmarks. This survey will be useful for selecting an appropriate method for a given target problem and for developing new approaches to improve the performance of SVM on large-scale datasets.
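To make the general idea concrete, the sketch below illustrates the clustering-based category: the training set is condensed with k-means before fitting the classifier. This is a minimal, hypothetical example only, not a method from any of the surveyed papers; scikit-learn, the synthetic data, and the rule of keeping one representative point per cluster and class are assumptions made solely for this illustration.

```python
# Illustrative sketch of clustering-based training-data reduction for SVM.
# Not the authors' method; all parameter choices here are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def reduce_by_clustering(X, y, n_clusters=200):
    """Per class, cluster the samples and keep the point nearest each centroid."""
    X_keep, y_keep = [], []
    for label in np.unique(y):
        X_c = X[y == label]
        k = min(n_clusters, len(X_c))
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_c)
        for center in km.cluster_centers_:
            # Retain the single sample closest to this cluster centroid.
            idx = np.argmin(np.linalg.norm(X_c - center, axis=1))
            X_keep.append(X_c[idx])
            y_keep.append(label)
    return np.array(X_keep), np.array(y_keep)


# Synthetic data stands in for a large-scale training set.
X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

X_red, y_red = reduce_by_clustering(X_tr, y_tr, n_clusters=200)
clf = SVC(kernel="rbf").fit(X_red, y_red)  # train only on the reduced set
print("reduced size:", len(X_red), "test accuracy:", clf.score(X_te, y_te))
```

How well such a reduced set preserves accuracy depends on whether the retained points still cover the region around the decision boundary, which is the central trade-off the surveyed methods address in different ways.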
Availability of data and material
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
Code availability
Not applicable.
Funding
This work was partly supported by the Institute for Information and communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2016–0-00133, Research on Edge computing via collective intelligence of hyperconnection IoT nodes), Korea, under the National Program for Excellence in SW supervised by the IITP (2015–0-00914), by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2017R1A2B2009095, Research on SDN-based WSN Supporting Real-time Stream Data Processing and Multiconnectivity; 2019R1I1A1A01058780, Efficient Management of SDN-based Wireless Sensor Network Using Machine Learning Technique), and by the second Brain Korea 21 PLUS project.
Author information
Contributions
All authors contributed to this study. PB and HYY had the idea for the article and performed the literature search and data analysis. PB drafted the paper and KTK and HYY critically revised the work.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethics approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Birzhandi, P., Kim, K.T. & Youn, H.Y. Reduction of training data for support vector machine: a survey. Soft Comput 26, 3729–3742 (2022). https://doi.org/10.1007/s00500-022-06787-5