DOI: 10.1109/CCGRID.2017.47

Parallel Variable Selection for Effective Performance Prediction

Published: 14 May 2017

Abstract

Large-scale data analysis problems often involve many variables, and the corresponding analysis algorithms may need to examine all variable combinations to find the optimal solution. For example, to model the time required to complete a scientific workflow, we need to consider the impact of dozens of parameters. To reduce the model building time and the likelihood of overfitting, we turn to variable selection methods to identify the variables critical to the performance model. In this work, we create a combination of variable selection and performance prediction methods that is as effective as the exhaustive search approach in the cases where the exhaustive search can be completed in a reasonable amount of time. To handle the cases where the exhaustive search is too time consuming, we develop a parallelized variable selection algorithm. Additionally, we develop a parallel grouping mechanism that further reduces the variable selection time by 70%.
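The parallelization opportunity described above comes from the fact that, within one elimination step, each one-variable-removed candidate subset can be scored independently. A minimal sketch of that idea follows; the `score` stand-in and the choice of a thread pool are illustrative assumptions, not the authors' implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def score(subset):
    # Stand-in scorer: in the paper's setting this would fit a
    # performance model on `subset` and return its prediction quality.
    important = {0, 1}                      # assumed "critical" variables
    return sum(f in important for f in subset) - 0.01 * len(subset)

def parallel_sbs_step(selected):
    """One backward-elimination step: score every candidate subset
    (one variable dropped) concurrently and keep the best."""
    candidates = [[f for f in selected if f != drop] for drop in selected]
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(score, candidates))
    best = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best]
```

In practice, model fitting is CPU-bound, so a process pool or MPI ranks across cluster nodes would replace the thread pool used here for brevity.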
As a case study, we exercise the variable selection technique on performance measurement data from the Palomar Transient Factory (PTF) workflow. The application scientists have determined that about 50 variables and parameters are important to the performance of the workflow. Our tests show that the Sequential Backward Selection algorithm approximates the optimal subset relatively quickly. By reducing the number of variables used to build the model from 50 to 4, we maintain the prediction quality while reducing the model building time by a factor of 6. Using the parallelization and grouping techniques developed in this work, the variable selection time was reduced from over 18 hours to 15 minutes while yielding the same variable subset.
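The Sequential Backward Selection procedure named above can be sketched as a greedy loop that repeatedly drops the variable whose removal costs the least model quality, stopping at a target subset size. The toy scorer below is an illustrative assumption, not the paper's performance model:

```python
def sequential_backward_selection(features, score_fn, k_target):
    """Greedily remove the feature whose removal hurts score_fn least,
    until only k_target features remain."""
    selected = list(features)
    while len(selected) > k_target:
        best_subset, best_score = None, float("-inf")
        for drop in selected:
            candidate = [f for f in selected if f != drop]
            s = score_fn(candidate)
            if s > best_score:
                best_subset, best_score = candidate, s
        selected = best_subset
    return selected

# Toy scorer: pretend only variables 0 and 1 matter, with a small
# penalty per retained variable (purely for illustration).
def toy_score(subset):
    return sum(f in {0, 1} for f in subset) - 0.01 * len(subset)

print(sequential_backward_selection(range(6), toy_score, 2))  # [0, 1]
```

Each step evaluates one model per remaining variable, so reducing from 50 variables to 4 costs on the order of 50 + 49 + ... + 5 model fits, far fewer than the 2^50 subsets an exhaustive search would face.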



Published In

CCGrid '17: Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, May 2017, 1167 pages. ISBN 9781509066100.

Publisher

IEEE Press

Qualifiers

  • Tutorial
  • Research
  • Refereed limited