DOI: 10.1145/2588555.2593680
research-article

A comparison of platforms for implementing and running very large scale machine learning algorithms

Published: 18 June 2014

Abstract

We describe an extensive benchmark of the platforms available to a user who wants to run a machine learning (ML) inference algorithm over a very large data set but cannot find an existing implementation, and thus must "roll her own" ML code. We have carefully chosen a set of five ML implementation tasks that involve learning relatively complex, hierarchical models. We completed those tasks on four different computational platforms and, using 70,000 hours of Amazon EC2 compute time, compared the running time, tuning requirements, and ease of programming of each.

    Published In
    SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
    June 2014
    1645 pages
    ISBN: 9781450323765
    DOI: 10.1145/2588555
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tag

    1. distributed machine learning

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS'14
    Acceptance Rates

    SIGMOD '14 paper acceptance rate: 107 of 421 submissions (25%)
    Overall acceptance rate: 785 of 4,003 submissions (20%)

    Cited By

    • (2022) Using machine learning to analyze longitudinal data: A tutorial guide and best‐practice recommendations for social science researchers. Applied Psychology 72:3 (1339-1364). DOI: 10.1111/apps.12435. Online publication date: 5-Oct-2022
    • (2022) Shufflecast: An Optical, Data-Rate Agnostic, and Low-Power Multicast Architecture for Next-Generation Compute Clusters. IEEE/ACM Transactions on Networking 30:5 (1970-1985). DOI: 10.1109/TNET.2022.3158899. Online publication date: Oct-2022
    • (2021) A new framework based on features modeling and ensemble learning to predict query performance. PLOS ONE 16:10 (e0258439). DOI: 10.1371/journal.pone.0258439. Online publication date: 18-Oct-2021
    • (2021) Efficient Construction of Nonlinear Models over Normalized Data. 2021 IEEE 37th International Conference on Data Engineering (ICDE) (1140-1151). DOI: 10.1109/ICDE51399.2021.00103. Online publication date: Apr-2021
    • (2020) A Survey on Large-Scale Machine Learning. IEEE Transactions on Knowledge and Data Engineering (1-1). DOI: 10.1109/TKDE.2020.3015777. Online publication date: 2020
    • (2020) Forecasting COVID-19 cases in India Using Machine Learning Models. 2020 International Conference on Smart Technologies in Computing, Electrical and Electronics (ICSTCEE) (466-471). DOI: 10.1109/ICSTCEE49637.2020.9276852. Online publication date: 9-Oct-2020
    • (2020) Towards Factorized SVM with Gaussian Kernels over Normalized Data. 2020 IEEE 36th International Conference on Data Engineering (ICDE) (1453-1464). DOI: 10.1109/ICDE48307.2020.00129. Online publication date: Apr-2020
    • (2020) Big Data Analytics—Analysis and Comparison of Various Tools. Advances in Information Communication Technology and Computing (491-499). DOI: 10.1007/978-981-15-5421-6_48. Online publication date: 19-Aug-2020
    • (2019) Data Management in Machine Learning Systems. Synthesis Lectures on Data Management 14:1 (1-173). DOI: 10.2200/S00895ED1V01Y201901DTM057. Online publication date: 25-Feb-2019
    • (2019) A comparative evaluation of systems for scalable linear algebra-based analytics. Proceedings of the VLDB Endowment 11:13 (2168-2182). DOI: 10.14778/3275366.3284963. Online publication date: 17-Jan-2019
