Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3070607.3070608acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article
Public Access

LaraDB: A Minimalist Kernel for Linear and Relational Algebra Computation

Published: 14 May 2017 Publication History

Abstract

Analytics tasks manipulate structured data with variants of relational algebra (RA) and quantitative data with variants of linear algebra (LA). The two computational models have overlapping expressiveness, motivating a common programming model that affords unified reasoning and algorithm design. At the logical level we propose LARA, a lean algebra of three operators, that expresses RA and LA as well as relevant optimization rules. We show a series of proofs that position LARA at just the right level of expressiveness for a middleware algebra: more explicit than MapReduce but more general than RA or LA. At the physical level we find that the LARA operators afford efficient implementations using a single primitive that is available in a variety of backend engines: range scans over partitioned sorted maps.
To evaluate these ideas, we implemented the LARA operators as range iterators in Apache Accumulo, a popular implementation of Google's BigTable. First we show how LARA expresses a sensor quality control task, and we measure the performance impact of optimizations LARA admits on this task. Second we show that the LARADB implementation outperforms Accumulo's native MapReduce integration on a core task involving join and aggregation in the form of matrix multiply, especially at smaller scales that are typically a poor fit for scale-out approaches. We find that LARADB offers a conceptually lean framework for optimizing mixed-abstraction analytics tasks, without giving up fast record-level updates and scans.

References

[1]
S. M. Aji and R. McEliece. The generalized distributive law. Transactions on Information Theory, 46(2):325--343, 2000.
[2]
A. Alexandrov, R. Bergmann, S. Ewen, J.-C. Freytag, F. Hueske, A. Heise, O. Kao, M. Leich, U. Leser, V. Markl, et al. The stratosphere platform for big data analytics. The VLDB Journal, 23(6):939--964, 2014.
[3]
Array of things. https://arrayofthings.github.io/.
[4]
D. Bader, K. Madduri, J. Gilbert, V. Shah, J. Kepner, T. Meuse, and A. Krishnamurthy. Designing scalable synthetic compact applications for benchmarking high productivity computing systems. Cyberinfrastructure Technology Watch, 2:1--10, 2006.
[5]
Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. The HaLoop approach to large-scale iterative data analysis. VLDB Journal, 21(2):169--190, 2012.
[6]
A. Buluc and J. R. Gilbert. On the representation and multiplication of hypersparse matrices. In International Symposium on Parallel and Distributed Processing (IPDPS). IEEE, 2008.
[7]
P. Buneman, S. Naqvi, V. Tannen, and L. Wong. Principles of programming with complex objects and collection types. Theoretical Computer Science, 149(1):3--48, 1995.
[8]
R. Chaiken, B. Jenkins, P.-Å. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. Scope: easy and efficient parallel processing of massive data sets. Proceedings of the VLDB Endowment, 1(2), 2008.
[9]
F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26(2):4, 2008.
[10]
B. Chattopadhyay, L. Lin, W. Liu, S. Mittal, P. Aragonda, V. Lychagina, Y. Kwon, and M. Wong. Tenzing a SQL implementation on the mapreduce framework. PVLDB, 4:1318--1327, 2011.
[11]
A. Crotty, A. Galakatos, K. Dursun, T. Kraska, U. Cetintemel, and S. B. Zdonik. Tupleware: "big" data, big analytics, small clusters. In Conference on Innovative Data Systems Research (CIDR), 2015.
[12]
T. Elgamal, S. Luo, M. Boehm, A. V. Evfimievski, S. Tatikonda, B. Reinwald, and P. Sen. Spoof: Sum-product optimization and operator fusion for large-scale machine learning. In Conference on Innovative Data Systems Research (CIDR), Jan. 2017.
[13]
L. Fegaras, C. Li, and U. Gupta. An optimization framework for map-reduce queries. In EDBT. ACM, 2012.
[14]
Apache flume. https://flume.apache.org/.
[15]
V. Gadepally, J. Bolewski, D. Hook, D. Hutchison, B. Miller, and J. Kepner. Graphulo: Linear algebra graph kernels for NoSQL databases. In International Parallel & Distributed Processing Symposium Workshops. IEEE, 2015.
[16]
V. Gadepally and J. Kepner. Big data dimensional analysis. In High Performance Extreme Computing Conference (HPEC). IEEE, 2014.
[17]
V. Gadepally and J. Kepner. Using a power law distribution to describe big data. In High Performance Extreme Computing Conference (HPEC). IEEE, 2015.
[18]
J. M. Hellerstein, C. Ré, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, et al. The madlib analytics library: or mad skills, the sql. Proceedings of the VLDB Endowment, 5(12):1700--1711, 2012.
[19]
D. Hutchison, B. Howe, and D. Suciu. Lara: A key-value algebra underlying arrays and relations. arXiv preprint arXiv:1604.03607, 2016.
[20]
D. Hutchison, J. Kepner, V. Gadepally, and A. Fuchs. Graphulo implementation of server-side sparse matrix multiply in the Accumulo database. In High Performance Extreme Computing Conference (HPEC). IEEE, 9 2015.
[21]
D. Hutchison, J. Kepner, V. Gadepally, and B. Howe. From NoSQL Accumulo to NewSQL Graphulo: Design and utility of graph algorithms inside a BigTable database. In High Performance Extreme Computing Conference (HPEC). IEEE, 9 2016.
[22]
J. Kepner, P. Aaltonen, D. Bader, A. Buluç, F. Franchetti, J. Gilbert, D. Hutchison, M. Kumar, A. Lumsdaine, H. Meyerhenke, S. McMillan, J. Moreira, J. D. Owens, C. Yang, M. Zalewski, and T. Mattson. Mathematical foundations of the GraphBLAS. In HPEC. IEEE, 9 2016.
[23]
J. Kepner, V. Gadepally, D. Hutchison, H. Jananthan, T. Mattson, S. Samsi, and A. Reuther. Associative array model of SQL, NoSQL, and NewSQL databases. In High Performance Extreme Computing Conference (HPEC). IEEE, 9 2016.
[24]
A. Kunft, A. Alexandrov, A. Katsifodimos, and V. Markl. Bridging the gap: towards optimization across linear and relational algebra. In Proceedings of the 3rd SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond. ACM, 2016.
[25]
R. Maas, J. Hyrkas, O. G. Telford, M. Balazinska, A. Connolly, and B. Howe. Gaussian mixture models use-case: in-memory analysis with myria. In Proceedings of the 3rd VLDB Workshop on In-Memory Data Mangement and Analytics. ACM, 2015.
[26]
B. Marker, M. Schatz, D. Matthews, I. Dillig, R. van de Geijn, and D. Batory. Dxter: An extensible tool for optimal dataflow program generation. Technical report, Technical Report TR-15-03, The University of Texas at Austin, 2015.
[27]
X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, et al. Mllib: Machine learning in apache spark. Journal of Machine Learning Research, 17(34):1--7, 2016.
[28]
S. Palkar, J. J. Thomas, A. Shanbhag, D. Narayanan, H. Pirk, M. Schwarzkopf, S. Amarasinghe, M. Zaharia, and S. InfoLab. Weld: A common runtime for high performance data analytics. In Conference on Innovative Data Systems Research (CIDR), Jan. 2017.
[29]
A. Rheinländer, A. Heise, F. Hueske, U. Leser, and F. Naumann. SOFA: An extensible logical optimizer for udf-heavy data flows. Information Systems, 52, 2015.
[30]
D. Shukla, S. Thota, K. Raman, M. Gajendran, A. Shah, S. Ziuzin, K. Sundaram, M. G. Guajardo, A. Wawrzyniak, S. Boshra, R. Ferreira, M. Nassar, M. Koltachev, J. Huang, S. Sengupta, J. J. Levandoski, and D. B. Lomet. Schema-agnostic indexing with azure documentdb. PVLDB, 8:1668--1679, 2015.
[31]
M. Spight and V. Tropashko. First steps in relational lattice. arXiv preprint cs/0603044, 2006.
[32]
J. Wang, T. Baker, M. Balazinska, D. Halperin, B. Haynes, B. Howe, D. Hutchison, S. Jain, R. Maas, P. Mehta, et al. The Myria big data management and analytics system and cloud service. Jan. 2017.
[33]
M. M. Wolf, J. W. Berry, and D. T. Stark. A task-based linear algebra building blocks approach for scalable graph analytics. In High Performance Extreme Computing Conference (HPEC). IEEE, 2015.
[34]
M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: cluster computing with working sets. HotCloud, 10:10--10, 2010.

Cited By

View all
  • (2024)On Efficient Large Sparse Matrix Chain MultiplicationProceedings of the ACM on Management of Data10.1145/36549592:3(1-27)Online publication date: 30-May-2024
  • (2024)TensorTable: Extending PyTorch for mixed relational and linear algebra pipelinesBenchCouncil Transactions on Benchmarks, Standards and Evaluations10.1016/j.tbench.2024.1001614:1(100161)Online publication date: Mar-2024
  • (2023)Declarative Sub-Operators for Universal Data ProcessingProceedings of the VLDB Endowment10.14778/3611479.361153916:11(3461-3474)Online publication date: 24-Aug-2023
  • Show More Cited By

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
BeyondMR'17: Proceedings of the 4th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond
May 2017
76 pages
ISBN:9781450350198
DOI:10.1145/3070607
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 May 2017

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

Conference

SIGMOD/PODS'17
Sponsor:

Acceptance Rates

BeyondMR'17 Paper Acceptance Rate 9 of 17 submissions, 53%;
Overall Acceptance Rate 19 of 36 submissions, 53%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)123
  • Downloads (Last 6 weeks)23
Reflects downloads up to 16 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)On Efficient Large Sparse Matrix Chain MultiplicationProceedings of the ACM on Management of Data10.1145/36549592:3(1-27)Online publication date: 30-May-2024
  • (2024)TensorTable: Extending PyTorch for mixed relational and linear algebra pipelinesBenchCouncil Transactions on Benchmarks, Standards and Evaluations10.1016/j.tbench.2024.1001614:1(100161)Online publication date: Mar-2024
  • (2023)Declarative Sub-Operators for Universal Data ProcessingProceedings of the VLDB Endowment10.14778/3611479.361153916:11(3461-3474)Online publication date: 24-Aug-2023
  • (2023)Accelerating Machine Learning Queries with Linear Algebra Query ProcessingProceedings of the 35th International Conference on Scientific and Statistical Database Management10.1145/3603719.3603726(1-12)Online publication date: 10-Jul-2023
  • (2023)Optimizing Tensor Programs on Flexible StorageProceedings of the ACM on Management of Data10.1145/35887171:1(1-27)Online publication date: 30-May-2023
  • (2023)Givens rotations for QR decomposition, SVD and PCA over database joinsThe VLDB Journal10.1007/s00778-023-00818-933:4(1013-1037)Online publication date: 23-Nov-2023
  • (2023)Learning System for Relational AlgebraLearning and Collaboration Technologies10.1007/978-3-031-34411-4_5(54-63)Online publication date: 9-Jun-2023
  • (2022)Serving deep learning models with deduplication from relational databasesProceedings of the VLDB Endowment10.14778/3547305.354732515:10(2230-2243)Online publication date: 7-Sep-2022
  • (2022)Functional collection programming with semi-ring dictionariesProceedings of the ACM on Programming Languages10.1145/35273336:OOPSLA1(1-33)Online publication date: 29-Apr-2022
  • (2022)End-to-end Optimization of Machine Learning Prediction QueriesProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3526141(587-601)Online publication date: 10-Jun-2022
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media