Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/2723372.2735365acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

WANalytics: Geo-Distributed Analytics for a Data Intensive World

Published: 27 May 2015 Publication History

Abstract

Many large organizations collect massive volumes of data each day in a geographically distributed fashion, at data centers around the globe. Despite their geographically diverse origin the data must be processed and analyzed as a whole to extract insight. We call the problem of supporting large-scale geo-distributed analytics Wide-Area Big Data (WABD). To the best of our knowledge, WABD is currently addressed by copying all the data to a central data center where the analytics are run. This approach consumes expensive cross-data center bandwidth and is incompatible with data sovereignty restrictions that are starting to take shape. We instead propose WANalytics, a system that solves the WABD problem by orchestrating distributed query execution and adjusting data replication across data centers in order to minimize bandwidth usage, while respecting sovereignty requirements. WANalytics achieves an up to 360x reduction in data transfer cost when compared to the centralized approach on both real Microsoft production workloads and standard synthetic benchmarks, including TPC-CH and Berkeley Big-Data. In this demonstration, attendees will interact with a live geo-scale multi-data center deployment of WANalytics, allowing them to experience the data transfer reduction our system achieves, and to explore how it dynamically adapts execution strategy in response to changes in the workload and environment.

References

[1]
Cloudera Impala. http://bit.ly/1eRUDeA.
[2]
Greenplum. http://basho.com/riak/.
[3]
R. Cole et al. The mixed workload CH-benCHmark. In DBTest '11, DBTest '11, pages 8:1--8:6, New York, NY, USA, 2011. ACM.
[4]
B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. Pnuts: Yahoo!'s hosted data serving platform. Proc. VLDB Endow., 1(2):1277--1288, Aug. 2008.
[5]
J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, W. Hsieh, S. Kanthak, E. Kogan, H. Li, A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, and D. Woodford. Spanner: Google globally-distributed database. In 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), pages 261--264, Hollywood, CA, Oct. 2012. USENIX Association.
[6]
D. DeWitt and J. Gray. Parallel database systems: The future of high performance database systems. Communications of the ACM, 35(6):85--98, June 1992.
[7]
A. Greenberg, J. Hamilton, D. A. Maltz, and P. Patel. The cost of a Cloud: Research problems in data center networks. ACM CCR, 39(1), 2008.
[8]
A. Gupta, F. Yang, J. Govig, A. Kirsch, K. Chan, K. Lai, S. Wu, S. Dhoot, A. Kumar, A. Agiwal, S. Bhansali, M. Hong, J. Cameron, M. Siddiqi, D. Jones, J. Shute, A. Gubarev, S. Venkataraman, and D. Agrawal. Mesa: Geo-replicated, near real-time, scalable data warehousing. In VLDB, 2014.
[9]
D. Kossmann. The state of the art in distributed query processing. ACM Computing Surveys, 32(4):422--469, Dec. 2000.
[10]
N. Laoutaris, M. Sirivianos, X. Yang, and P. Rodriguez. Inter-datacenter bulk transfers with Netstitcher. In SIGCOMM 2011.
[11]
S. Madden. Database abstractions for managing sensor network data. Proc. of the IEEE, 98(11):1879--1886, 2010.
[12]
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: a not-so-foreign language for data processing. In SIGMOD 2008.
[13]
M. T. Özsu and P. Valduriez. Principles of distributed database systems. Springer, 2011.
[14]
A. Rabkin, M. Arye, S. Sen, V. S. Pai, and M. J. Freedman. Aggregation and degradation in jetstream: Streaming analytics in the wide area. In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14), pages 275--288, Seattle, WA, Apr. 2014. USENIX Association.
[15]
A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy. Hive - a petabyte scale data warehouse using hadoop. In ICDE 2010.
[16]
A. Vulimiri, C. Curino, B. Godfrey, K. Karanasos, and G. Varghese. WANalytics: Analytics for a geo-distributed data-intensive world. CIDR, 2015.
[17]
A. Vulimiri, C. Curino, B. Godfrey, J. Padhye, and G. Varghese. Global analytics in the face of bandwidth and regulatory constraints. In 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15), Oakland, CA, May 2015. USENIX Association.
[18]
R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: Sql and rich analytics at scale. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 13--24, New York, NY, USA, 2013. ACM.

Cited By

View all
  • (2025)Slark: A Performance Robust Decentralized Inter-Datacenter Deadline-Aware Coflows Scheduling Framework With Local InformationIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2024.350827536:2(197-211)Online publication date: Feb-2025
  • (2024)ResLake: Towards Minimum Job Latency and Balanced Resource Utilization in Geo-Distributed Job SchedulingProceedings of the VLDB Endowment10.14778/3685800.368581717:12(3934-3946)Online publication date: 1-Aug-2024
  • (2024)Optimal Query Plans for Geo-distributed Data Analytics at ScaleProceedings of the 7th Joint International Conference on Data Science & Management of Data (11th ACM IKDD CODS and 29th COMAD)10.1145/3632410.3632424(247-251)Online publication date: 4-Jan-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
May 2015
2110 pages
ISBN:9781450327589
DOI:10.1145/2723372
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 May 2015

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. analytics
  2. federation
  3. geo-distribution
  4. olap
  5. sovereignty

Qualifiers

  • Research-article

Funding Sources

  • Alfred P. Sloan Research Fellowship

Conference

SIGMOD/PODS'15
Sponsor:
SIGMOD/PODS'15: International Conference on Management of Data
May 31 - June 4, 2015
Victoria, Melbourne, Australia

Acceptance Rates

SIGMOD '15 Paper Acceptance Rate 106 of 415 submissions, 26%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)31
  • Downloads (Last 6 weeks)3
Reflects downloads up to 14 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2025)Slark: A Performance Robust Decentralized Inter-Datacenter Deadline-Aware Coflows Scheduling Framework With Local InformationIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2024.350827536:2(197-211)Online publication date: Feb-2025
  • (2024)ResLake: Towards Minimum Job Latency and Balanced Resource Utilization in Geo-Distributed Job SchedulingProceedings of the VLDB Endowment10.14778/3685800.368581717:12(3934-3946)Online publication date: 1-Aug-2024
  • (2024)Optimal Query Plans for Geo-distributed Data Analytics at ScaleProceedings of the 7th Joint International Conference on Data Science & Management of Data (11th ACM IKDD CODS and 29th COMAD)10.1145/3632410.3632424(247-251)Online publication date: 4-Jan-2024
  • (2024)Intent-Driven Multi-Engine Observability Dataflows for Heterogeneous Geo-Distributed Clouds2024 IEEE 17th International Conference on Cloud Computing (CLOUD)10.1109/CLOUD62652.2024.00014(30-41)Online publication date: 7-Jul-2024
  • (2024)GSelf-MapReduce: A Method for Enhancing Mapreduce Performance in Distributed Heterogeneous Data CentersIEEE Access10.1109/ACCESS.2024.348793612(159503-159518)Online publication date: 2024
  • (2023)Resource scheduling techniques in cloud from a view of coordination: a holistic survey从协同视角论云资源调度技术:综述Frontiers of Information Technology & Electronic Engineering10.1631/FITEE.210029824:1(1-40)Online publication date: 23-Jan-2023
  • (2023)Demystifying the QoS and QoE of Edge-hosted Video Streaming Applications in the Wild with SNESetProceedings of the ACM on Management of Data10.1145/36267231:4(1-29)Online publication date: 12-Dec-2023
  • (2023)PlexusProceedings of the 2023 ACM Symposium on Cloud Computing10.1145/3620678.3624643(1-16)Online publication date: 30-Oct-2023
  • (2023)SDTP: Accelerating Wide-Area Data Analytics With Simultaneous Data Transfer and ProcessingIEEE Transactions on Cloud Computing10.1109/TCC.2021.311999111:1(911-926)Online publication date: 1-Jan-2023
  • (2023)Multi-Stage Geo-Distributed Data Aggregation With Coordinated Computation and Communication in Edge Compute First NetworkingJournal of Lightwave Technology10.1109/JLT.2022.323284041:8(2289-2300)Online publication date: 15-Apr-2023
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media