Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Dalton: Learned Partitioning for Distributed Data Streams

Published: 01 November 2022 Publication History

Abstract

To sustain the input rate of high-throughput streams, modern stream processing systems rely on parallel execution. However, skewed data yield imbalanced load assignments and create stragglers that hinder scalability Deciding on a static partitioning for a given set of "hot" keys is not sufficient as these keys are not known in advance, and even worse, the data distribution can change unpredictably. Existing algorithms either optimize for a specific distribution or, in order to adapt, assume a centralized partitioner that processes every incoming tuple and observes the whole workload. However, this is not realistic in a distributed environment, where multiple parallel upstream operators exist, as the centralized partitioner itself becomes the bottleneck and limits scalability
In this work, we propose Dalton: a lightweight, adaptive, yet scalable partitioning operator that relies on reinforcement learning. By memoizing state and dynamically keeping track of recent experience, Dalton: i) adjusts its policy at runtime and quickly adapts to the workload, ii) avoids redundant computations and minimizes the per-tuple partitioning overhead, and iii) efficiently scales out to multiple instances that learn cooperatively and converge to a joint policy Our experiments indicate that Dalton scales regardless of the input data distribution and sustains 1.3X - 6.7X higher throughput than existing approaches.

References

[1]
Ahmed S. Abdelhamid and Walid G. Aref. 2020. PartLy: Learning Data Partitioning for Distributed Data Stream Processing. In Proceedings of the Third International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (Portland, Oregon) (aiDM '20). Association for Computing Machinery, New York, NY, USA, Article 6, 4 pages.
[2]
Ahmed S. Abdelhamid, Ahmed R. Mahmood, Anas Daghistani, and Walid G. Aref. 2020. Prompt: Dynamic Data-Partitioning for Distributed Micro-Batch Stream Processing Systems. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (Portland, OR, USA) (SIGMOD '20). Association for Computing Machinery, New York, NY, USA, 2455--2469.
[3]
Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, and Sam Whittle. 2015. The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing. Proc. VLDB Endow. 8, 12 (2015), 1792--1803.
[4]
Ali Asghari, Mohammad Karim Sohrabi, and Farzin Yaghmaee. 2020. Online Scheduling of Dependent Tasks of Cloud's Workflows to Enhance Resource Utilization and Reduce the Makespan Using Multiple Reinforcement Learning-Based Agents. Soft Comput. 24, 21 (nov 2020), 16177--16199.
[5]
Ali Asghari, Mohammad Karim Sohrabi, and Farzin Yaghmaee. 2021. Task Scheduling, Resource Provisioning, and Load Balancing on Scientific Workflows Using Parallel SARSA Reinforcement Learning Agents and Genetic Algorithm. J. Supercomput. 77, 3 (mar 2021), 2800--2828.
[6]
Marcos Assuncao, Alexandre Veith, and Rajkumar Buyya. 2017. Distributed Data Stream Processing and Edge Computing: A Survey on Resource Elasticity and Future Directions. Journal of Network and Computer Applications 103 (12 2017).
[7]
Joanna Berlinska and Maciej Drozdowski. 2018. Comparing load-balancing algorithms for MapReduce under Zipfian data skews. Parallel Comput. 72 (2018), 14--28.
[8]
Djallel Bouneffouf and Irina Rish. 2019. A Survey on Practical Applications of Multi-Armed and Contextual Bandits.
[9]
Badrish Chandramouli, Jonathan Goldstein, Mike Barnett, Robert DeLine, John C. Platt, James F. Terwilliger, and John Wernsing. 2014. Trill: A High-Performance Incremental Query Processor for Diverse Analytics. Proc. VLDB Endow. 8, 4 (2014), 401--412.
[10]
Aaron Clauset, Cosma Shalizi, and Mark Newman. 2007. Power-Law Distributions in Empirical Data. SIAM Rev. 51 (06 2007).
[11]
Graham Cormode and Senthilmurugan Muthukrishnan. 2004. An Improved Data Stream Summary: The Count-Min Sketch and Its Applications. 29--38.
[12]
Micha Elsner and Warren Schudy. 2009. Bounding and Comparing Methods for Correlation Clustering beyond ILP. In Proceedings of the Workshop on Integer Linear Programming for Natural Langauge Processing (Boulder, Colorado) (ILP '09). Association for Computational Linguistics, USA, 19--27.
[13]
Junhua Fang, Rong Zhang, Tom Z.J. Fu, Zhenjie Zhang, Aoying Zhou, and Junhua Zhu. 2017. Parallel Stream Processing Against Workload Skewness and Variance. In Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing (Washington, DC, USA) (HPDC '17). Association for Computing Machinery, New York, NY, USA, 15--26.
[14]
Alex Filatov. 2020. 2016 USA Presidential election tweets. https://data.world/alexfilatov/2016-usa-presidential-election-tweets. Accessed: 2022-09-10.
[15]
Buğra Gedik. 2014. Partitioning Functions for Stateful Data Parallelism in Stream Processing. The VLDB Journal 23, 4 (aug 2014), 517--539.
[16]
Anja Gruenheid, Xin Luna Dong, and Divesh Srivastava. 2014. Incremental Record Linkage. Proc. VLDB Endow. 7, 9 (may 2014), 697--708.
[17]
Thomas Heinze, Lars Roediger, Andreas Meister, Yuanzhen Ji, Zbigniew Jerzak, and Christof Fetzer. 2015. Online parameter optimization for elastic data stream processing. In Proceedings of the Sixth ACM Symposium on Cloud Computing, SoCC 2015, Kohala Coast, Hawaii, USA, August 27--29, 2015, Shahram Ghandeharizadeh, Sumita Barahmand, Magdalena Balazinska, and Michael J. Freedman (Eds.). ACM, 276--287.
[18]
Thomas Heinze, Mariam Zia, Robert Krahn, Zbigniew Jerzak, and Christof Fetzer. 2015. An adaptive replication scheme for elastic data stream processing systems. In Proceedings of the 9th ACM International Conference on Distributed Event-Based Systems, DEBS '15, Oslo, Norway, June 29 - July 3, 2015, Frank Eliassen and Roman Vitenberg (Eds.). ACM, 150--161.
[19]
Jeyhun Karimov, Tilmann Rabl, and Volker Markl. 2019. Astream: Ad-hoc shared stream processing. In Proceedings of the 2019 International Conference on Management of Data. 607--622.
[20]
Asterios Katsifodimos and Sebastian Schelter. 2016. Apache Flink: Stream Analytics at Scale. In 2016 IEEE International Conference on Cloud Engineering Workshop (IC2EW). 193--193.
[21]
Nikos R. Katsipoulakis, Alexandros Labrinidis, and Panos K. Chrysanthis. 2017. A Holistic View of Stream Partitioning Costs. Proc. VLDB Endow. 10, 11 (aug 2017), 1286--1297.
[22]
Mohamed Khafagy, Ahmed Wahdan, and Hesham Hefny. 2014. Comparative Study Load Balance Algorithms for Map Reduce Environment. International Journal of Applied Information Systems 7 (11 2014), 41--50.
[23]
Lars Kolb, Andreas Thor, and Erhard Rahm. 2012. Load Balancing for MapReduce-based Entity Resolution. In 2012 IEEE 28th International Conference on Data Engineering. 618--629.
[24]
Alexandros Koliousis, Matthias Weidlich, Raul Castro Fernandez, Alexander L. Wolf, Paolo Costa, and Peter R. Pietzuch. 2016. SABER: Window-Based Hybrid Stream Processing for Heterogeneous Architectures. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, Fatma Özcan, Georgia Koutrika, and Sam Madden (Eds.). ACM, 555--569.
[25]
YongChul Kwon, Magdalena Balazinska, Bill Howe, and Jerome Rolia. 2012. SkewTune: Mitigating Skew in Mapreduce Applications. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (Scottsdale, Arizona, USA) (SIGMOD '12). Association for Computing Machinery, New York, NY, USA, 25--36.
[26]
Tyler Lu, Dávid Pál, and Martin Pal. 2010. Contextual Multi-Armed Bandits. Journal of Machine Learning Research - Proceedings Track 9, 485--492.
[27]
Fatma M. Talaat, Mohamed Sabry, Ahmed Saleh, Hesham Ali, and Shereen Ali. 2020. A load balancing and optimization strategy (LBOS) using reinforcement learning in fog computing environment. Journal of Ambient Intelligence and Humanized Computing 11 (11 2020).
[28]
Luo Mai, Kai Zeng, Rahul Potharaju, Le Xu, Steve Suh, Shivaram Venkataraman, Paolo Costa, Terry Kim, Saravanan Muthukrishnan, Vamsi Kuppa, Sudheer Dhulipalla, and Sriram Rao. 2018. Chi: A Scalable and Programmable Control Plane for Distributed Stream Processing Systems. Proc. VLDB Endow. 11, 10 (jun 2018), 1303--1316.
[29]
Derek Gordon Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: a timely dataflow system. In ACM SIGOPS 24th Symposium on Operating Systems Principles, SOSP '13, Farmington, PA, USA, November 3--6, 2013, Michael Kaminsky and Mike Dahlin (Eds.). ACM, 439--455.
[30]
Anis Nasir, Gianmarco Morales, Nicolas Kourtellis, and Marco Serafini. 2016. When two choices are not enough: Balancing at scale in Distributed Stream Processing. 589--600.
[31]
Muhammad Anis Uddin Nasir, Gianmarco De Francisci Morales, David García-Soriano, Nicolas Kourtellis, and Marco Serafini. 2015. The power of both choices: Practical load balancing for distributed stream processing engines. In 2015 IEEE 31st International Conference on Data Engineering. 137--148.
[32]
Anil Pacaci and M. Tamer Özsu. 2018. Distribution-Aware Stream Partitioning for Distributed Stream Processing Systems. In Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR@SIGMOD 2018, Houston, TX, USA, June 15, 2018, Foto N. Afrati, Jacek Sroka, Ke Yi, and Jan Hidders (Eds.). ACM, 6:1--6:10.
[33]
Nicolo Rivetti, Leonardo Querzoni, Emmanuelle Anceaume, Yann Busnel, and Bruno Sericola. 2015. Efficient key grouping for near-optimal load balancing in stream processing systems. In Proceedings of the 9th ACM International Conference on Distributed Event-Based Systems, DEBS '15, Oslo, Norway, June 29 - July 3, 2015, Frank Eliassen and Roman Vitenberg (Eds.). ACM, 80--91.
[34]
Henriette Röger and Ruben Mayer. 2019. A Comprehensive Survey on Parallelization and Elasticity in Stream Processing. ACM Comput. Surv. 52, 2 (2019), 36:1--36:37.
[35]
Mehul A. Shah, Joseph M. Hellerstein, Sirish Chandrasekaran, and Michael J. Franklin. 2003. Flux: An Adaptive Partitioning Operator for Continuous Query Systems. In Proceedings of the 19th International Conference on Data Engineering, March 5--8, 2003, Bangalore, India, Umeshwar Dayal, Krithi Ramamritham, and T. M. Vijayaraman (Eds.). IEEE Computer Society, 25--36.
[36]
Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA.
[37]
Kanat Tangwongsan, Martin Hirzel, and Scott Schneider. 2019. Sliding-Window Aggregation Algorithms. In Encyclopedia of Big Data Technologies, Sherif Sakr and Albert Y. Zomaya (Eds.). Springer.
[38]
Khin Me Me Thein. 2014. Apache kafka: Next generation distributed messaging system. International Journal of Scientific Engineering and Technology Research 3, 47 (2014), 9478--9483.
[39]
Zhao Tong, Xiaomei Deng, Hongjian Chen, Jing Mei, and Hong Liu. 2020. QL-HEFT: A Novel Machine Learning Scheduling Scheme Base on Cloud Computing Environment. Neural Comput. Appl. 32, 10 (may 2020), 5553--5570.
[40]
Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthikeyan Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, and Dmitriy V. Ryaboy. 2014. Storm@twitter. In International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22--27, 2014, Curtis E. Dyreson, Feifei Li, and M. Tamer Özsu (Eds.). ACM, 147--156.
[41]
Lucia Vadicamo, Fabio Carrara, Andrea Cimino, Stefano Cresci, Felice Dell'Orletta, Fabrizio Falchi, and Maurizio Tesconi. 2017. Cross-Media Learning for Image Sentiment Analysis in the Wild. In 2017 IEEE International Conference on Computer Vision Workshops (ICCVW). 308--317.
[42]
Christopher J. C. H. Watkins and Peter Dayan. 1992. Q-learning. Machine Learning 8, 3 (1992), 279--292.
[43]
Matei Zaharia, Reynold S .Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. 2016. Apache Spark: A Unified Engine for Big Data Processing. Commun. ACM 59, 11 (oct 2016), 56--65.
[44]
Steffen Zeuch, Bonaventura Del Monte, Jeyhun Karimov, Clemens Lutz, Manuel Renz, Jonas Traub, Sebastian Breß, Tilmann Rabl, and Volker Markl. 2019. Analyzing Efficient Stream Processing on Modern Hardware. Proc. VLDB Endow. 12, 5 (jan 2019), 516--530.

Cited By

View all
  • (2024)FlexSP:(1 + β)-Choice based Flexible Stream Partitioning for Stateful OperatorsProceedings of the 53rd International Conference on Parallel Processing10.1145/3673038.3673157(732-741)Online publication date: 12-Aug-2024
  • (2024)Last Night in Sweden: A Vision for Resource-Intelligent Stream ReasoningProceedings of the 18th ACM International Conference on Distributed and Event-based Systems10.1145/3629104.3666035(103-109)Online publication date: 24-Jun-2024
  • (2023)Zero-Shot Cost Models for Parallel Stream ProcessingProceedings of the Sixth International Workshop on Exploiting Artificial Intelligence Techniques for Data Management10.1145/3593078.3593934(1-5)Online publication date: 18-Jun-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 16, Issue 3
November 2022
181 pages
ISSN:2150-8097
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 November 2022
Published in PVLDB Volume 16, Issue 3

Badges

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)90
  • Downloads (Last 6 weeks)8
Reflects downloads up to 10 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)FlexSP:(1 + β)-Choice based Flexible Stream Partitioning for Stateful OperatorsProceedings of the 53rd International Conference on Parallel Processing10.1145/3673038.3673157(732-741)Online publication date: 12-Aug-2024
  • (2024)Last Night in Sweden: A Vision for Resource-Intelligent Stream ReasoningProceedings of the 18th ACM International Conference on Distributed and Event-based Systems10.1145/3629104.3666035(103-109)Online publication date: 24-Jun-2024
  • (2023)Zero-Shot Cost Models for Parallel Stream ProcessingProceedings of the Sixth International Workshop on Exploiting Artificial Intelligence Techniques for Data Management10.1145/3593078.3593934(1-5)Online publication date: 18-Jun-2023
  • (2023)HKS: Efficient Data Partitioning for Stateful StreamingBig Data Analytics and Knowledge Discovery10.1007/978-3-031-39831-5_35(386-391)Online publication date: 28-Aug-2023

View Options

Get Access

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media