DOI: 10.1145/3307681.3325406

UMR-EC: A Unified and Multi-Rail Erasure Coding Library for High-Performance Distributed Storage Systems

Published: 17 June 2019

Abstract

Distributed storage systems typically need to store data redundantly to guarantee durability and reliability. While the conventional approach is to store multiple replicas, today's unprecedented data growth rates encourage modern distributed storage systems to employ Erasure Coding (EC) techniques, which achieve better storage efficiency. Various hardware-based EC schemes have been proposed in the community to leverage the advanced compute capabilities of modern data center and cloud environments. Currently, however, there is no unified and easy way for distributed storage systems to fully exploit multiple devices such as CPUs, GPUs, and network devices (i.e., multi-rail support) to perform EC operations in parallel, which leads to under-utilization of the available compute power. In this paper, we first introduce an analytical model to analyze the design scope of efficient EC schemes in distributed storage systems. Guided by this performance model, we propose UMR-EC, a Unified and Multi-Rail Erasure Coding library that can fully exploit heterogeneous EC coders. Our proposed interface is complemented by asynchronous semantics with an optimized metadata-free scheme and EC rate-aware task scheduling, which together enable a highly efficient I/O pipeline. To show the benefits and effectiveness of UMR-EC, we re-design the HDFS 3.x write/read pipelines based on the guidelines derived from the proposed performance model. Our performance evaluations show that our proposed designs outperform the write performance of replication schemes and the default HDFS EC coder by 3.7x - 6.1x and 2.4x - 3.3x, respectively, and improve the performance of reads with failure recoveries by up to 5.1x compared with the default HDFS EC coder. Compared with the fastest available CPU coder (i.e., ISA-L), our proposed designs achieve improvements of up to 66.0% and 19.4% for writes and reads with failure recoveries, respectively.
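The two ideas the abstract leans on, storage efficiency and rate-aware scheduling across heterogeneous coders, can be illustrated with a small sketch. This is not the UMR-EC API or the paper's scheduler. The function names, the coder names, and the throughput figures are all hypothetical, and the greedy "earliest finish time" rule is a stand-in chosen only to show why spreading tasks according to coder rates helps.

```python
from fractions import Fraction


def replication_overhead(replicas: int) -> Fraction:
    """Raw bytes stored per byte of user data with n-way replication."""
    return Fraction(replicas, 1)


def ec_overhead(k: int, m: int) -> Fraction:
    """Raw bytes stored per byte of user data for k data + m parity chunks."""
    return Fraction(k + m, k)


def rate_aware_assign(task_sizes, coder_rates):
    """Greedy sketch: hand each task to the coder that would finish it first.

    task_sizes  -- list of task sizes (hypothetical unit: MB)
    coder_rates -- dict of coder name -> throughput (hypothetical unit: MB/s)
    Returns (plan, makespan): plan maps coder name -> list of task indices.
    """
    finish = {name: 0.0 for name in coder_rates}
    plan = {name: [] for name in coder_rates}
    # Place the largest tasks first; ties keep their original order.
    for index, size in sorted(enumerate(task_sizes), key=lambda t: -t[1]):
        best = min(coder_rates, key=lambda n: finish[n] + size / coder_rates[n])
        finish[best] += size / coder_rates[best]
        plan[best].append(index)
    return plan, max(finish.values())


if __name__ == "__main__":
    # 3-way replication vs. a (6 data, 3 parity) code such as Reed-Solomon.
    print(replication_overhead(3), ec_overhead(6, 3))
    # Two hypothetical coders with assumed throughputs.
    print(rate_aware_assign([6, 6, 6], {"cpu": 2.0, "gpu": 3.0}))
```

With these assumed numbers, three-way replication stores 3 bytes per user byte while a (6, 3) code stores 1.5, and splitting the three tasks across both coders finishes in 4.0 s instead of the 6.0 s needed if the faster coder handled everything alone.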


Cited By

  • (2021) HatRPC. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-14. DOI: 10.1145/3458817.3476191. Online publication date: 14-Nov-2021
  • (2021) F-Write: Fast RDMA-supported Writes in Erasure-coded In-memory Clusters. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 817-826. DOI: 10.1109/IPDPS49936.2021.00091. Online publication date: May-2021
  • (2020) INEC: Fast and Coherent In-Network Erasure Coding. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 1-17. DOI: 10.1109/SC41405.2020.00070. Online publication date: Nov-2020



          Published In

          HPDC '19: Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing
          June 2019
          278 pages
          ISBN:9781450366700
          DOI:10.1145/3307681


          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Author Tags

          1. distributed storage systems
          2. high performance
          3. multi-rail erasure coding

          Qualifiers

          • Research-article


          Conference

          HPDC '19

          Acceptance Rates

          HPDC '19 paper acceptance rate: 22 of 106 submissions, 21%
          Overall acceptance rate: 166 of 966 submissions, 17%
