Scaling embedded in-situ indexing with DeltaFS

Authors: Charles D. Cranor, Gregory R. Ganger, George Amvrosiadis, Garth A. Gibson, Bradley W. Settlemyer, Fan Guo

SC '18: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis

Article No.: 3, Pages 1 - 15

Published: 11 November 2018

Abstract

Analysis of large-scale simulation output is a core element of scientific inquiry, but analysis queries may incur significant I/O overhead when the data is not structured for efficient retrieval. While in-situ processing improves time-to-insight for many applications, scaling in-situ frameworks to hundreds of thousands of cores can be difficult in practice. DeltaFS in-situ indexing is a new approach to in-situ processing of massive amounts of data that enables efficient point and small-range queries. This paper describes the challenges and lessons learned when scaling this in-situ processing function to hundreds of thousands of cores. We propose techniques for memory- and bandwidth-efficient scalable all-to-all communication, concurrent indexing, and specialized LSM-Tree formats. Combining these techniques allows DeltaFS to control the cost of in-situ processing while maintaining a query speedup of three orders of magnitude when scaling alongside the popular VPIC particle-in-cell code to 131,072 cores.

Published in: SC '18 proceedings, November 2018, 932 pages

Sponsor: SIGHPC (ACM Special Interest Group on High Performance Computing)

In cooperation with: IEEE CS

Publisher: IEEE Press

Badges: Artifacts Available

Qualifiers: Research-article

Conference: SC18: The International Conference for High Performance Computing, Networking, Storage and Analysis, November 11 - 16, 2018, Dallas, Texas

Overall acceptance rate: 1,516 of 6,373 submissions, 24%
