DOI: 10.5555/3291656.3291660
research-article

Scaling embedded in-situ indexing with deltaFS

Published: 11 November 2018 Publication History

Abstract

Analysis of large-scale simulation output is a core element of scientific inquiry, but analysis queries may experience significant I/O overhead when the data is not structured for efficient retrieval. While in-situ processing allows for improved time-to-insight for many applications, scaling in-situ frameworks to hundreds of thousands of cores can be difficult in practice. DeltaFS in-situ indexing is a new approach for in-situ processing of massive amounts of data to achieve efficient point and small-range queries. This paper describes the challenges and lessons learned when scaling this in-situ processing function to hundreds of thousands of cores. We propose techniques for scalable all-to-all communication that is memory- and bandwidth-efficient, concurrent indexing, and specialized LSM-Tree formats. Combining these techniques allows DeltaFS to control the cost of in-situ processing while maintaining 3 orders of magnitude query speedup when scaling alongside the popular VPIC particle-in-cell code to 131,072 cores.

Published In

SC '18: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis
November 2018
932 pages

Sponsors

In-Cooperation

  • IEEE CS

Publisher

IEEE Press

Conference

SC18
Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
