Abstract
With the explosion of data, high performance computing (HPC) systems face an urgent demand for data throughput. Data-intensive applications are becoming increasingly common in HPC environments. As data scale grows faster than systems do, it is time to fully utilize resources in every aspect, including computing power, storage capacity, and data throughput. Data preprocessing can no longer be ignored, since it is an important procedure, especially when dealing with large amounts of data. How can data preprocessing be performed efficiently on current HPC systems? How can data-intensive applications make full use of system resources? What should be valued when designing new HPC architectures? All these questions need answers. In this paper, we sketch the procedure of data-intensive applications, which leads to an adaptive resource allocation scheme according to the requirements of the procedure. We analyze the characteristics of preprocessing and design a preprocessing model for data-intensive applications on HPC systems. The model not only fulfills the demand for computing but also meets the need for throughput, through the cooperative work of the storage system and the storage management system. Experiments were conducted on Sunway TaihuLight, one of the world’s fastest supercomputers. The whole preprocessing procedure at petabyte scale can be completed in hours without interfering with other ongoing applications.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Sun, R., Zhang, L., Wang, X. (2018). Cooperative Preprocessing at Petabytes on High Performance Computing System. In: Vaidya, J., Li, J. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2018. Lecture Notes in Computer Science, vol 11335. Springer, Cham. https://doi.org/10.1007/978-3-030-05054-2_16
DOI: https://doi.org/10.1007/978-3-030-05054-2_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-05053-5
Online ISBN: 978-3-030-05054-2
eBook Packages: Computer Science, Computer Science (R0)