Abstract
Big Data refers to the massive amounts of structured and unstructured data being produced every day from a wide range of sources. Big Data is difficult to work with and needs a large number of machines to process it, as well as software capable of running in a distributed environment. MapReduce is a recent programming model that simplifies writing distributed programs on distributed systems. For MapReduce to work it needs to divide work amongst computers in a network. Consequently, the performance of MapReduce is dependent on how evenly it distributes the workload. This paper proposes an adaptive sampling mechanism for total order partitioning that can reduce memory consumption whilst partitioning with a trie-based sampling mechanism (ATrie). The performance of the proposed algorithm is compared to a state of the art trie-based partitioning system (ETrie). Experiments show the proposed mechanism is more adaptive and more memory efficient than previous implementations. Since ATrie is adaptive, its performance depended on the type of data used. A performance evaluation of a 2-level ATrie shows it uses 2.43 times less memory for case insensitive email addresses, and uses 1,024 times less memory for birthdates compared to that of a 2-level ETrie. These results show the potential of the proposed method.
Similar content being viewed by others
References
Apache Software Foundation, “Hadoop”, http://hadoop.apache.org/core
Candan, K., Kim, J.W., Nagarkar, P., Nagendra, M., Yu, R.: RanKloud: scalable multimedia data processing in server clusters. IEEE MultiMed. 18, 64–77 (2011)
Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E.: Bigtable: a distributed storage system for structured data. In: 7th USENIX Symposium on Operating Systems Design and Implementation, pp. 205–218 (2006)
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51, 107–113 (2008)
Fadika, Z., Govindaraju, M.: Delma: dynamically elastic mapreduce framework for cpu-intensive applications. In: IEEE/ACM International Symposium on Cluster, Cloud and Grid, Computing, pp.454–463 (2011)
Ghemawat, S., Gobioff, H., Leung, S.-T.: The google file system. In: ACM SIGOPS Operating Systems Review, ACM, pp. 29–43 (2003)
Groot, S., Kitsuregawa, M.: Jumbo: Beyond MapReduce for Workload Balancing. VLDB, Phd Workshop (2010)
Heinz, S., Zobel, J., Williams, H.: Burst tries: a fast, efficient data structure for string keys. Trans. Inf. Syst. (TOIS) 20(12), 192–223 (2002)
Jiang, W., Agrawal, G.: Ex-mate: data intensive computing with large reduction objects and its application to graph mining. In: Cluster, Cloud and Grid Computing (CCGrid): 11th IEEE/ACM International Symposium on, IEEE 2011, pp. 475–484 (2011)
Jin, C., Vecchiola, C., Buyya, R.: MRPGA: an extension of MapReduce for parallelizing genetic algorithms. In: IEEE Fourth International Conference on eScience pp. 214–221 (2008)
Kavulya, S., Tan, J., Gandhi, R., Narasimhan, P.: An analysis of traces from a production mapreduce cluster. In: Cluster, Cloud and Grid Computing (CCGrid), 2010 10th IEEE/ACM International Conference, pp. 94–103 (2010)
Krishnan, A.: GridBLAST: a globus-based high-throughput implementation of BLAST in a Grid computing framework. Concurr. Comput. Pract. Exp. 17(13), 1607–1623 (2005)
Liu, H., Orban, D.: Cloud mapreduce: a mapreduce implementation on top of a cloud operating system. In: 11th IEEE/ACM International Symposium, pp. 464–474 (2011)
Lynden, S., Tanimura, Y., Kojima, I., Matono, A.: Dynamic data redistribution for MapReduce joins. In: IEEE International Conference on Coud Computing Technology and Science, pp. 713–717 (2011)
Matsunaga, A., Tsugawa, M., Fortes, J.: “Programming abstractions for data intensive computing on clouds and grids. In: IEEE Fourth International Conference on eScience, pp. 489–493 (2008)
Miceli, C., Miceli, M., Jha, S., Kaiser, H., Merzky, A.: Programming abstractions for data intensive computing on clouds and grids. In: IEEE/ACM International Symposium on Cluster Computing and the Grid, pp. 480–483 (2009)
O’Malley, O.: TeraByte Sort on Apache Hadoop (2008)
Panda, B., Riedewald, M., Fink, D.: The model-summary problem and a solution for trees. In: Data Engineering, International Conference on Data, Engineering, pp. 452–455 (2010)
Papadimitriou, S., Sun, J.: Disco: distributed co-clustering with map-reduce: a case study towards petabyte-scale end-to-end mining. In: IEEE International Conference on Data Mining, p. 519 (2008)
Shafer, J., Rixner, S., Cox, A.L.: The Hadoop distributed filesystem: balancing portability and performance. In: IEEE International Symposium on Performance Analysis of System and Software(ISPASS), p. 123 (2010)
Slagter, K., Hsu, C.-H., Chung, Y.-C., Zhang, D.: An improved partitioning mechanism for optimizing massive data analysis using MapReduce. J. Supercomput. 66(1), 539–555 (2013)
Stockinger, H., Pagni, M., Cerutti, L., Falquet, L.: Grid approach to embarrassingly parallel CPU-intensive bioinformatics problems. In: IEEE International Conference on e-Science and Grid Computing (2006)
Tan, J., Pan, X., Kavulya, S., Gandhi, R., Narasimhan, P.: Mochi: visual log-analysis based tools for debugging Hadoop. In: USENIX Workshop on Hot Topics in Cloud Computing (HotCloud) (2009)
Vashishtha, H., Smit, M., Stroulia, E.: Moving text analysis tools to the cloud. In: IEEE World Congress on Services, pp. 110–112 (2010)
Verma, A., Llora, X., Goldberg, D.E., Campbell, R.H.: Scaling genetic algorithms using mapreduce. In: Intelligent Systems Design and Applications (2009)
White, T.: “Hadoop the definitive guide 2nd edition”, Published Oreilly (2010)
Xu, W., Huang, L., Fox, A., Patterson, D., Jordan, M.I.: Detecting large-scale system problems by mining console logs. In: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (2009)
Acknowledgments
We would like to thank the various colleagues in the System Software Laboratory at National Tsing Hua University as well as my colleagues at the Department of Computer Science and Information Engineering in Chung Hua University for their support and for their help on earlier drafts of this paper.
Author information
Authors and Affiliations
Corresponding authors
Rights and permissions
About this article
Cite this article
Slagter, K., Hsu, CH. & Chung, YC. An Adaptive and Memory Efficient Sampling Mechanism for Partitioning in MapReduce. Int J Parallel Prog 43, 489–507 (2015). https://doi.org/10.1007/s10766-013-0288-z
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10766-013-0288-z